RTX icon indicating copy to clipboard operation
RTX copied to clipboard

PathFinder TRAPI request to ARAXi request conversion.

Open mohsenht opened this issue 1 year ago • 25 comments

mohsenht avatar Sep 09 '24 13:09 mohsenht

Is there a Translator-wide agreement on what the TRAPI request for Pathfinder should look like?

edeutsch avatar Sep 09 '24 14:09 edeutsch

Yes, you can find them here (https://docs.google.com/spreadsheets/d/1fxEUUOFP3X37KbLRrf7cSxkag1-kFaJN/edit?gid=1669218812#gid=1669218812)

mohsenht avatar Sep 09 '24 14:09 mohsenht

Great, thanks. I'll think about how we could hook that up.

One concern/observation. I was recently manually doing some connect-the-dots queries, and will write my observation here:

I was to connect indomethacin and DHFR If I do a one-hop direct query, I get one result with a direct link from CTD https://arax.ci.transltr.io/?r=292446

If I do a two-hop query, I get 4 results including cataract, angiodema, CRC, and DILI: https://arax.ci.transltr.io/?r=292449 I'm uncertain if these are great answers, but come from translator results

Then in if I do the new Pathfinder query using ARAXi, I get 97 results: https://arax.ci.transltr.io/?r=292582

One concern about these is that as far as I can tell none of the 97 include what is found via direct one hop and two hop queries.

There's also no ranking yet.

edeutsch avatar Sep 09 '24 16:09 edeutsch

Is it related to this issue?

mohsenht avatar Sep 09 '24 17:09 mohsenht

Pathfinder is not completed yet, it is just the first version of it.

mohsenht avatar Sep 09 '24 17:09 mohsenht

Maybe not directly related, but it is partially related in that I am trying to understand the context of the whole issue and what is currently working and not working.

My understanding is the the current Translator-agreed query_graph for this kind of Pathfinder query is this:

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": [
            "CHEBI:49662"
          ]
        },
        "un": {
          "categories": [
            "biolink:NamedThing"
          ]
        },
        "n2": {
          "ids": [
            "NCBIGene:1719"
          ]
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}

As written, this returns zero results. I'm a bit perplexed why it returns zero results because there are both 1 hop and 2 hop results.

But in any case, what we want is to to short-circuit this format of query into custom Patherfinder ARAXi. OR potentially do 3 queries:

  • 1 hop
  • 2 hop
  • Pathfinder ARAXi
  • and combine?

edeutsch avatar Sep 09 '24 18:09 edeutsch

Documenting the current state of musings...

  1. The incoming query_graph will consist of two pinned nodes with a related_to edge between them with KT=inferred PLUS another unpinned node with related_to inferred edges between it and the two pinned nodes. That seems decided by the consortium.
  2. The query_graph_interpreter should be updated to recognize this pattern and inject option_group="directLookup" to the first edge and then option_group="Pathfinder" to the second two edges.
  3. The query will then proceed with the default ARAXi and Expand as it normally would, expanding all edges in the conventional manner.
  4. Expand and Resultify will be modified so that they no longer require one non-optional edge path. All edge paths may be optional. Each Result must still satisfy all non-optional edges if any, but if all edges are optional, then each Result must satisfy all edges of at least one option_group. Tricky.
  5. Expand will be modified so that if it encounters an option_group=Pathfinder, it will ALSO send all edges in that group and attached nodes to the special Pathfinder algorithm
  6. The Pathfinder algorithm must express its answer in terms of the 3 nodes and 2 edges given to it. Can that be done? What if there are 3 hops? Does the QG need an optional self edge on the unpinned node??

Not done musing yet...

edeutsch avatar Sep 09 '24 21:09 edeutsch

@mohsenht - here's some example code for calling Expand - in this example I'm assuming we want to do pathfinder between CHEBI:49662 and NCBIGene:1719:

    from ARAX_messenger import ARAXMessenger
    from ARAX_expander import ARAXExpander
    from ARAX_response import ARAXResponse
    sys.path.append(os.path.dirname(os.path.abspath(__file__))+"/../../UI/OpenAPI/python-flask-server/")
    from openapi_server.models.q_edge import QEdge
    from openapi_server.models.q_node import QNode

    # First create a message with the one-hop query graph
    response_1 = ARAXResponse()
    ARAXMessenger().create_envelope(response_1)  # Creates an empty envelope
    message_1 = response_1.envelope.message
    message_1.query_graph.nodes = {"n0": QNode(ids=["CHEBI:49662"]),
                                   "n1": QNode(ids=["NCBIGene:1719"])}
    message_1.query_graph.edges = {"e0": QEdge(subject="n0",
                                               object="n1",
                                               predicates=["biolink:related_to"])}
    # Then use Expand to answer the one-hop query
    expander = ARAXExpander()
    expander.apply(response_1, input_parameters={"kp_timeout": 10})
    # You can then find Expand's answers in the knowledge graph
    print(response_1.envelope.message.knowledge_graph)

    # Then we do the same thing but for a two-hop query
    response_2 = ARAXResponse()
    ARAXMessenger().create_envelope(response_2)
    message_2 = response_2.envelope.message
    message_2.query_graph.nodes = {"n0": QNode(ids=["CHEBI:49662"]),
                                   "nx": QNode(categories=["biolink:NamedThing"]),
                                   "n1": QNode(ids=["NCBIGene:1719"])}
    message_2.query_graph.edges = {"e0": QEdge(subject="n0",
                                               object="nx",
                                               predicates=["biolink:related_to"]),
                                   "e1": QEdge(subject="nx",
                                               object="n1",
                                               predicates=["biolink:related_to"])}
    # Then use Expand to answer the two-hop query
    expander.apply(response_2, input_parameters={"kp_timeout": 30})
    # You can then find Expand's answers in the knowledge graph
    print(response_2.envelope.message.knowledge_graph)

    # Note that the qnode that nodes fulfill is captured in the `qnode_keys` property - for example:
    node = response_2.envelope.message.knowledge_graph.nodes['CHEBI:49662']
    print(f"Node CHEBI:49662 fulfills qnode(s): {node.qnode_keys}")
    # And same with qedge_keys - example:
    edge = next(edge for edge in response_2.envelope.message.knowledge_graph.edges.values())
    print(f"Example edge fulfills qedge(s): {edge.qedge_keys}")

note that you can refine the kp_timeout parameter to whatever works best - this is basically the number of seconds that Expand will wait to get a response back from KPs' APIs (for each qedge). if you don't specify a timeout, it will default to two minutes, which I imagine is far too long.

let me know if you have any questions!

amykglen avatar Sep 18 '24 19:09 amykglen

Hi @edeutsch,

Is it possible to response with two different query graph in ARAX? One for direct one hop result and the other for the triangle shape results?

mohsenht avatar Sep 20 '24 16:09 mohsenht

Hi @mohsenht I may not quite understand the question, but there can only be one query graph coming into ARAX, and the same single query_graph should be repeated in the Response. We used to adjust the query_graph years ago but I think it was decided Translator-wide to stop this behavior.

edeutsch avatar Sep 20 '24 16:09 edeutsch

@edeutsch My question is: if we have only one query graph, is it possible for one result to not follow the structure of the query graph (in this case, directly connecting two nodes instead of forming a triangle shape)?

mohsenht avatar Sep 20 '24 16:09 mohsenht

@dkoslicki In the All Hands Meeting, we agreed to include one-hop and two-hop results from the expander into the PathFinder results. In this process, we'll first use the expander to generate the one-hop and two-hop results, then feed those nodes into our PathFinder algorithm to further expand and improve the results.

mohsenht avatar Sep 20 '24 16:09 mohsenht

@edeutsch My question is: if we have only one query graph, is it possible for one result to not follow the structure of the query graph (in this case, directly connecting two nodes instead of forming a triangle shape)?

I think we should try it. As far as I'm aware it does not violate the schema. But it does violate some policies that we have set forth in the past before we got here. As mentioned in my musings post above, we do have an option_group mechanism that we could try to use.

so I suggest: let's try it. There's a chance that we end up doing something different.

edeutsch avatar Sep 20 '24 16:09 edeutsch

Here is a sample DSL query:

create_message add_qnode(ids=PUBCHEM.COMPOUND:445154, key=n00) add_qnode(ids=NCBIGene:2739, key=n01) connect(action=connect_nodes, max_path_length=4) return(message=true, store=false)

mohsenht avatar Oct 09 '24 17:10 mohsenht

Here is a sample DSL query for PathFinder with Node Constraint:

create_message add_qnode(ids=PUBCHEM.COMPOUND:445154, key=n00) add_qnode(ids=NCBIGene:2739, key=n01) connect(action=connect_nodes, max_path_length=4, node_category_constraint=biolink:NamedThing) return(message=true, store=false)

mohsenht avatar Oct 09 '24 19:10 mohsenht

Hi everyone, I have committed some adjustments to the QueryGraphInterpreter that I hope will do what we want. I opted for a very light touch at first to see how it goes. Please request more adjustments if you want them.

Here's what I did:

  • If the QGI detects that any edge has knowledge_type="pathfinder", it immediately triggers pathfinder connect()
  • If the QGI detects that two or more edges have knowledge_type="inferred", it immediately triggers pathfinder connect()

I'm not certain this is really what we want, but seemed good. Notably, it is not restricted to the "triangle QG", but rather allows substantially more flexibility in recognizing a potential pathfinder query. I am hoping that this allows experimenting with some of the fancier pathfinder scenarios that were mentioned, beyond a single triangle.

I'm hoping that this will not trigger on any existing non-pathfinder queries. Clearly if the querier really uses knowledge_type="pathfinder", then they really mean it. And I don't think any normal production queries use two or more edges with knowledge_type="inferred", so that catches the "triangle QG" but also allows some additional experimentation.

Notable, also: perhaps against what we may have decided, it does NOT change the QG at all. It does not change the one-hop from "inferred" to "lookup" like we discussed. I decided that connect() can do that as easily as QGI can, so I'll leave that to connect(), giving it a bit more flexibility? Happy to change if you disagree.

Finally, the entire ARAXi program that is generated is:

connect(action=connect_nodes, max_path_length=4)

The expectation is that this runs on the existing query graph, whatever it is. Note that it may be more complicated than the default "triangle QG". But I assume that is a desirable feature, although may require some more effort.

This is merged into master but only minimally tested.

What do you think?

edeutsch avatar Oct 09 '24 21:10 edeutsch

So I tried a full test in devED: https://arax.ncats.io/devED/ pasting in the above query graph, and I see this:

...
2024-10-09T21:51:41.549822 INFO: QueryGraphInterpreter recognized query_graph as a 'pathfinder' query: triggering pathfinder subsystem.
...
2024-10-09T21:51:41.579091 INFO: Parsing input actions list
2024-10-09T21:51:41.579120 DEBUG: Parsing action: connect(action=connect_nodes, max_path_length=4)
2024-10-09T21:51:41.708016 INFO: Processing action 'connect' with parameters {'action': 'connect_nodes', 'max_path_length': '4'}
2024-10-09T21:51:41.708110 ERROR: Connect works with just two qnodes. qnode list size: 3
2024-10-09T21:51:41.708149 DEBUG: Applying Connect to Message with parameters {'action': 'connect_nodes', 'max_path_length': 4, 'qnode_keys': ['n2', 'un', 'n0'], 'node_category_constraint': ''}

and then halt.

I assume this is more or less what we want and it is up to connect() to be adjusted by @mohsenht to handle the "triangle QG" from here?

edeutsch avatar Oct 09 '24 21:10 edeutsch

@edeutsch

Thank you Eric for your help with this conversion issue.

mohsenht avatar Oct 10 '24 14:10 mohsenht

I think, we made a decision for TRAPI and the structure is almost certain.

For ARAX DSL, would it be better to get just parameters (two pinned nodes and on node constraint that every path should have at least one), or the user need to build that TRAPI structure query graph with DSL code.

Something like the below this DSL code:

create_message
add_qnode(ids=MONDO:0004975, key=pinned_node_1)
add_qnode(key=middle_node,categories=biolink:Gene)
add_qnode(ids=UMLS:C1318700, key=pinned_node_2)
add_qedge(key=direct_edge,subject=pinned_node_1,object=pinned_node_2,predicates=biolink:related_to)
add_qedge(key=pinned_node_1_to_middle_node_edge,subject=pinned_node_1,object=middle_node,predicates=biolink:related_to)
add_qedge(key=middle_node_to_pinned_node_2_edge,subject=middle_node,object=pinned_node_2,predicates=biolink:related_to)
connect(action=connect_nodes, max_path_length=4)
return(message=true, store=false)

mohsenht avatar Oct 11 '24 13:10 mohsenht

I don't think I really understand the question. I think it is a given that queries will come in via the query_graph like in the middle of this thread, e.g.:

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": [
            "CHEBI:49662"
          ]
        },
        "un": {
          "categories": [
            "biolink:NamedThing"
          ]
        },
        "n2": {
          "ids": [
            "NCBIGene:1719"
          ]
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}

We already have that in hand. So the DSL does not need to create a message and add qnodes and qedges; they're already all there.

I think all that connect() needs to do is work with the query graph as given and create a knowledge graph and results that map back to the query graph, using its own pathfinding algorithm as well as Expand() as appropriate.

So I think the only ARAXi necessary is:

connect(action=connect_nodes, max_path_length=4)

and that is implemented and apparently working.

edeutsch avatar Oct 11 '24 19:10 edeutsch

So here's my thoughts on what connect() should be doing in this scenario:

  • Examine the input query_graph as given to it to identify what are the two pinned nodes. If there are fewer than two, it is an error. If there are more than two, this is tricky. I would error out for now, although in principle this could be a valid future query.
  • If the two pinned nodes are connected by a direct edge, then send it to Expand() to satisfy and that becomes Result 1 bound simply to the two pinned nodes and the one direct edge
  • Then, determine if there is just one intermediate node and two inferred edges. In that scenario: - You create a little mini-query graph leaving out the direct edge and removing the KT=inferred, and send that to Expand() - you also run your pathfinder algorithm - If the intermediate node is bound to a specific category, then find all the paths that go through that category, and organize the Results so that each entity of that type is one Results, and the two edges get to hold support graphs that hook them up. - if it is not bound, then somehow you need to pick the most interesting nodes and bind them to the middle query node and organize the Results that way
  • Finally, the results from the 3 strategies need to be merged: the direct Expand() (always Result 1), the 2-hop Expand(), and the pathfinder results

Now if the query graph is more complicated than just the triangle with all three edges being inferred, then some additional strategy needs to be invoked. But I would suggest putting that on the back burner for now until you can get the above working.

edeutsch avatar Oct 11 '24 19:10 edeutsch

I don't think I really understand the question. I think it is a given that queries will come in via the query_graph like in the middle of this thread, e.g.:

@edeutsch In talking with @mohsenht , I think his intention about the ARAXi question is more along the lines of "If someone were not doing a query through the ARS, but wanted to do it through the ARAX UI itself and build a DSL, what should the syntax be? Put parameters in the DSL constructed QG, or put it in the connect command itself?" With the conclusion being "put parameters in the DSL constructed QG (wrt intermediate node constraints and the like" in order to make the paradigm consistent: be it from ARAXi or from an ARS TRAPI message, we'll just call connect() and use the associated query graphs to determine constraints and the like (be they from the TRAPI QG or the DSL QG).

dkoslicki avatar Oct 11 '24 20:10 dkoslicki

AH, I get it now, sorry.

Yes, I think using the DSL to build a query graph that mirrors what would come from the ARS seems easiest. One snag could be that the edges may need to have knowledge_type=inferred tags, which I think the current DSL does not support maybe (but easily added).

In the short term, this may not be needed, but eventually if we want connect() to treat NON knowledge_type=inferred edges in a special way (only send to Expand, not subject to pathfinding) then we may need to add that distinction.

edeutsch avatar Oct 11 '24 21:10 edeutsch

Hi @edeutsch,

I apologize for the confusion my comment caused and for taking up your time. Occasionally, I leave comments as personal notes to track my thoughts and things I need to discuss with @dkoslicki.

Going forward, I'll label such comments as "Personal Note" to avoid taking up the time of the ARAX team.

mohsenht avatar Oct 11 '24 21:10 mohsenht

no problem. Your note was actually clear enough in the first place. I just misread it/misunderstood it.

edeutsch avatar Oct 11 '24 21:10 edeutsch