Epoprostenol used to treat rats

Open edeutsch opened this issue 1 year ago • 13 comments

I was assigned this issue by TAQA: https://github.com/NCATSTranslator/Feedback/issues/707

Apparently xDTD was trained with KG2-SemMedDB that asserts that Epoprostenol is used to treat rats. And there are lots of papers describing treatment of rats with Epoprostenol. But apparently this is not an appreciated answer.

It is unclear to me whether we just want to remove such SemMedDB edges in KG2

Or whether the xDTD training data can be refined to exclude Drug-treats-X edges where X is a species.

Or whether this problem goes away on its own with the upcoming KG2 "treats" refactor (where I assume we should make an effort to ensure that ideas like "Drug X was used to attempt to treat disease Y in species Z" are NOT encoded as "Drug X treats species Z").

Anyone have ideas on how to handle the TAQA issue?

edeutsch Mar 01 '24 19:03

Bill stated it more elegantly than I did. Do we/can we employ domain and range constraints to avoid this kind of thing: https://github.com/NCATSTranslator/Feedback/issues/707#issuecomment-1974071643

edeutsch Mar 01 '24 23:03
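A minimal sketch of the kind of range check being asked about here. The constraint table is a hypothetical illustration (the category names follow Biolink naming, but the table and the `violates_range` helper are assumptions, not the KG2 implementation):

```python
# Hypothetical range-constraint table: predicate -> object categories it may point to.
# The category names follow Biolink naming; the table itself is an assumption.
ALLOWED_RANGE = {
    "biolink:treats": {"biolink:Disease", "biolink:PhenotypicFeature"},
}


def violates_range(predicate, object_categories):
    """Return True if the predicate has a range constraint and none of the
    object's categories satisfy it."""
    allowed = ALLOWED_RANGE.get(predicate)
    if allowed is None:
        return False  # no constraint known for this predicate
    return allowed.isdisjoint(object_categories)


# "Drug treats species": the object is a taxon, so the edge is flagged.
flag_species = violates_range("biolink:treats", {"biolink:OrganismTaxon"})
# "Drug treats disease": the object category is within range.
flag_disease = violates_range("biolink:treats", {"biolink:Disease"})
print(flag_species, flag_disease)
```

Under these assumptions, an edge such as "Epoprostenol treats rats" would be flagged because its object is an organism taxon rather than a disease or phenotypic feature.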

The KG2 API does actually filter out edges that violate such domain/range specifications, but they're still in the underlying KG2c graph, which xDTD is trained on (I think). Maybe those edges should be excluded from the graph used for training? They're easily identifiable by the domain_range_exclusion property. (There are 3.8 million such edges in KG2c - about 8% of the total edges.)

amykglen Mar 04 '24 16:03

Do we need a fix for this in the Lobster release? Hoping the answer is no, and that we can instead aim to fix this in the Octopus release?

saramsey Mar 11 '24 20:03

I'm not sure that I am informed enough to have an opinion about whether or not we should include edges with domain_range_exclusion set to True (i.e., excluded edges) in the graph used for training xDTD. But it seems like we should (somehow) ensure that ARAX isn't returning results for which the key edge basis is an excluded edge. I'm fine with the idea of adding a filter for this, if that is what people feel is best. @dkoslicki @chunyuma @amykglen what do you think?

saramsey Mar 11 '24 20:03

Hi @edeutsch and @saramsey, I think both solutions (1. use filtered KG to train xDTD; 2. add a filter to the xDTD outputs) work for this issue. However, I will say option 2 will be easier and more flexible considering the long training time of xDTD. For option 1, are we sure that the edges with domain_range_exclusion=True include all edges that we would like to be excluded for training? Or are they just a subset of them? If domain_range_exclusion=True covers all of them, then we can exclude those edges in training.

chunyuma Mar 12 '24 15:03

> @amykglen what do you think?

Adding a filter seems fine to me - and I take back my statement that those edges should be removed from the training dataset specifically, ha - I don't know enough about xDTD to know whether that would make sense. But I agree with Steve that at least the results that ARAXInfer returns shouldn't include domain_range_exclusion=True edges, however it makes sense to achieve that.

> For option 1, are we sure that the edges with domain_range_exclusion=True include all edges that we would like to be excluded for training? Or are they just a subset of them?

I think @saramsey or @sundareswarpullela or @acevedol know more about this than I do, but from what I can tell, only SemMedDB edges are marked as domain_range_exclusion=True (where appropriate). However, I'm guessing that SemMedDB is the main 'problem' source for edges with invalid domain/range anyway, so maybe that is sufficient?

amykglen Mar 12 '24 21:03

@chunyuma since it takes so long to re-train xDTD, what about the following path forward:

  1. Add the filter to the xDTD output
  2. As time permits, update the xDTD training code to exclude such edges. No need to do a full re-build until a new version of KG2 warrants it.

dkoslicki Mar 13 '24 14:03
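Step 1 of the plan above (filtering the xDTD output) could look roughly like the following. This is only a sketch: it assumes each predicted path is available as a list of edge IDs and that the IDs of excluded edges have been collected into a set. Both the data shapes and the `drop_excluded_paths` name are hypothetical and do not reflect the actual xDTD or ARAXInfer code.

```python
def drop_excluded_paths(paths, excluded_edge_ids):
    """Keep only the paths that contain no excluded edge.

    paths: list of paths, each a list of edge IDs (hypothetical shape).
    excluded_edge_ids: set of IDs of edges with domain_range_exclusion set.
    """
    return [
        path
        for path in paths
        if not any(edge_id in excluded_edge_ids for edge_id in path)
    ]


# Made-up edge IDs, for illustration only.
predicted_paths = [["e1", "e2"], ["e3"], ["e4", "e5"]]
excluded = {"e2", "e5"}
filtered_paths = drop_excluded_paths(predicted_paths, excluded)
print(filtered_paths)
```

A post-hoc filter like this leaves the trained model untouched, which is why it avoids the long re-training time mentioned earlier in the thread.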

Sure, I can add a filter to the xDTD output. Where can I find the edge attribute domain_range_exclusion? I can't find it in the edges_c.tsv file of KG v2.8.4.

chunyuma Mar 13 '24 17:03

Huh, that's weird. I see it in my copy of KG2.8.4c:

ubuntu@ip-172-31-48-160:~/plater-plover$ cat edges_c_header.tsv 
subject	object	predicate	primary_knowledge_source	publications:string[]	publications_info	kg2_ids:string[]	qualified_predicate	qualified_object_aspect	qualified_object_direction	domain_range_exclusion	id	:TYPE	:START_ID	:END_ID

Also note that currently the values for domain_range_exclusion are strings ("True" or "False"), though eventually they will be switched to actual booleans (see #2185). So you might want to set up your code to handle either strings or booleans.

amykglen Mar 14 '24 18:03
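The suggestion above, handling domain_range_exclusion as either a string or a boolean, can be sketched as follows. The sketch assumes a tab-separated edges file whose column names come from a separate header line, as in the `edges_c_header.tsv` output shown above. The helper names and the sample rows are made up for illustration.

```python
import csv


def is_excluded(value):
    """Interpret domain_range_exclusion given as a bool or as a 'True'/'False' string."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def kept_rows(header_line, data_lines):
    """Yield row dicts whose domain_range_exclusion flag is not set."""
    fieldnames = header_line.rstrip("\n").split("\t")
    reader = csv.DictReader(
        data_lines, fieldnames=fieldnames, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if not is_excluded(row.get("domain_range_exclusion", "False")):
            yield row


# Reduced, made-up header and rows (the real file has more columns).
header = "subject\tobject\tpredicate\tdomain_range_exclusion\tid\n"
rows = [
    "DRUG:1\tTAXON:1\tbiolink:treats\tTrue\t1\n",
    "DRUG:1\tDISEASE:1\tbiolink:treats\tFalse\t2\n",
]
kept_ids = [row["id"] for row in kept_rows(header, rows)]
print(kept_ids)
```

Routing every check through one helper such as `is_excluded` means the planned switch from strings to booleans would not require changes elsewhere in the filtering code.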

Thanks @amykglen! I will check it again.

chunyuma Mar 14 '24 18:03

Hi team,

I have already updated the xDTD database for KG2.8.4 to exclude all edges with domain_range_exclusion==True. It should now solve this issue. I tested test_ARAX_infer.py but got an error reported in issue #2252.

chunyuma Mar 17 '24 21:03

hey @chunyuma - I just responded in #2252 about the error you're seeing

amykglen Mar 19 '24 01:03

Thanks @amykglen. Now the updated xDTD database has passed the Infer tests. I think we can verify this solution for this issue after deployment.

chunyuma Mar 19 '24 14:03

OK to close?

saramsey Jun 06 '24 20:06

Verified working in CI and Test, so good to close

dkoslicki Jun 12 '24 17:06