Make synonymizer/kg2c builds more robust/transparent

Open · amykglen opened this issue 1 year ago • 1 comment

there are a number of relatively small improvements that would go a long way toward making synonymizer (and kg2c) builds run more smoothly:

  • [x] get rid of synonymizer test build? (more trouble than it's worth)
  • [x] for kg2c test build, don't overwrite files; use distinct names (e.g., containing '_TEST')
  • [x] improve method for generating test kg2pre files during test kg2c build (make sure it doesn't produce orphan edges)
  • [x] fix issue where synonymizer build log isn't saved to disk
  • [x] check the kg2pre version in the kg2pre TSV files used and throw an error if it doesn't match the requested version (in synonymizer build; already done in kg2c build) (actually, this was already done in synonymizer build too)
  • [x] possibly make user confirm parameters/config settings at very beginning of kg2c build
  • [x] possibly tweak things to get rid of the temp config_dbs changes?
  • [x] only use SRI NN API (not bulk download)
  • [x] use drug_chemical_conflate=true flag when querying SRI NN
  • [x] verify biolink version in local KG2pre TSVs matches requested
  • [ ] add some high-level stats to synonymizer.py interface (total number of identifiers, number of clusters, which file is being used, etc.)
  • [ ] verify build node in synonymizer matches requested version (unless given a synonymizer override I guess)
  • [x] make it easy to run the test suite right from the synonymizer build directory
  • [x] add some basic automated tests to run after synonymizer build
  • [ ] add some basic automated tests to run after kg2c build
  • [ ] update readme/documentation in light of all these changes
  • [ ] write errors to the log before raising them
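
for the "write errors to the log before raising" and version-check items, a rough sketch of one way this could look. everything here is hypothetical (the logger name `kg2c_build`, the helpers `log_and_raise` and `check_version`); it is not the existing build code, just an illustration of the pattern:

```python
import logging

# Hypothetical logger name; the real build scripts may configure logging differently.
logger = logging.getLogger("kg2c_build")


def log_and_raise(message: str, exc_type: type = ValueError) -> None:
    """Write the error to the build log first, then raise it."""
    logger.error(message)
    raise exc_type(message)


def check_version(found: str, requested: str, label: str) -> None:
    """Fail loudly if the version found in local files differs from the requested one."""
    if found != requested:
        log_and_raise(
            f"{label} version mismatch: found {found!r}, requested {requested!r}"
        )
```

with something like this, a mismatch ends up in the log file even if the exception is later swallowed or the traceback scrolls off the terminal.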

a bit more complex:

  • [ ] compare the reports for the current build to those of prior build(s) (on arax-databases.rtx.ai) to flag changes
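
a minimal sketch of what the report comparison might do, assuming each report can be reduced to a flat dict of numeric stats (the function name `flag_changes`, the 10% threshold, and the stat names in the usage below are all made up for illustration; fetching the prior reports from arax-databases.rtx.ai is not shown):

```python
def flag_changes(current: dict, prior: dict, threshold: float = 0.10) -> dict:
    """Return {stat_name: (prior_value, current_value)} for stats that changed notably.

    A stat is flagged if it is missing from one report, or if its relative
    change versus the prior build exceeds `threshold`.
    """
    flagged = {}
    for key in sorted(set(current) | set(prior)):
        new, old = current.get(key), prior.get(key)
        if new is None or old is None:
            # Stat appeared or disappeared between builds
            flagged[key] = (old, new)
        elif old == 0:
            # Avoid dividing by zero; any move away from zero is flagged
            if new != 0:
                flagged[key] = (old, new)
        elif abs(new - old) / abs(old) > threshold:
            flagged[key] = (old, new)
    return flagged
```

usage would be along the lines of `flag_changes({"nodes": 100, "edges": 500}, {"nodes": 100, "edges": 400})`, which flags only `edges` since it changed by 25%.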

(many of these ideas came out of a chat with @sundareswarpullela on May 22, 2024)

amykglen avatar May 22 '24 18:05 amykglen

another thing to add: when calling the SRI NN API during the synonymizer build, I think we want to be using the drug_chemical_conflate=true flag, which allows, for instance, 'Tylenol' to be clustered with 'Acetaminophen'
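
a rough sketch of how the request bodies could be built with that flag set. the endpoint URL, the JSON body shape, the batch size, and the helper name `build_nn_payloads` are assumptions on my part and should be checked against the current SRI Node Normalizer docs; only the `drug_chemical_conflate` flag itself comes from the comment above:

```python
from typing import Iterator, List

# Assumed endpoint; verify against the current Node Normalizer deployment.
SRI_NN_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"


def build_nn_payloads(curies: List[str], batch_size: int = 1000) -> Iterator[dict]:
    """Yield one JSON-serializable request body per batch of CURIEs,
    with drug/chemical conflation switched on."""
    for start in range(0, len(curies), batch_size):
        yield {
            "curies": curies[start:start + batch_size],
            "drug_chemical_conflate": True,
        }
```

each payload could then be sent with something like `requests.post(SRI_NN_URL, json=payload)`; the network call is left out here so the sketch stays self-contained.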

amykglen avatar Jul 11 '24 17:07 amykglen