RTX
Make synonymizer/kg2c builds more robust/transparent
there are a number of relatively small improvements that would go a long way toward making synonymizer (and kg2c) builds run more smoothly:
- [x] get rid of synonymizer test build? (more trouble than it's worth)
- [x] for kg2c test build, don't overwrite files; use distinct names (i.e., '_TEST')
- [x] improve method for generating test kg2pre files during test kg2c build (make sure it doesn't produce orphan edges)
- [x] fix issue where synonymizer build log isn't saved to disk
- [x] check the kg2pre version in the kg2pre TSV files used and throw an error if it doesn't match the requested version (in synonymizer build; already done in kg2c build) (actually was already done in synonymizer build too)
- [x] possibly make user confirm parameters/config settings at very beginning of kg2c build
- [x] possibly tweak things to get rid of the temp config_dbs changes?
- [x] only use SRI NN API (not bulk download)
- [x] use `drug_chemical_conflate=true` flag when querying SRI NN
- [x] verify biolink version in local KG2pre TSVs matches requested
- [ ] add some high level stats to synonymizer.py interface (total num identifiers, clusters, which file is being used..)
- [ ] verify build node in synonymizer matches requested version (unless given a synonymizer override I guess)
- [x] make it easy to run test suite right from synonymizer build directory
- [x] add some basic automated tests to run after synonymizer build
- [ ] add some basic automated tests to run after kg2c build
- [ ] update readme/documentation in light of all these changes
- [ ] write errors to logging before raising error
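for the version-check and "log before raising" items above, here is a minimal sketch of what that could look like. The logger name, the `VersionMismatchError` class, and the `check_version` helper are all hypothetical (not taken from the existing build code), and it assumes the found and requested versions are already available as plain strings:

```python
import logging

# Hypothetical logger name; the real build scripts may configure logging differently.
logger = logging.getLogger("synonymizer_build")


class VersionMismatchError(ValueError):
    """Raised when the version found on disk differs from the requested one."""


def check_version(found: str, requested: str, what: str = "KG2pre") -> None:
    """Log the problem first, then raise, so the message also lands in the build log."""
    if found != requested:
        message = f"{what} version mismatch: found {found!r}, requested {requested!r}"
        logger.error(message)
        raise VersionMismatchError(message)
```

the same helper could cover the biolink check by passing `what="Biolink"`.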
a bit more complex:
- [ ] compare the reports for the current build to those of prior build(s) (on arax-databases.rtx.ai) to flag changes
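one possible shape for that comparison, as a rough sketch only: it assumes each build report can be loaded into a flat dict of numeric stats (the actual report format may differ), and `flag_report_changes` plus the 10% default threshold are made up for illustration:

```python
def flag_report_changes(previous: dict, current: dict, threshold: float = 0.1) -> dict:
    """Return {stat: (old, new)} for stats that appeared, vanished, or moved by more than threshold."""
    flagged = {}
    for key in sorted(set(previous) | set(current)):
        old = previous.get(key)
        new = current.get(key)
        if old is None or new is None:
            # Stat exists in only one of the two reports.
            flagged[key] = (old, new)
            continue
        if old == 0:
            changed = new != 0
        else:
            changed = abs(new - old) / abs(old) > threshold
        if changed:
            flagged[key] = (old, new)
    return flagged
```

with this, a build whose edge count drops from 1000 to 700 would be flagged, while a node count going from 100 to 105 would not.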
(many of these ideas came out of a chat with @sundareswarpullela on May 22, 2024)
another thing to add: when calling the SRI NN API during the synonymizer build, I think we want to be using the drug_chemical_conflate=true flag, which allows, for instance, 'Tylenol' to be clustered with 'Acetaminophen'
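a rough sketch of what such a request might look like, using only the standard library. The endpoint URL and the JSON body field names are assumptions that should be checked against the current SRI Node Normalizer docs; `build_payload` and `get_normalized_nodes` are hypothetical helper names, not existing synonymizer functions:

```python
import json
import urllib.request

# Assumed endpoint; confirm against the Node Normalizer deployment actually used by the build.
NODE_NORM_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"


def build_payload(curies, drug_chemical_conflate=True):
    """Build the POST body, asking the service to conflate drugs with their chemicals."""
    return {"curies": list(curies), "drug_chemical_conflate": drug_chemical_conflate}


def get_normalized_nodes(curies, url=NODE_NORM_URL, timeout=60):
    """POST a batch of CURIEs and return the parsed JSON response."""
    body = json.dumps(build_payload(curies)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

keeping the payload construction in its own function makes it easy to unit test that the flag is always set without hitting the network.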