Category: A1; Team name: neervana; Dataset: MoleculeNet
Checklist
- [x] My pull request has a clear and explanatory title.
- [x] My pull request passes the Linting test.
- [x] I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
- [x] My PR follows PEP8 guidelines. (refer to comment below)
- [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
- [x] I linked to issues and PRs that are relevant to this PR.
Description
This PR adds support for the MoleculeNet datasets from the paper "MoleculeNet: A Benchmark for Molecular Machine Learning". The result is a wrapper around the original PyTorch Geometric implementation, together with a comprehensive test suite that verifies the correct loading of all of the datasets (more on this below).
The datasets span different domains, including Quantum Mechanics, Physical Chemistry, Biophysics, and Physiology (see https://moleculenet.org/datasets-1).
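For context, here is a minimal sketch of what the wrapped loader looks like from the user side, using PyTorch Geometric's own `MoleculeNet` class directly (the root path and dataset name are illustrative):

```python
from torch_geometric.datasets import MoleculeNet

# Load one of the wrapped datasets directly through PyTorch Geometric.
# "ESOL" is one of the Physical Chemistry datasets; the root path is illustrative.
dataset = MoleculeNet(root="data/moleculenet", name="ESOL")

print(len(dataset))  # number of successfully parsed molecular graphs
print(dataset[0])    # a torch_geometric.data.Data object (atoms = nodes, bonds = edges)
```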
Broken SMILES parsing
The SMILES strings provided in some of the datasets are broken for a very small number of graphs. This problem already occurs in the original PyTorch Geometric implementation (see https://github.com/deepchem/deepchem/issues/2336).
Since our test suite is supposed to ascertain the correct loading of graphs, the tests would fail whenever the number of loaded graphs falls short of the nominal one, even though this is a flaw of the dataset itself (the same bug occurs in PyG, which nevertheless reports the nominal number of graphs rather than the actually loaded one). We have therefore manually annotated a `min_graphs` attribute for each of the pathological cases, so that the tests pass as long as the number of failures does not exceed today's count, guarding against regressions in PyG's behavior.
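To illustrate the guard, here is a hedged sketch of the `min_graphs` check and of how RDKit surfaces a broken SMILES; the bounds and test names below are illustrative, not the actual values in the suite:

```python
import pytest
from rdkit import Chem
from torch_geometric.datasets import MoleculeNet

# Hypothetical lower bounds (nominal count minus today's known SMILES failures).
# The real values live in the test suite; these numbers are illustrative only.
MIN_GRAPHS = {"ClinTox": 1470, "Tox21": 7770}

@pytest.mark.parametrize("name,min_graphs", sorted(MIN_GRAPHS.items()))
def test_min_graphs(name, min_graphs):
    dataset = MoleculeNet(root="data/moleculenet", name=name)
    # Loading fewer graphs than today's count would signal a regression
    # in PyG's parsing behavior.
    assert len(dataset) >= min_graphs

def test_rdkit_flags_broken_smiles():
    # RDKit signals an unparsable SMILES by returning None
    # (here, an unclosed ring bond makes the string invalid).
    assert Chem.MolFromSmiles("C1CC") is None
```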
For better reproducibility (to avoid incomparable results after a potential future fix), something should be done to either repair the broken SMILES or settle for this "pruned" version of the dataset (i.e., adopt the current `num_graphs` as the nominal count).
Furthermore, some of the tests load large graphs (MUV, HIV, and PCBA, ranging from tens to hundreds of thousands of nodes). I have introduced a "slow" marker for these tests so that the CI/CD can be updated to run `pytest test/data/load/test_moleculenet.py -m "not slow"`, excluding them if needed.
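For reference, a minimal sketch of how such a marker is applied (the test name is illustrative; the custom marker should also be registered, e.g. under pytest's options in pyproject.toml, to avoid warnings):

```python
import pytest
from torch_geometric.datasets import MoleculeNet

# Marking a heavy test so it can be deselected with `-m "not slow"`.
@pytest.mark.slow
def test_load_pcba():
    dataset = MoleculeNet(root="data/moleculenet", name="PCBA")
    assert len(dataset) > 0
```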
The total runtime (on my machine) of the full test suite is 6 minutes with no dataset cache.
Dear Challenge Participants,
As we approach the final deadline, we kindly ask you to verify that all tests are passing on your submission. This will ensure that your contribution is valid and can be reviewed.
I have added rdkit to pyproject.toml as it is a necessary dependency to process the SMILES strings in the MoleculeNet dataset.
Hi @dario-loi! After taking a quick look at the test errors (the shape mismatch), I'd recommend making sure you have the latest changes from TopoBench main in your local branch. Once synced, the best strategy to solve the issue is probably for your dataset loader to return the correct shape, since changing the evaluation pipeline could affect existing pipelines.
Edit: Great, I saw this is just what you did!
I should have handled it. Also, there was a minor bug in the topological lifting code that caused a crash, so I added a small on-the-fly dtype cast whenever there is a dtype misalignment; everything should (hopefully) be solved now.
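For reference, the on-the-fly cast amounts to something like this minimal sketch (the function name and call site are illustrative, not the exact code in the PR):

```python
import torch

def match_dtype(x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Cast `x` to `ref`'s dtype if they disagree (no-op otherwise)."""
    # A misaligned dtype (e.g. integer features meeting float weights in
    # the lifting) crashes downstream ops, so we align it on the fly.
    return x if x.dtype == ref.dtype else x.to(ref.dtype)

# Hypothetical usage at the point where the crash occurred:
# features = match_dtype(features, weight_matrix)
```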
@gbg141 Is the extra test suite I wrote for MoleculeNet worth it, or should we stick to `test/data/load/test_datasetloaders.py`, which already loads the various sub-datasets from the `.yaml`? I wouldn't want to introduce redundant tests into the library.
Sorry for the delay in my response, @dario-loi! No worries at all about the extra tests; they certainly won't hurt, even if some turn out to be redundant. We will analyze them in more detail once we merge your branch into main. Thanks!