
Category: A1; Team name: neervana; Dataset: MoleculeNet

Open dario-loi opened this issue 2 months ago • 7 comments

Checklist

  • [x] My pull request has a clear and explanatory title.
  • [x] My pull request passes the Linting test.
  • [x] I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • [x] My PR follows PEP8 guidelines. (refer to comment below)
  • [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • [x] I linked to issues and PRs that are relevant to this PR.

Description

This PR adds support for the MoleculeNet Datasets, from the paper "MoleculeNet: A Benchmark for Molecular Machine Learning". The result is a wrapper of the original TorchGeometric implementation, together with a comprehensive test suite that ensures the correct loading of all of the datasets (more on this below).

The datasets span different domains, including Quantum Mechanics, Physical Chemistry, Biophysics, and Physiology (see https://moleculenet.org/datasets-1).

Broken SMILES parsing

The SMILES representations in some of the datasets are broken for a very small number of graphs. This problem already exists in the original TorchGeometric implementation (see https://github.com/deepchem/deepchem/issues/2336).

Since our test suite is supposed to verify that graphs load correctly, the tests can fail when the number of loaded graphs falls short of the nominal count, even though this is a flaw of the dataset itself (the same bug occurs in PyG, which reports the nominal number of graphs rather than the actual one). We have therefore manually annotated a min_graphs attribute for each pathological case, so that the tests pass as long as the number of parsing failures does not exceed today's count, guarding against regressions in PyG's behavior.

For better reproducibility (to avoid incomparable results after a possible future fix), something should be done to either repair the broken SMILES strings or settle for this "pruned" version of the dataset (adopting the current num_graphs as the nominal count).
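The guard described above can be sketched roughly as follows (the function name and the example values are illustrative, not the actual TopoBench test code):

```python
# Sketch of a min_graphs regression guard: for datasets with known-broken
# SMILES strings, require only that no *additional* graphs are lost compared
# to the count observed today.

def assert_expected_size(loaded: int, nominal: int, min_graphs: int) -> None:
    """Fail if fewer graphs parsed than the currently observed minimum."""
    if loaded < min_graphs:
        raise AssertionError(
            f"only {loaded} graphs parsed; expected at least {min_graphs}"
        )
    if loaded > nominal:
        raise AssertionError(
            f"{loaded} graphs parsed, more than the nominal {nominal}"
        )

# A dataset with 2 known-broken SMILES out of a nominal 100 still passes:
assert_expected_size(loaded=98, nominal=100, min_graphs=98)
```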

dario-loi avatar Nov 14 '25 23:11 dario-loi

Furthermore, some of the tests load large graphs (MUV, HIV and PCBA, with tens to hundreds of thousands of nodes). I have introduced a "slow" annotation for these tests so that the CI/CD can be updated to run pytest test/data/load/test_moleculenet.py -m "not slow", excluding these tests if needed.

The total runtime (on my machine) of the full test suite is 6 minutes with no dataset cache.
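For illustration, the annotation pattern looks roughly like this (the test name is hypothetical; the actual marked tests live in test/data/load/test_moleculenet.py):

```python
import pytest

# Mark long-running dataset tests so CI can deselect them with
#   pytest -m "not slow"
# (the "slow" marker should also be registered in the pytest configuration
# to silence unknown-marker warnings).

@pytest.mark.slow
def test_pcba_loading():
    ...  # would download and parse PCBA; takes minutes without a cache
```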

dario-loi avatar Nov 15 '25 00:11 dario-loi

Dear Challenge Participants,

As we approach the final deadline, we kindly ask you to verify that all tests are passing on your submission. This will ensure that your contribution is valid and can be reviewed.

levtelyatnikov avatar Nov 17 '25 11:11 levtelyatnikov

I have added rdkit to pyproject.toml as it is a necessary dependency to process the SMILES strings in the MoleculeNet dataset.

dario-loi avatar Nov 17 '25 12:11 dario-loi

Hi @dario-loi! After taking a quick look at the test errors (the shape mismatch), I'd recommend ensuring you have the latest changes from TopoBench main in your local branch. Once synced, probably the best strategy to solve the issue would be for your dataset loader to return the correct shape, as changing the evaluation pipeline can potentially affect existing ones.

Edit: Great, I saw this is just what you did!

gbg141 avatar Nov 17 '25 19:11 gbg141

I should have handled it. Also, there was a minor bug in the topological lifting code that caused a crash, so I added a small on-the-fly dtype cast whenever there is a dtype misalignment; everything should (hopefully) be solved now.
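The on-the-fly cast amounts to something like the following minimal sketch (function name and call site are assumptions, not the actual TopoBench code):

```python
import torch

# Hypothetical illustration of the fix: if a tensor arrives with a dtype
# that does not match the one the lifting expects, cast it before use
# instead of crashing.
def align_dtype(x: torch.Tensor, like: torch.Tensor) -> torch.Tensor:
    """Return x cast to like's dtype (no-op when they already match)."""
    return x if x.dtype == like.dtype else x.to(like.dtype)

features = torch.ones(3, dtype=torch.float64)   # e.g. loaded as float64
reference = torch.zeros(3, dtype=torch.float32) # pipeline expects float32
assert align_dtype(features, reference).dtype == torch.float32
```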

dario-loi avatar Nov 17 '25 19:11 dario-loi

@gbg141 Is the extra test suite I wrote for MoleculeNet worth it, or should we stick to test/data/load/test_datasetloaders.py, which already loads the various sub-datasets from the .yaml? I wouldn't want to introduce redundant tests to the library.

dario-loi avatar Nov 17 '25 19:11 dario-loi

Sorry for the delay in my response, @dario-loi! No worries at all about the extra tests; they certainly won't hurt even if they turn out to be redundant. We will analyze them in more detail once we merge your branch into main. Thanks!

gbg141 avatar Nov 19 '25 22:11 gbg141