MoleculeX
String models exploit biases in the MoleculeNet SMILES dialect to inflate performance
Following up on a conversation with Meng Liu, I wanted to link this bug: https://github.com/deepchem/moleculenet/issues/15. I confirmed it for ClinTox, but it may also affect other datasets.
One set of solutions would be:
- Refactoring input parsing code to be shared across models
- Adding SMILES canonicalization to input parsing: `from rdkit import Chem; Chem.MolToSmiles(Chem.MolFromSmiles(smiles), canonical=True)`
- Re-running string-based models on all benchmarks
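
For illustration, here is a rough sketch of what a shared parsing helper with canonicalization could look like. The names `rdkit_canonicalize` and `parse_smiles_inputs` are hypothetical and not part of MoleculeX; the only RDKit calls used are the two from the snippet above. One thing the one-liner does not handle is that `Chem.MolFromSmiles` returns `None` for strings it cannot parse, so the sketch skips those and reports which rows were kept so labels can be filtered to match.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rdkit_canonicalize(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    # Imported lazily so this module can be loaded without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit signals a parse failure by returning None
        return None
    return Chem.MolToSmiles(mol, canonical=True)


def parse_smiles_inputs(
    smiles_list: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonicalize,
) -> Tuple[List[str], List[int]]:
    """Hypothetical shared entry point for all string-based models.

    Returns the canonicalized strings plus the original indices that were
    kept, so the caller can drop the labels of unparseable rows.
    """
    canonical: List[str] = []
    kept_indices: List[int] = []
    for index, smiles in enumerate(smiles_list):
        result = canonicalize(smiles)
        if result is None:
            continue  # skip strings the canonicalizer could not parse
        canonical.append(result)
        kept_indices.append(index)
    return canonical, kept_indices
```

If every string model read its inputs through one function like this, the dataset-specific SMILES writing style would be normalized before any model sees it. Whether unparseable rows should be dropped or raise an error is a judgment call left open here.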