String models exploit biases in MoleculeNet SMILES dialect to inflate performance

Open cyrusmaher opened this issue 5 years ago • 0 comments

Following up on a conversation with Meng Liu, I wanted to link this bug. I confirmed it for ClinTox, but it may be present for other datasets: https://github.com/deepchem/moleculenet/issues/15

One set of solutions would be:

  • Refactoring input parsing code to be shared across models
  • Adding SMILES canonicalization to input parsing: from rdkit import Chem; Chem.MolToSmiles(Chem.MolFromSmiles(smiles), canonical=True)
  • Re-running string-based models on all benchmarks
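
A minimal sketch of what a shared canonicalization step could look like, assuming RDKit is installed. The helper names (rdkit_canonical_smiles, canonicalize_dataset) are hypothetical and not part of any existing codebase, and dropping unparsable rows is an assumption, not a confirmed policy:

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rdkit_canonical_smiles(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    # Imported lazily so the module can be loaded without RDKit present.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None for SMILES it cannot parse.
        return None
    return Chem.MolToSmiles(mol, canonical=True)


def canonicalize_dataset(
    smiles_list: Sequence[str],
    labels: Sequence,
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical_smiles,
) -> Tuple[List[str], List]:
    """Canonicalize every SMILES string and keep labels aligned.

    Rows whose SMILES cannot be parsed are dropped (an assumption; a real
    pipeline might prefer to log or raise instead).
    """
    if len(smiles_list) != len(labels):
        raise ValueError("smiles_list and labels must have the same length")

    out_smiles: List[str] = []
    out_labels: List = []
    for smi, label in zip(smiles_list, labels):
        canonical = canonicalize(smi)
        if canonical is None:
            continue
        out_smiles.append(canonical)
        out_labels.append(label)
    return out_smiles, out_labels
```

With the default canonicalizer, two different spellings of the same molecule (for example "OCC" and "C(O)C", both ethanol) should map to the same string, so a string model can no longer use the writing style of the SMILES as a signal for the label.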

cyrusmaher, Dec 16 '20 18:12