String models exploit biases in MoleculeNet SMILES dialect to inflate performance

Open cyrusmaher opened this issue 5 years ago • 0 comments

Following up on a conversation with Meng Liu, I wanted to link this bug. I confirmed it for ClinTox, but it may be present for other datasets: https://github.com/deepchem/moleculenet/issues/15

One set of solutions would be:

- Refactoring the input parsing code so it is shared across models
- Adding SMILES canonicalization to input parsing: `from rdkit import Chem; Chem.MolToSmiles(Chem.MolFromSmiles(smiles), canonical=True)`
- Re-running the string-based models on all benchmarks
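
To make the first two points concrete, here is a rough sketch of what a shared canonicalization step could look like. The function names `rdkit_canonicalize` and `canonicalize_all` are hypothetical and not existing MoleculeNet or DeepChem code. The only RDKit calls used are `Chem.MolFromSmiles` and `Chem.MolToSmiles`. Returning `None` for strings RDKit cannot parse is my assumption about how failures should be handled, since `MolFromSmiles` returns `None` in that case.

```python
from typing import Callable, Iterable, List, Optional


def rdkit_canonicalize(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    # Imported lazily so that canonicalize_all has no hard RDKit dependency.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)  # None for unparsable input
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=True)


def canonicalize_all(
    smiles_list: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonicalize,
) -> List[Optional[str]]:
    """Apply one canonicalizer to every input so all models see the same dialect.

    Hypothetical shared entry point: each model's loader would call this
    instead of parsing SMILES on its own.
    """
    return [canonicalize(s) for s in smiles_list]
```

With RDKit installed, different spellings of the same molecule, such as `"OCC"` and `"C(O)C"`, should come back as one identical string, so a string model can no longer key on how a particular dataset happened to write its SMILES. The canonicalizer is passed in as a parameter only so the step can be swapped out or tested without RDKit.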

Dec 16 '20 18:12 cyrusmaher