Can't seem to make Bingo NoSQL documentation work

Open AgRi0n opened this issue 1 year ago • 1 comments

Situation: I have been trying to use Bingo NoSQL in Python to perform search operations on a few molecules. I start with a simple .csv file containing a set of molecules (as SMILES), each matched with an ID.

Package version: Python 3.10.12, epam.indigo 1.19.0

Bingo.searchSim(): I was following this documentation, and I realised that searchSim() doesn't actually require a query molecule, but a molecule (created with Indigo.loadMolecule()) instead.

AgRi0n avatar Jun 20 '24 08:06 AgRi0n

1. Problem Description:

The Bingo NoSQL API documentation (https://lifescience.opensource.epam.com/bingo/user-manual-nosql.html) contains misleading or outdated instructions regarding the searchSim() method.
The documentation states that the method performs similarity searches using a "query molecule" (e.g., a SMILES string). However, the actual implementation requires a molecule object created through Indigo.loadMolecule().


2. Steps to Reproduce:

  1. Install Python 3.10.12 and the epam.indigo library version 1.19.0:
    pip install epam.indigo==1.19.0
    
  2. Prepare a .csv file with molecules in SMILES format:
    id,smiles
    1,CCO
    2,CCC
    3,CCN
    
  3. Use the Bingo NoSQL API to create a database and insert molecules:
    from indigo import Indigo
    from indigo.bingo import Bingo  # In epam.indigo, Bingo lives in the indigo.bingo subpackage
    
    indigo = Indigo()
    bingo_db = Bingo.createDatabaseFile(indigo, "test.bingo", "molecule")
    
    with open("molecules.csv", "r") as file:
        next(file)  # Skip header
        for line in file:
            mol_id, smiles = line.strip().split(',')
            mol = indigo.loadMolecule(smiles)
            bingo_db.insert(mol, int(mol_id))  # Convert the CSV id from text to an integer
    
  4. Attempt to call the searchSim() method according to the documentation by passing a SMILES string directly without creating a molecule object:
    # Incorrect usage (per documentation)
    query = "CCO"
    matcher = bingo_db.searchSim(query, minSim=0.7, maxSim=1.0)
    
  5. Observe the resulting error:
    TypeError: searchSim() argument should be of type 'IndigoObject', not 'str'
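
For the CSV-reading part of step 3, a slightly more defensive variant can use the standard csv module instead of splitting on commas by hand. This is only a sketch: the helper name read_smiles_rows is hypothetical, and it assumes the file has exactly the id and smiles columns shown in step 2 and that record IDs should be integers.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_smiles_rows(handle: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (id, smiles) pairs from a CSV stream with 'id' and 'smiles' columns."""
    for row in csv.DictReader(handle):
        smiles = (row.get("smiles") or "").strip()
        if not smiles:
            continue  # Skip rows that have no SMILES value
        yield int(row["id"]), smiles


# In-memory sample with the same content as the file from step 2
sample = io.StringIO("id,smiles\n1,CCO\n2,CCC\n3,CCN\n")
rows = list(read_smiles_rows(sample))
print(rows)  # [(1, 'CCO'), (2, 'CCC'), (3, 'CCN')]
```

Each yielded pair can then be loaded with indigo.loadMolecule(smiles) and inserted into the database as in step 3.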
    

3. Expected Behavior:

As stated in the documentation:

  • The searchSim() method should accept a "query molecule" as a string (e.g., SMILES) and perform similarity searches correctly.
  • The API should automatically convert the string into a molecule object (as Indigo.loadMolecule() does) or allow direct use of the string without additional user steps.
  • The method should return a list of IDs of similar molecules (e.g., [1, 2]).
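
Note that the usage example further down this thread iterates a matcher object rather than receiving a list. If a plain list of IDs is what is wanted, it can be collected on the user side. The sketch below assumes only the matcher methods used in that example (next(), getCurrentId(), getCurrentSimilarityValue(), close()); the helper names and the FakeMatcher stand-in are hypothetical and exist only so the snippet runs without a database.

```python
from typing import Iterator, List, Tuple


def iter_matches(matcher) -> Iterator[Tuple[int, float]]:
    """Yield (id, similarity) pairs and always close the matcher afterwards."""
    try:
        while matcher.next():
            yield matcher.getCurrentId(), matcher.getCurrentSimilarityValue()
    finally:
        matcher.close()


def matching_ids(matcher) -> List[int]:
    """Collect only the record IDs from a matcher."""
    return [record_id for record_id, _ in iter_matches(matcher)]


class FakeMatcher:
    """Hypothetical stand-in that mimics the matcher interface for this demo."""

    def __init__(self, hits):
        self._hits = list(hits)
        self._pos = -1
        self.closed = False

    def next(self):
        self._pos += 1
        return self._pos < len(self._hits)

    def getCurrentId(self):
        return self._hits[self._pos][0]

    def getCurrentSimilarityValue(self):
        return self._hits[self._pos][1]

    def close(self):
        self.closed = True


fake = FakeMatcher([(1, 1.0), (3, 0.75)])
ids = matching_ids(fake)
print(ids)  # [1, 3]
```

With a real database, the object returned by searchSim() would be passed in place of FakeMatcher.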

4. Actual Behavior:

  • The searchSim() method requires a molecule object, created using Indigo.loadMolecule().
  • Attempting to pass a string results in the following error:
    TypeError: searchSim() argument should be of type 'IndigoObject', not 'str'
    
  • The documentation provides incorrect usage examples, which confuses users.
  • The error occurs 100% of the time when the method is used according to the documentation without creating a molecule.

5. Analysis of the Problem:

  1. Root Cause:

    • The documentation is inaccurate or outdated. The method requires a molecule object instead of a string, which isn’t specified clearly in the user manual.
    • The API doesn’t support automatic conversion of string data (e.g., SMILES) into a molecule object, forcing users to search for alternate solutions.
  2. Affected Modules:

    • Documentation: Contains outdated or misleading information about input parameters for the searchSim() method.
    • Bingo NoSQL API: Lacks mechanisms for input type validation or conversion of strings into molecule objects.
  3. Lifescience Context:

    • The bug could limit users' ability to utilize the library for molecular data analysis. This diminishes trust in the library, especially in chemical and biological fields where accuracy is critical.

6. Suggested Solutions:

High-Level Solution:

  • Update Documentation:
    • Rewrite the section describing the searchSim() method in the user manual, clearly specifying the need to create query molecules using Indigo.loadMolecule().
    • Provide accurate code examples for the method’s usage with the correct parameters.

Technical Solution:

  1. API Modification:
    Enable automatic conversion of SMILES strings into molecule objects within the searchSim() method:

    def searchSim(self, query, minSim, maxSim, sim_type="tanimoto"):
        if isinstance(query, str):
            query = self.indigo.loadMolecule(query)  # Automatic conversion
        elif not isinstance(query, IndigoObject):
            raise TypeError("searchSim() argument must be IndigoObject or SMILES string")
        # _searchSimInternal is a placeholder name for the existing search logic
        return self._searchSimInternal(query, minSim, maxSim, sim_type)
    
  2. Documentation Enhancement:
    Update the user manual with examples like:

    from indigo import Indigo
    from indigo.bingo import Bingo  # In epam.indigo, Bingo lives in the indigo.bingo subpackage
    
    indigo = Indigo()
    bingo_db = Bingo.loadDatabaseFile(indigo, "test.bingo")
    
    query_smiles = "CCO"  # SMILES string
    query_mol = indigo.loadMolecule(query_smiles)  # Create a molecule object
    matcher = bingo_db.searchSim(query_mol, minSim=0.7, maxSim=1.0)
    
    while matcher.next():
        print(f"ID: {matcher.getCurrentId()}, Similarity: {matcher.getCurrentSimilarityValue()}")
    matcher.close()
    bingo_db.close()  # Release the database once searching is done
    
  3. Error Handling Enhancements:
    Add comprehensible error messages for invalid input:

    searchSim() expected an Indigo Molecule object. Use Indigo.loadMolecule() to create one from SMILES.
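
Until something like this exists in the library, the conversion and the error message above can be approximated on the user side. The sketch below is an assumption-laden illustration, not the library's behaviour: to_molecule and its molecule_type parameter are hypothetical, and the Fake classes only stand in for Indigo so the snippet runs without the package installed. The only Indigo call it relies on is loadMolecule().

```python
def to_molecule(indigo, query, molecule_type):
    """Return a molecule object, converting a SMILES string when needed.

    molecule_type is the class that counts as an already-loaded molecule
    (hypothetical parameter; with Indigo this would be its object class).
    """
    if isinstance(query, molecule_type):
        return query  # Already a molecule object, pass it through
    if isinstance(query, str):
        return indigo.loadMolecule(query)  # Convert SMILES text
    raise TypeError(
        "searchSim() expected an Indigo Molecule object. "
        "Use Indigo.loadMolecule() to create one from SMILES."
    )


class FakeMolecule:
    """Hypothetical stand-in for a loaded molecule."""

    def __init__(self, smiles):
        self.smiles = smiles


class FakeIndigo:
    """Hypothetical stand-in exposing only loadMolecule()."""

    def loadMolecule(self, smiles):
        return FakeMolecule(smiles)


fake_indigo = FakeIndigo()
converted = to_molecule(fake_indigo, "CCO", FakeMolecule)
existing = FakeMolecule("CCC")
passed_through = to_molecule(fake_indigo, existing, FakeMolecule)

try:
    to_molecule(fake_indigo, 42, FakeMolecule)
    rejected = False
except TypeError:
    rejected = True
```

The result of to_molecule() would then be handed to searchSim() in place of the raw query.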
    

mobilisf avatar Jun 09 '25 12:06 mobilisf