molecule-generation How to integrate custom atom features?

Hi, may I ask if I can add the atom positions in atom_feature_utils.py? If I include the coordinates, does this mean I should directly sum the features in the node feature vector from the GNN graph? I apologize for the confusion, but I am not sure where the GNN model is located or where the features get embedded. If you could help me locate where the GNN model or the feature embedding process is, I would humbly thank you

Aug 01 '24 21:08 feyhong1112

If you add an additional AtomFeatureExtractor in atom_feature_utils.py, and append it to the list of default featurizers, then that could work, but these featurizers only work on individual atoms, whereas your 3D position featurizer would need to take a look at the entire molecule to determine the conformation first.

Maybe a better approach would be to take a look at featurise_atoms in molecule_dataset_utils.py. As you can see, this function builds a NodeFeatures object by gathering features from the featurizers. At this point, the mol is in scope, so you could plug in your 3D position determination there, and then append the resulting positions to the all_atom_features list. Note that you may want to normalize (at least center) those positions before providing them, as the model would treat them like any other atom features, without any rototranslational invariance/equivariance.

I am not 100% sure everything will work out-of-the-box, but I did check that featurise_atoms is also called during decoding (see moler_decoding_utils.py), so plugging your change in that one place might just be sufficient.

Aug 02 '24 12:08 kmaziarz

Sir, Thank you so much.

Aug 06 '24 21:08 feyhong1112

Sorry to say that but what method you are use to embedded the coordinate into mol? I have complicate error with Converting graph sample for Cc1ccc(B(O)O)cc1Br failed - aborting!Converting graph sample for O=C(O)C(=O)Nc1sc2c(c1C(=O)O)CC[NH2+]C2 failed - aborting!

Aug 22 '24 10:08 feyhong1112

sorry to show my code which not very good structure format. Thank you to always help.

convert 2D or 3D SDF to mol rdkit and use smile to calculate the sa_score, clogp and others
extract the coordinate and emmbed to NodeFeature below here is my code

def featurise_atoms(
    mol: Mol,
    atom_feature_extractors: List[AtomFeatureExtractor],
    motif_vocabulary: Optional[MotifVocabulary] = None,
    motifs: List[MotifAnnotation] = [],
) -> NodeFeatures:
    
    if motif_vocabulary is not None:
        atom_type_feature_extractor = next(
            featuriser
            for featuriser in atom_feature_extractors
            if isinstance(featuriser, AtomTypeFeatureExtractor)
        )

        enclosing_motif_id: Dict[int, int] = {}
        for motif in motifs:
            motif_id = motif_vocabulary.vocabulary[motif.motif_type]

            for atom in motif.atoms:
                enclosing_motif_id[atom.atom_id] = motif_id

        num_motifs = len(motif_vocabulary.vocabulary)

        all_atom_class_ids = []
        num_atom_classes = atom_type_feature_extractor.feature_width + num_motifs
    else:
        assert not motifs

        all_atom_class_ids = None
        num_atom_classes = None

    all_atom_features = []
    for atom_id, atom in enumerate(mol.GetAtoms()):
        atom_symbol = get_atom_symbol(atom)

        atom_features = [
            atom_featuriser.featurise(atom) for atom_featuriser in atom_feature_extractors
        ]

        if motif_vocabulary is not None:
            motif_or_atom_id = enclosing_motif_id.get(
                atom_id, atom_type_feature_extractor.type_name_to_index(atom_symbol) + num_motifs
            )
            assert motif_or_atom_id < num_atom_classes
            all_atom_class_ids.append(motif_or_atom_id)

        # Ensure the molecule has a conformer
        if mol.GetNumConformers() == 0:
            AllChem.EmbedMolecule(mol)  # Generate a conformer if none exists

        # Extract coordinate
        ligand_conformer = mol.GetConformer()
        atom_coordinates = np.array([
            [ligand_conformer.GetAtomPosition(atom_id)[0], ligand_conformer.GetAtomPosition(atom_id)[1],
             ligand_conformer.GetAtomPosition(atom_id)[2]]], dtype=np.float32)
        
        atom_distances = np.sqrt(np.sum(np.square(atom_coordinates.reshape(-1, 1, 3) - atom_coordinates.reshape(1, -1, 3)), axis=-1))
        interaction = np.where(atom_distances < 4.0, 1., 0.)

        mol_interactions = [interaction.reshape(-1)]

        atom_features = np.concatenate(atom_features).astype(np.float32)
        mol_interactions = np.concatenate(mol_interactions).astype(np.float32)

        all_atom_features.append(atom_features)
        all_atom_features.append(mol_interactions)

    return NodeFeatures(
        real_valued_features=all_atom_features,
        categorical_features=all_atom_class_ids,
        num_categorical_classes=num_atom_classes,
    )

Aug 22 '24 11:08 feyhong1112

The all_atom_features list is supposed to hold one feature vector per atom, yet in your changed code you append to it twice when iterating over the atoms. If you want to add extra atom features, you would need to add them to the atom_features list (before the np.concatenate(atom_features).astype(np.float32) is called), and keep the single append to all_atom_features.

Note that this function deals with features for atoms, not atom pairs or bonds. An example atom feature would be atom_coordinates computed in your code. The atom_distances and interaction matrices you construct are for pairs of atoms, which doesn't really fit as atom features. You could start by using atom_coordinates (only the part corresponding to the particular atom that is being considered in the loop), and perhaps also some atom-level statistics of the interaction matrix (e.g. number of other atoms closer than a given distance), and see if then you can get the code to run.

Aug 22 '24 17:08 kmaziarz