ProDy Decomposing complex when ligand is a standard residue or protein/RNA/DNA containing HETATM.

I want to decompose a complex into protein, nucleic acid and ligands. If I select "protein and not hetatm" for protein "nucleic and not hetatm" for nucleic and "hetatm and (not protein) and (not nucleic)", I face these problems:

the HETATM residues in protein or DNA/RNA are not chosen.
the ligands include HETATM residues in protein or DNA/RNA
the ligands that are standard residues (standard aa or standard (de)nucleotide) are excluded.

For example: 1f7u has protein (chain A), RNA (chain B) and 2 ligands: ARG and SO4 like below:

import prody

struct = prody.parseMMCIF(1f7u, altloc="all")

# the HETATM residues in protein or DNA/RNA are not chosen. 
# RNA seq have some non-standard residues: 1MA, 1MG, 2MG, 5MC, 5MU, H2U, M2G, PSU
len(set(struct.select("nucleic and not hetatm").getResnums()))
# output: 63 (expected: 76)

# the ligands include HETATM residues in protein or DNA/RNA
set(struct.select("hetatm and (not protein) and (not water) and (not nucleic)").getResnames())
# output: {'1MA', '1MG', '2MG', '5MC', '5MU', 'H2U', 'M2G', 'PSU', 'SO4'} 
# expected: not having '1MA', '1MG', '2MG', '5MC', '5MU', 'H2U', 'M2G', 'PSU'

# the ligands that are standard residues (standard aa or standard (de)nucleotide) are excluded
set(struct.select("hetatm and (not protein) and (not water) and (not nucleic)").getResnames())
# output: {'1MA', '1MG', '2MG', '5MC', '5MU', 'H2U', 'M2G', 'PSU', 'SO4'} 
# expected: {'ARG', 'SO4'}

How to address these problems in ProDy?

Mar 26 '25 20:03 hnguyentt

There are a few things going on here:

You probably want to not use the HETATM flag for your selections. For some reason, this mmCIF file has been setup so that the non-standard residues are on HETATM lines instead of ATOM lines. It doesn't say anything about whether a residue belong to a ligand or not.
The selection macros 'protein' and 'nucleic' are for selecting types of residues. They do not relate to whether something is part of a chain. Thankfully, recent mmCIF files split chains and segments based on this so we can use that information. One way is to use the hierarchical view in ProDy. For this case, the not hetatm option works anyway.

list(struct.getHierView())
"""
[<Chain: A from Segment B from 1f7u (76 residues, 1629 atoms)>,
 <Chain: B from Segment A from 1f7u (606 residues, 4892 atoms)>,
 <Chain: C from Segment B from 1f7u (1 residues, 5 atoms)>,
 <Chain: D from Segment A from 1f7u (1 residues, 12 atoms)>,
 <Chain: E from Segment B from 1f7u (240 residues, 240 atoms)>,
 <Chain: F from Segment A from 1f7u (348 residues, 348 atoms)>]
"""

list(struct.select('protein').getHierView())
"""
[<Chain: B from Segment A from 1f7u (606 residues, 4892 atoms)>,
 <Chain: D from Segment A from 1f7u (1 residues, 12 atoms)>]
 """

list(struct.select('protein and not hetatm').getHierView())
#Out: [<Chain: B from Segment A from 1f7u (606 residues, 4892 atoms)>]

nucleics = struct.select('(' + ') or ('.join([ch.getSelstr() for ch in list(struct.getHierView()) if ch.nucleic is not None]) + ')')
#Out: <Selection: '(chain A and (segname B))' from 1f7u (1629 ato#1862 

list(nucleics.getHierView())
#Out: [<Chain: A from Segment B from 1f7u (76 residues, 1629 atoms)>#2048 

set(nucleics.getResnames())
"""Out[10]: 
{'1MA',
 '1MG',
 '2MG',
 '5MC',
 '5MU',
 'A',
 'C',
 'G',
 'H2U',
 'I',
 'M2G',
 'PSU',
 'U'}"""

ProDy currently has a bug so that non-standard residues are only taken into account for protein and not nucleic acid. I think I have a fix for this that should be ready soon.

Mar 26 '25 21:03 jamesmkrieger

Thank James!

I have another question:

ProDy 2.5.0 (dev)

list(struct.getHierView())
"""
[<Chain: A from Segment B from 1f7u (76 residues, 1629 atoms)>,
 <Chain: B from Segment A from 1f7u (606 residues, 4892 atoms)>,
 <Chain: C from Segment B from 1f7u (1 residues, 5 atoms)>,
 <Chain: D from Segment A from 1f7u (1 residues, 12 atoms)>,
 <Chain: E from Segment B from 1f7u (240 residues, 240 atoms)>,
 <Chain: F from Segment A from 1f7u (348 residues, 348 atoms)>]
"""

This means the term "chain" now becomes "segment" and "segment" becomes "chain" in ProDy 2.5.0?

ProDy 2.4.1

list(struct.getHierView())
"""
[<Chain: B from Segment A from 1f7u (76 residues, 1629 atoms)>,
 <Chain: A from Segment B from 1f7u (606 residues, 4892 atoms)>,
 <Chain: B from Segment C from 1f7u (1 residues, 5 atoms)>,
 <Chain: A from Segment D from 1f7u (1 residues, 12 atoms)>,
 <Chain: B from Segment E from 1f7u (240 residues, 240 atoms)>,
 <Chain: A from Segment F from 1f7u (348 residues, 348 atoms)>]
"""

The term "chain" in ProDy 2.4.1 stil remains consistent with the term "chain" in CIF file.

Mar 27 '25 11:03 hnguyentt

Yes, we have noticed this problem and have been meaning to revisit it

Mar 27 '25 13:03 jamesmkrieger

I think the issue you mentioned is solved anyway. It was about clarifying and I updated the help. Finding the right choice is something different.

There is a lot of inconsistencies between mmCIF files. We could even add an argument to swap the two. That's probably a good way to handle it. It's important to match what the biomolecular assembly transformation fields say, which I think is one of the factors that motivated us to swap them around.

Anyhow, this needs careful thought and I'm not up to that right now.

Mar 27 '25 16:03 jamesmkrieger