iodata icon indicating copy to clipboard operation
iodata copied to clipboard

PDB load_one issue with atom type CL

Open BradenDKelly opened this issue 4 years ago • 5 comments

When reading in a dimer of Indomethacin, whose formula is C19H16ClNO4, the chlorine is interpreted as a carbon. The PDB being read in was generated using GROMACS and is at the bottom of the message. Atom 15 is chlorine.

When I load this pdb using iodata and output the atom numbers I get the following

>>>from iodata import load_one
>>>mol = load_one("indo.pdb")
>>> print(mol.atnums)
[6 6 6 6 6 7 6 8 6 6 6 6 6 6 6 6 6 6 6 8 6 6 6 8 8 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 6 6 6 6 6 7 6 8 6 6 6 6 6 6 6 6 6 6 6 8 6 6 6 8 8 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1]

So we can see that atom 15 (and atom 56) has been read in as a carbon atom. I am not sure I blame load_one though, since the CL has a capital L. This is due to PSI4 outputting the optimized monomer geometry in a previous step with atom types all capitals, it then ripples down the workflow.

It is probably easy for us to parse the atomtypes in such a way as to detect CL as Cl rather than C as a favor to users, even though the fault I would say doesn't lie with iodata, but whatever caused CL to be CL rather than Cl in the first place.

Should we be concerned about this, or leave it as the responsibility of the user to pass better labelled pdb's to iodata?

TITLE     indomethacin_indomethacin
REMARK    THIS IS A SIMULATION BOX
CRYST1 3000.000 3000.000 3000.000  90.00  90.00  90.00 P 1           1
MODEL        1
ATOM      1  C1  XXX     1    1002.3891002.1441003.016  1.00  0.00
ATOM      2  C2  XXX     1    1001.0891001.9451002.264  1.00  0.00
ATOM      3  C3  XXX     1     999.8401002.0381002.867  1.00  0.00
ATOM      4  C4  XXX     1     998.9231001.8711001.764  1.00  0.00
ATOM      5  C5  XXX     1     999.6481001.6901000.605  1.00  0.00
ATOM      6  N6  XXX     1    1001.0471001.7681000.868  1.00  0.00
ATOM      7  C7  XXX     1    1002.1311001.735 999.982  1.00  0.00
ATOM      8  O8  XXX     1    1003.2631001.4431000.335  1.00  0.00
ATOM      9  C9  XXX     1    1001.8731002.152 998.556  1.00  0.00
ATOM     10  C10 XXX     1    1002.4821001.445 997.502  1.00  0.00
ATOM     11  C11 XXX     1    1002.2311001.801 996.167  1.00  0.00
ATOM     12  C12 XXX     1    1001.3961002.889 995.878  1.00  0.00
ATOM     13  C13 XXX     1    1000.8251003.634 996.921  1.00  0.00
ATOM     14  C14 XXX     1    1001.0701003.275 998.257  1.00  0.00
ATOM     15 CL15 XXX     1    1001.0771003.328 994.215  1.00  0.00
ATOM     16  C16 XXX     1     998.9921001.379 999.401  1.00  0.00
ATOM     17  C17 XXX     1     997.5841001.355 999.390  1.00  0.00
ATOM     18  C18 XXX     1     996.8471001.5931000.568  1.00  0.00
ATOM     19  C19 XXX     1     997.5241001.8391001.783  1.00  0.00
ATOM     20  O20 XXX     1     995.4701001.5461000.491  1.00  0.00
ATOM     21  C21 XXX     1     994.6241001.7061001.643  1.00  0.00
ATOM     22  C22 XXX     1     999.5591002.2611004.338  1.00  0.00
ATOM     23  C23 XXX     1     999.6311003.7241004.772  1.00  0.00
ATOM     24  O24 XXX     1    1000.6441004.3881004.825  1.00  0.00
ATOM     25  O25 XXX     1     998.4921004.2351005.231  1.00  0.00
ATOM     26  H26 XXX     1    1003.0141001.2471002.966  1.00  0.00
ATOM     27  H27 XXX     1    1002.2421002.3781004.075  1.00  0.00
ATOM     28  H28 XXX     1    1002.9601002.9751002.585  1.00  0.00
ATOM     29  H29 XXX     1    1003.1441000.613 997.724  1.00  0.00
ATOM     30  H30 XXX     1    1002.6821001.232 995.359  1.00  0.00
ATOM     31  H31 XXX     1    1000.1901004.486 996.692  1.00  0.00
ATOM     32  H32 XXX     1    1000.6211003.856 999.061  1.00  0.00
ATOM     33  H33 XXX     1     999.5391001.142 998.495  1.00  0.00
ATOM     34  H34 XXX     1     997.0531001.139 998.468  1.00  0.00
ATOM     35  H35 XXX     1     996.9941001.9891002.715  1.00  0.00
ATOM     36  H36 XXX     1     994.7821002.6911002.100  1.00  0.00
ATOM     37  H37 XXX     1     994.8381000.9261002.384  1.00  0.00
ATOM     38  H38 XXX     1     993.5711001.6251001.346  1.00  0.00
ATOM     39  H39 XXX     1    1000.2501001.6871004.964  1.00  0.00
ATOM     40  H40 XXX     1     998.5591001.8821004.580  1.00  0.00
ATOM     41  H41 XXX     1     998.8251005.1191005.462  1.00  0.00
ATOM     42  C1  XXX     2    1000.227 998.3851002.728  1.00  0.00
ATOM     43  C2  XXX     2    1000.499 998.3121001.240  1.00  0.00
ATOM     44  C3  XXX     2    1001.780 998.2301000.707  1.00  0.00
ATOM     45  C4  XXX     2    1001.547 998.162 999.286  1.00  0.00
ATOM     46  C5  XXX     2    1000.190 998.227 999.053  1.00  0.00
ATOM     47  N6  XXX     2     999.467 998.2451000.284  1.00  0.00
ATOM     48  C7  XXX     2     998.090 998.1271000.528  1.00  0.00
ATOM     49  O8  XXX     2     997.577 998.3871001.604  1.00  0.00
ATOM     50  C9  XXX     2     997.233 997.574 999.418  1.00  0.00
ATOM     51  C10 XXX     2     996.008 998.194 999.106  1.00  0.00
ATOM     52  C11 XXX     2     995.197 997.685 998.078  1.00  0.00
ATOM     53  C12 XXX     2     995.597 996.541 997.372  1.00  0.00
ATOM     54  C13 XXX     2     996.796 995.893 997.701  1.00  0.00
ATOM     55  C14 XXX     2     997.608 996.399 998.730  1.00  0.00
ATOM     56 CL15 XXX     2     994.588 995.910 996.088  1.00  0.00
ATOM     57  C16 XXX     2     999.708 998.359 997.738  1.00  0.00
ATOM     58  C17 XXX     2    1000.633 998.316 996.675  1.00  0.00
ATOM     59  C18 XXX     2    1002.013 998.164 996.922  1.00  0.00
ATOM     60  C19 XXX     2    1002.484 998.098 998.251  1.00  0.00
ATOM     61  O20 XXX     2    1002.867 998.134 995.839  1.00  0.00
ATOM     62  C21 XXX     2    1004.285 997.924 995.971  1.00  0.00
ATOM     63  C22 XXX     2    1003.084 998.2691001.473  1.00  0.00
ATOM     64  C23 XXX     2    1003.474 996.9371002.105  1.00  0.00
ATOM     65  O24 XXX     2    1003.050 996.5131003.158  1.00  0.00
ATOM     66  O25 XXX     2    1004.401 996.2461001.445  1.00  0.00
ATOM     67  H26 XXX     2     999.596 999.2491002.966  1.00  0.00
ATOM     68  H27 XXX     2    1001.137 998.4801003.329  1.00  0.00
ATOM     69  H28 XXX     2     999.705 997.4841003.070  1.00  0.00
ATOM     70  H29 XXX     2     995.688 999.071 999.665  1.00  0.00
ATOM     71  H30 XXX     2     994.258 998.172 997.832  1.00  0.00
ATOM     72  H31 XXX     2     997.094 994.998 997.161  1.00  0.00
ATOM     73  H32 XXX     2     998.535 995.890 998.986  1.00  0.00
ATOM     74  H33 XXX     2     998.655 998.510 997.524  1.00  0.00
ATOM     75  H34 XXX     2    1000.287 998.404 995.650  1.00  0.00
ATOM     76  H35 XXX     2    1003.540 998.024 998.484  1.00  0.00
ATOM     77  H36 XXX     2    1004.483 996.964 996.465  1.00  0.00
ATOM     78  H37 XXX     2    1004.738 998.732 996.559  1.00  0.00
ATOM     79  H38 XXX     2    1004.753 997.909 994.980  1.00  0.00
ATOM     80  H39 XXX     2    1003.053 999.0221002.265  1.00  0.00
ATOM     81  H40 XXX     2    1003.896 998.5821000.806  1.00  0.00
ATOM     82  H41 XXX     2    1004.458 995.4871002.050  1.00  0.00
TER
ENDMDL

BradenDKelly avatar Oct 15 '21 14:10 BradenDKelly

My two cents: it's an easy fix so we should support it.

PaulWAyers avatar Oct 16 '21 17:10 PaulWAyers

I'll make a PR for this. There is some unrelated work for the PR to pass, which takes some cleaning.

tovrstra avatar Oct 18 '21 07:10 tovrstra

@BradenDKelly I encountered the problem a few weeks ago. This happens because iodata first tries to identify the element (and atomic numbers) from columns 77-78 which is empty in your PDB (it is common for PDB files not to have this common apparently, but all the ones we tested with had this column). If not available, iodata using the atom name (columns 13-16) to identify the element by grabbing its first element. So, it wrongly interprets the CL15 atom name as a C atom. See See Overview of ATOM records.

I am not sure what is a quick fix, because we might have CD atom name in a protein, which should be interpreted as C element, but what if we have a Cadmium atom in a PDB (which would have a CD atom name as well, if I am not wrong).

FarnazH avatar Oct 18 '21 19:10 FarnazH

It is a surprising little problem that is not as easy to fix as it seems. It is unfortunate that often the last column is not present. In many ways that voids warranty. Because of the conundrum of proteins and other elements sharing the same abbrev, it makes it very difficult. Part of me certainly leans towards some kind of warning if that last column is missing with a note for the user sayimg to check it out... Kind of like how force field generating software will almost always give you parameters but sometimes the fineprint lets you know the parameters are probably better to use as random number generator seeds than as spring constants or torsion barriers.

BradenDKelly avatar Oct 19 '21 00:10 BradenDKelly

@FarnazH To come back to this: the columns 13-16 do not necessarily say which element it is, e.g. CA can be alpha carbon, not Calcium, CD can be delta carbon, not Cadmium. There may also be weird things like XY etc. The element should be put in columns 77-78. Programs that don't follow these conventions are essentially broken. #284 fixes this as much as is reasonably possible, but there is no perfect way of dealing with broken files. I'm going to include a warning when the load function does this kind of guesswork, so users know that they should be careful, despite the fact that we try to make the best possible guess.

tovrstra avatar Oct 19 '21 07:10 tovrstra