PDB load_one issue with atom type CL
When reading in a dimer of Indomethacin, whose formula is C19H16ClNO4, the chlorine is interpreted as a carbon. The PDB being read in was generated using GROMACS and is at the bottom of the message. Atom 15 is chlorine.
When I load this pdb using iodata and output the atom numbers I get the following
>>>from iodata import load_one
>>>mol = load_one("indo.pdb")
>>> print(mol.atnums)
[6 6 6 6 6 7 6 8 6 6 6 6 6 6 6 6 6 6 6 8 6 6 6 8 8 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 6 6 6 6 6 7 6 8 6 6 6 6 6 6 6 6 6 6 6 8 6 6 6 8 8 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1]
So we can see that atom 15 (and atom 56) has been read in as a carbon atom. I am not sure I blame load_one though, since the CL has a capital L. This is due to PSI4 outputting the optimized monomer geometry in a previous step with atom types all capitals, it then ripples down the workflow.
It is probably easy for us to parse the atomtypes in such a way as to detect CL as Cl rather than C as a favor to users, even though the fault I would say doesn't lie with iodata, but whatever caused CL to be CL rather than Cl in the first place.
Should we be concerned about this, or leave it as the responsibility of the user to pass better labelled pdb's to iodata?
TITLE indomethacin_indomethacin
REMARK THIS IS A SIMULATION BOX
CRYST1 3000.000 3000.000 3000.000 90.00 90.00 90.00 P 1 1
MODEL 1
ATOM 1 C1 XXX 1 1002.3891002.1441003.016 1.00 0.00
ATOM 2 C2 XXX 1 1001.0891001.9451002.264 1.00 0.00
ATOM 3 C3 XXX 1 999.8401002.0381002.867 1.00 0.00
ATOM 4 C4 XXX 1 998.9231001.8711001.764 1.00 0.00
ATOM 5 C5 XXX 1 999.6481001.6901000.605 1.00 0.00
ATOM 6 N6 XXX 1 1001.0471001.7681000.868 1.00 0.00
ATOM 7 C7 XXX 1 1002.1311001.735 999.982 1.00 0.00
ATOM 8 O8 XXX 1 1003.2631001.4431000.335 1.00 0.00
ATOM 9 C9 XXX 1 1001.8731002.152 998.556 1.00 0.00
ATOM 10 C10 XXX 1 1002.4821001.445 997.502 1.00 0.00
ATOM 11 C11 XXX 1 1002.2311001.801 996.167 1.00 0.00
ATOM 12 C12 XXX 1 1001.3961002.889 995.878 1.00 0.00
ATOM 13 C13 XXX 1 1000.8251003.634 996.921 1.00 0.00
ATOM 14 C14 XXX 1 1001.0701003.275 998.257 1.00 0.00
ATOM 15 CL15 XXX 1 1001.0771003.328 994.215 1.00 0.00
ATOM 16 C16 XXX 1 998.9921001.379 999.401 1.00 0.00
ATOM 17 C17 XXX 1 997.5841001.355 999.390 1.00 0.00
ATOM 18 C18 XXX 1 996.8471001.5931000.568 1.00 0.00
ATOM 19 C19 XXX 1 997.5241001.8391001.783 1.00 0.00
ATOM 20 O20 XXX 1 995.4701001.5461000.491 1.00 0.00
ATOM 21 C21 XXX 1 994.6241001.7061001.643 1.00 0.00
ATOM 22 C22 XXX 1 999.5591002.2611004.338 1.00 0.00
ATOM 23 C23 XXX 1 999.6311003.7241004.772 1.00 0.00
ATOM 24 O24 XXX 1 1000.6441004.3881004.825 1.00 0.00
ATOM 25 O25 XXX 1 998.4921004.2351005.231 1.00 0.00
ATOM 26 H26 XXX 1 1003.0141001.2471002.966 1.00 0.00
ATOM 27 H27 XXX 1 1002.2421002.3781004.075 1.00 0.00
ATOM 28 H28 XXX 1 1002.9601002.9751002.585 1.00 0.00
ATOM 29 H29 XXX 1 1003.1441000.613 997.724 1.00 0.00
ATOM 30 H30 XXX 1 1002.6821001.232 995.359 1.00 0.00
ATOM 31 H31 XXX 1 1000.1901004.486 996.692 1.00 0.00
ATOM 32 H32 XXX 1 1000.6211003.856 999.061 1.00 0.00
ATOM 33 H33 XXX 1 999.5391001.142 998.495 1.00 0.00
ATOM 34 H34 XXX 1 997.0531001.139 998.468 1.00 0.00
ATOM 35 H35 XXX 1 996.9941001.9891002.715 1.00 0.00
ATOM 36 H36 XXX 1 994.7821002.6911002.100 1.00 0.00
ATOM 37 H37 XXX 1 994.8381000.9261002.384 1.00 0.00
ATOM 38 H38 XXX 1 993.5711001.6251001.346 1.00 0.00
ATOM 39 H39 XXX 1 1000.2501001.6871004.964 1.00 0.00
ATOM 40 H40 XXX 1 998.5591001.8821004.580 1.00 0.00
ATOM 41 H41 XXX 1 998.8251005.1191005.462 1.00 0.00
ATOM 42 C1 XXX 2 1000.227 998.3851002.728 1.00 0.00
ATOM 43 C2 XXX 2 1000.499 998.3121001.240 1.00 0.00
ATOM 44 C3 XXX 2 1001.780 998.2301000.707 1.00 0.00
ATOM 45 C4 XXX 2 1001.547 998.162 999.286 1.00 0.00
ATOM 46 C5 XXX 2 1000.190 998.227 999.053 1.00 0.00
ATOM 47 N6 XXX 2 999.467 998.2451000.284 1.00 0.00
ATOM 48 C7 XXX 2 998.090 998.1271000.528 1.00 0.00
ATOM 49 O8 XXX 2 997.577 998.3871001.604 1.00 0.00
ATOM 50 C9 XXX 2 997.233 997.574 999.418 1.00 0.00
ATOM 51 C10 XXX 2 996.008 998.194 999.106 1.00 0.00
ATOM 52 C11 XXX 2 995.197 997.685 998.078 1.00 0.00
ATOM 53 C12 XXX 2 995.597 996.541 997.372 1.00 0.00
ATOM 54 C13 XXX 2 996.796 995.893 997.701 1.00 0.00
ATOM 55 C14 XXX 2 997.608 996.399 998.730 1.00 0.00
ATOM 56 CL15 XXX 2 994.588 995.910 996.088 1.00 0.00
ATOM 57 C16 XXX 2 999.708 998.359 997.738 1.00 0.00
ATOM 58 C17 XXX 2 1000.633 998.316 996.675 1.00 0.00
ATOM 59 C18 XXX 2 1002.013 998.164 996.922 1.00 0.00
ATOM 60 C19 XXX 2 1002.484 998.098 998.251 1.00 0.00
ATOM 61 O20 XXX 2 1002.867 998.134 995.839 1.00 0.00
ATOM 62 C21 XXX 2 1004.285 997.924 995.971 1.00 0.00
ATOM 63 C22 XXX 2 1003.084 998.2691001.473 1.00 0.00
ATOM 64 C23 XXX 2 1003.474 996.9371002.105 1.00 0.00
ATOM 65 O24 XXX 2 1003.050 996.5131003.158 1.00 0.00
ATOM 66 O25 XXX 2 1004.401 996.2461001.445 1.00 0.00
ATOM 67 H26 XXX 2 999.596 999.2491002.966 1.00 0.00
ATOM 68 H27 XXX 2 1001.137 998.4801003.329 1.00 0.00
ATOM 69 H28 XXX 2 999.705 997.4841003.070 1.00 0.00
ATOM 70 H29 XXX 2 995.688 999.071 999.665 1.00 0.00
ATOM 71 H30 XXX 2 994.258 998.172 997.832 1.00 0.00
ATOM 72 H31 XXX 2 997.094 994.998 997.161 1.00 0.00
ATOM 73 H32 XXX 2 998.535 995.890 998.986 1.00 0.00
ATOM 74 H33 XXX 2 998.655 998.510 997.524 1.00 0.00
ATOM 75 H34 XXX 2 1000.287 998.404 995.650 1.00 0.00
ATOM 76 H35 XXX 2 1003.540 998.024 998.484 1.00 0.00
ATOM 77 H36 XXX 2 1004.483 996.964 996.465 1.00 0.00
ATOM 78 H37 XXX 2 1004.738 998.732 996.559 1.00 0.00
ATOM 79 H38 XXX 2 1004.753 997.909 994.980 1.00 0.00
ATOM 80 H39 XXX 2 1003.053 999.0221002.265 1.00 0.00
ATOM 81 H40 XXX 2 1003.896 998.5821000.806 1.00 0.00
ATOM 82 H41 XXX 2 1004.458 995.4871002.050 1.00 0.00
TER
ENDMDL
My two cents: it's an easy fix so we should support it.
I'll make a PR for this. There is some unrelated work for the PR to pass, which takes some cleaning.
@BradenDKelly I encountered the problem a few weeks ago. This happens because iodata first tries to identify the element (and atomic numbers) from columns 77-78 which is empty in your PDB (it is common for PDB files not to have this common apparently, but all the ones we tested with had this column). If not available, iodata using the atom name (columns 13-16) to identify the element by grabbing its first element. So, it wrongly interprets the CL15 atom name as a C atom. See See Overview of ATOM records.
I am not sure what is a quick fix, because we might have CD atom name in a protein, which should be interpreted as C element, but what if we have a Cadmium atom in a PDB (which would have a CD atom name as well, if I am not wrong).
It is a surprising little problem that is not as easy to fix as it seems. It is unfortunate that often the last column is not present. In many ways that voids warranty. Because of the conundrum of proteins and other elements sharing the same abbrev, it makes it very difficult. Part of me certainly leans towards some kind of warning if that last column is missing with a note for the user sayimg to check it out... Kind of like how force field generating software will almost always give you parameters but sometimes the fineprint lets you know the parameters are probably better to use as random number generator seeds than as spring constants or torsion barriers.
@FarnazH To come back to this: the columns 13-16 do not necessarily say which element it is, e.g. CA can be alpha carbon, not Calcium, CD can be delta carbon, not Cadmium. There may also be weird things like XY etc. The element should be put in columns 77-78. Programs that don't follow these conventions are essentially broken. #284 fixes this as much as is reasonably possible, but there is no perfect way of dealing with broken files. I'm going to include a warning when the load function does this kind of guesswork, so users know that they should be careful, despite the fact that we try to make the best possible guess.