failing to read PDB files generated by VMD
When trying to read this file (and other similar ones), I get:
julia> read("vmd.pdb", BioStructures.PDB)
ERROR: Two copies of the same atom have the same alternative location ID. Existing atom:
Atom N with serial 1, coordinates [-41.156, -40.019, 21.411]
New atom record to add:
AtomRecord N with serial 1501, coordinates [-42.523, -41.722, -19.881]
This is one example VMD pdb file.
The offending PDB lines are
ATOM 1 N POPE 1 -41.156 -40.019 21.411 1.00 0.00 L11 N
and
ATOM 1501 N POPE 1 -42.523 -41.722 -19.881 1.00 0.00 L21 N
which both try to put an atom with name N onto residue 1 of the chain with empty chain ID.
We treat this defensively since there are no or very few cases in the PDB that have such duplicate atoms. I reported a handful of cases to the PDB a while ago and they updated the records, so I believe it is considered a format violation.
In this case you could use different chain IDs (the empty column 22) or keep incrementing the residue number.
Since this file was written by VMD, I assume this should be reported to them? @lmiq, can you do that?
Well, there is nothing wrong with that file for the use it was designed for, so I do not think there's anything to report there, upstream. The parsing of the residues in that file is simply done by incrementing the residue counter when the residue index changes (for more or less). Having limitations in the number of residues per chain, or number of chains, etc, is something that cannot be important in MD simulations PDB files.
This is the choice to be made here: have or not the possibility of parsing non-standard files to some degree.
I perfectly understand if BioStructures keeps strictly adhering to the standard, but that would not allow us using it in the most broad context of MD simulations.
Note that this limitation, specifically, is associated to trying to read the data into the hierarchical structure, so in some sense this is related to that initial choice of representation.
Also, note the L11 and L21 segment identifiers. These are used as the "top" level of hierarchy in the context of VMD and some simulations packages, but are barely used for reporting experimental structures.
If it's true that it violates https://www.wwpdb.org/documentation/file-format (I haven't checked), I'd say that's clearly a problem. It's not OK that there's a workaround. It introduces ambiguities and issues, precisely as being discussed here.
But if the format is ambiguous, then that's another issue entirely.
It's certainly challenging to balance the considerations.
Note that this limitation, specifically, is associated to trying to read the data into the hierarchical structure, so in some sense this is related to that initial choice of representation.
This is the key, BioStructures guarantees that every atom has "meaning" about what it is (i.e. on which residue and chain), which is useful in many contexts (such as file interchange) but in turn necessitates some complexities in representation and strictness in parsing.
The only way the original case could work currently is if the existing atom is overwritten (or the new atom ignored) with a warning, which is unlikely to be desired behaviour even if it runs without error.
I guess philosophically the aim of BioStructures is to not so much to read structural files, but to represent unambiguously the molecules within them. This affects other design decisions, such as why countatoms will treat disordered versions of the same atom as one (unless expand_disordered is used).
But if the format is ambiguous, then that's another issue entirely.
Sadly, this seems to be the case. There is a lot of documentation on the column format, but as far as I know not so much on the row format (e.g. what duplicates mean). In general I prefer it when other tool authors use MODEL blocks to distinguish conformations in a MD trajectory though. That works well in BioStructures.
From the description, that file does not adhere to the standard because:
- Non-blank alphanumerical character is used for chain identifier.
and there the chain identifier is blank. But the standard does not say anything about two residues in the same chain having the same number (which is actually the issue here). For actual bio-structures it is just assumed that that does not make sense. But when part of the system is thousands of water molecules, that is just unnecessarily limiting.
Given that the format is ambiguous, then perhaps we do need to deal with ambiguity. I don't think this should be open-ended: it's kind of a disaster to take the strategy "can I find something that seems to make sense in column 5? No? OK, let's see if things seem more sensible if I substitute the value in column 11? No? OK, maybe I can infer a good value from column 13?" But if instead one passes flavor=:vmd to the reader, and that follows strict rules, things are more sensible.
when part of the system is thousands of water molecules
is this part of the issue? That (legacy) PDB only lets you use a single character for encoding the chain? In which case shouldn't VMD be writing mmCIF instead?
As a minimal example:
ATOM 4247 H HOH 1 -28.310 3.370 -27.343 1.00 0.00 HATI
ATOM 4248 H HOH 1 -27.809 4.855 -27.803 0.00 0.00 HATI
ATOM 4249 O HOH 1 -27.550 4.020 -27.318 0.00 0.00 HATI
ATOM 4250 H HOH 2 -19.010 5.493 44.300 0.00 0.00 HATI
ATOM 4251 H HOH 2 -18.958 5.838 45.896 1.00 0.00 HATI
ATOM 4252 O HOH 2 -19.119 6.215 44.984 0.00 0.00 HATI
ATOM 4253 H HOH 1 12.951 6.641 28.351 1.00 0.00 HATI
ATOM 4254 H HOH 1 13.590 7.926 29.130 0.00 0.00 HATI
ATOM 4255 O HOH 1 13.304 6.973 29.226 0.00 0.00 HATI
VMD, Pymol, Packmol, and other common simulation packages interpret this without problems. There are three water molecules there.
The issue is that this does not fit into an hierarchical structure, unless the reader creates chains arbitrarily. That would come with a bunch of problems as well.
is this part of the issue? That (legacy) PDB only lets you use a single character for encoding the chain? In which case shouldn't VMD be writing mmCIF instead?
Well, in some sense yes, the PDB format is problematic. But that ship has sailed decades ago in the MD field. Not that there aren't other formats that are more appropriate, particularly for the limitations of the coordinates fields, but the legacy PDB format is still widely used because of its readability. This is something we (MD simulation package authors) just have to live with.
ps: MIToS is reading this if one adds occupancy and b-factor fields (added now), because its hierarchy is at the residue level only, and it just not cares about the "meaning" of two residues with identical numbers in the same chain.
So maybe we need a flavor=:md mode or something? @jgreener64 would that be acceptable?
Yes, but the question still remains about how to put the two atoms into the same hierarchical structure.
We could assign an unused residue number or chain ID, but that feels clunky and runs into problems if the assigned value is found later during reading. We could have another data type that stores multiple coordinates for the same atom, but that raises concerns about how to write the file back out in the same order. We could assign duplicates into a new Model, which would mimic MODEL/ENDMDL tags but doesn't fit the case of many waters in a MD file.
Maybe the flavor should be the type of hierarchy desired: :segment, :chain, :residue (MIToS now), :atom (PDBTools) or whatever. I don't know, however, if that would make most of the other functionality of BioStructures fail to interoperate now, and thus the package very complex.
I guess my thought is that we decide on concrete flavor-specific behavior and document it. Passing a flavor might require that the entire file be scanned first to identify "occupied" names/indices/whatever, which would be a bit of a performance hit, but perhaps the price of abusing a file format.
I think it is a little more complicated than that. If we keep the hierarchy we are forced to attribute some of the fields arbitrarily. That will cause confusion.
Isn't the hierarchy about how you assign bonds? Meaning, you have implicit hierarchy any time you want to link entries in rows together.
The issue is that multiple atoms want the same "spot" in the hierarchy, because the file contains multiple objects which have the same name. So you either have to discard one atom, have multiple versions of the hierarchy, or store duplicate atoms in that spot (beyond disorder, which we already do).
We could have a flag renumber_residues when reading that ignores the residue numbering in the file, and increments the assigned residue number every time the residue number changes. This would read cases like
ATOM 4247 H HOH 1 -28.310 3.370 -27.343 1.00 0.00 HATI
ATOM 4248 H HOH 1 -27.809 4.855 -27.803 0.00 0.00 HATI
ATOM 4249 O HOH 1 -27.550 4.020 -27.318 0.00 0.00 HATI
ATOM 4250 H HOH 2 -19.010 5.493 44.300 0.00 0.00 HATI
ATOM 4251 H HOH 2 -18.958 5.838 45.896 1.00 0.00 HATI
ATOM 4252 O HOH 2 -19.119 6.215 44.984 0.00 0.00 HATI
ATOM 4253 H HOH 1 12.951 6.641 28.351 1.00 0.00 HATI
ATOM 4254 H HOH 1 13.590 7.926 29.130 0.00 0.00 HATI
ATOM 4255 O HOH 1 13.304 6.973 29.226 0.00 0.00 HATI
as
ATOM 4247 H HOH 1 -28.310 3.370 -27.343 1.00 0.00 HATI
ATOM 4248 H HOH 1 -27.809 4.855 -27.803 0.00 0.00 HATI
ATOM 4249 O HOH 1 -27.550 4.020 -27.318 0.00 0.00 HATI
ATOM 4250 H HOH 2 -19.010 5.493 44.300 0.00 0.00 HATI
ATOM 4251 H HOH 2 -18.958 5.838 45.896 1.00 0.00 HATI
ATOM 4252 O HOH 2 -19.119 6.215 44.984 0.00 0.00 HATI
ATOM 4253 H HOH 3 12.951 6.641 28.351 1.00 0.00 HATI
ATOM 4254 H HOH 3 13.590 7.926 29.130 0.00 0.00 HATI
ATOM 4255 O HOH 3 13.304 6.973 29.226 0.00 0.00 HATI
but would still error on
ATOM 4247 H HOH 1 -28.310 3.370 -27.343 1.00 0.00 HATI
ATOM 4248 H HOH 1 -27.809 4.855 -27.803 0.00 0.00 HATI
ATOM 4249 O HOH 1 -27.550 4.020 -27.318 0.00 0.00 HATI
ATOM 4250 H HOH 1 -19.010 5.493 44.300 0.00 0.00 HATI
ATOM 4251 H HOH 1 -18.958 5.838 45.896 1.00 0.00 HATI
ATOM 4252 O HOH 1 -19.119 6.215 44.984 0.00 0.00 HATI
ATOM 4253 H HOH 1 12.951 6.641 28.351 1.00 0.00 HATI
ATOM 4254 H HOH 1 13.590 7.926 29.130 0.00 0.00 HATI
ATOM 4255 O HOH 1 13.304 6.973 29.226 0.00 0.00 HATI
Renumbering like this is a useful feature anyway that is available in a number of PDB packages.
I'm not sure if I understand your comment. In the MD files, I think it is safe to assume that the "residue" is the object by excellence. But the "residue number", or "name", "chain", or any other property, are just meaningless labels that help one to draw or select different parts of the structure.
Thus, if there was an option to just flatten the hierarchy to have "residues" at the top level, the MD files could be parsed without major issues (as MIToS does now, and PDBTools parses at the atom level but does provide an iterator over residues).
What I can´t say is that having the file parsed in a different hierarchy would require all the other functions of BioStructures to have additional methods, thus effectively creating a parallel package within it.
By the way, I noticed now that:
julia> pdb = read("./test.pdb", BioStructures.PDB)
ProteinStructure test.pdb with 1 models, 2 chains (A, ), 284 residues, 2078 atoms
julia> typeof(pdb)
BioStructures.ProteinStructure
that per-se would not fit quite well in MD files, as many of them do not have any protein whatsoever.
We could have a flag
renumber_residueswhen reading that ignores the residue numbering in the file,
That would help, but I think it would still fall short, as the hierarchy on chains is not really meaningful in this context.
Concerning this:
ATOM 4247 H HOH 1 -28.310 3.370 -27.343 1.00 0.00 HATI
ATOM 4248 H HOH 1 -27.809 4.855 -27.803 0.00 0.00 HATI
ATOM 4249 O HOH 1 -27.550 4.020 -27.318 0.00 0.00 HATI
ATOM 4250 H HOH 1 -19.010 5.493 44.300 0.00 0.00 HATI
ATOM 4251 H HOH 1 -18.958 5.838 45.896 1.00 0.00 HATI
ATOM 4252 O HOH 1 -19.119 6.215 44.984 0.00 0.00 HATI
ATOM 4253 H HOH 1 12.951 6.641 28.351 1.00 0.00 HATI
ATOM 4254 H HOH 1 13.590 7.926 29.130 0.00 0.00 HATI
ATOM 4255 O HOH 1 13.304 6.973 29.226 0.00 0.00 HATI
That specifically is more complicated. VMD and Pymol recognize that only because they compute the connectivity from the distances and deduce that there are 3 residues there. That I think is beyond what is expected for a reading package. Also, Amber, for instance, just requires TER to be added after each molecule, and ignores everything else.
as many of them do not have any protein whatsoever
The name ProteinStructure is an early design decision I have long regretted. Perhaps if we do a breaking release for https://github.com/BioJulia/BioStructures.jl/issues/49 we can rename it to MolecularStructure or something. Hopefully it shouldn't break too much code, and it should be easy enough to search and make PRs for in other repos.
But the "residue number", or "name", "chain", or any other property, are just meaningless labels that help one to draw or select different parts of the structure... Thus, if there was an option to just flatten the hierarchy to have "residues" at the top level, the MD files could be parsed without major issues
I think I begin to see your point. You're basically saying that it's OK with you if all the waters use the same label and you can't select them individually, right? That's a valid perspective if either (1) you don't actually care about bonds, or (2) there's an external program to assign bonds. I'd guess that few people are actually in camp 1, but camp 2 is presumably common. Is there any ambiguity, though, about assigning bonds simply by distance? If so, then the reader is still partially responsible for indicating hierarchy, like via the TER that Amber requires.
So to me it seems that some kind of renumbering, based on an unambiguous termination indicator, might be a reasonable solution?
Bonds (or more generally the topology of the molecules) are defined in different files in MD simulations. In the PDB files they can be written with the CONECT keyword, at the end, but most commonly (never? not sure) these are not used.
There are certainly problems in assigning bonds based on distances. Sometimes atoms from different molecules are too close and that causes errors. VMD throws erros/warnings in these cases, when it detects that some atom appears to have more bonds that allowed. But assigning bonds based on distances is useful but only a workaround when the topology files are not provided.
I think I begin to see your point. You're basically saying that it's OK with you if all the waters use the same label and you can't select them individually, right?
You could still select one individually by a residue counter (which might or not match the "residue number" written in the PDB file), exactly how it happens when such files are read by MIToS), where the residue counter is just the index of that residue in the residue array that results from reading.
So to me it seems that some kind of renumbering, based on an unambiguous termination indicator, might be a reasonable solution?
The packages dissociate the residue and atom counts from the "residue number" and "atom index" as written in the PDB files. What we do, which I think is a general enough solution, is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.
I'd favor strict-by-default and requiring that you pass renumber=true to activate that mode, but renumbering seems like a sensible solution.
is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.
Just to check, the case in https://github.com/BioJulia/BioStructures.jl/issues/48#issuecomment-2129597753 doesn't fit that description, does it? Are you wishing that would work anyway or are you OK with requiring some kind of termination signal?
What we do, which I think is a general enough solution, is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.
I'd favor strict-by-default and requiring that you pass renumber=true to activate that mode, but renumbering seems like a sensible solution.
This is what I had in mind with renumber_residues, I can look into making the change. Having different molecules in the same chain is quite common in the PDB, e.g. for waters.
Note, this would mean you can't write the original PDB file back out as we would discard the residue numbers in the file.
This is what I had in mind with
renumber_residues, I can look into making the change. Having different molecules in the same chain is quite common in the PDB, e.g. for waters.
Uhm... I don't think residue renumbering is the correct approach. We do not want residue numbers to be changed when reading protein residues, for example.
What is necessary is to dissociate the residue counter from the residue number as written in the PDB file. (And I would say also dissociate the atom counter from the atom index as written in the PDB file). Note: VMD has different attributes for each: resnum and resid. In PDBTools I also have that, and also added pdb_index for the atom index as written in the PDB file.
I see now that the residues are stored in a Dict in BioStructures, which makes things slightly harder. If it was an ordered Dict, the index of the residue in the ordered Dict would be the residue counter.
I wander if having the residues of each chain in a vector wouldn't be more appropriate - I'm not very comfortable with the order of the residues in the file being meaningless. For instance this is how the residues of h6n6 appear:
julia> pdb[1].chains["A"].residues
Dict{String, BioStructures.AbstractResidue} with 282 entries:
"407" => Residue 407:A with name LYS, 9 atoms
"371" => Residue 371:A with name ARG, 11 atoms
"447" => Residue 447:A with name ILE, 8 atoms
"335" => Residue 335:A with name ASN, 8 atoms
not having them in order is somewhat strange.
Just changing the dicts to OrderedDict provides:
julia> pdb[1].chains["A"].residues
OrderedDict{String, AbstractResidue} with 282 entries:
"225" => Residue 225:A with name SER, 6 atoms
"226" => Residue 226:A with name ALA, 5 atoms
"227" => Residue 227:A with name ASN, 8 atoms
"228" => Residue 228:A with name GLU, 9 atoms
which then, if read with the "new residue approach", would implicitly store both the counter and the residue numer. But I would probably prefer having a ResidueVector type with a overloaded getindex function to retrieve residue["225"] when appropriate.
Or having to explicitly use something like residue[ResID(225)], as having the residue number as a string is already not ideal.
Just to check, the case in #48 (comment) doesn't fit that description, does it? Are you wishing that would work anyway or are you OK with requiring some kind of termination signal?
@timholy sorry, missed this comment. Yes, that case I think we can leave out. Maybe support TER after each molecule as workaround, as an important MD package (Amber) uses that, but it would be very specific (I had to support that in Packmol, nevertheless).
Am I understanding right that the key in the OrderedDict case would be the residue number from the file, and the ordering is based on the residue counter? How would this work in the case of multiple residues with the same residue number in the file (since that would be multiple values with the same key)?
I guess this could be solved with a vector type, but then what should be returned when you index into it with "225", considering that there might be multiple residues with that residue number in the file?
I'm not very comfortable with the order of the residues in the file being meaningless
The order is not stored directly in the object, but the residues are sorted when writing out. This means that different file representations of the same underlying molecule get written out to the same file, but you do lose the ordering of the input file. In the usual case of ascending residue numbers in the input file (possibly with gaps), this is preserved in the output file.
We do not want residue numbers to be changed when reading protein residues, for example.
I think @jgreener64's point is that this admirable goal makes things harder---if you only increment the counter when something is in conflict, what happens if a later structure in the file uses the number you "stole" in order to do the increment? Two possible solutions are (1) to parse the file twice to identify all conflicts in advance, or (2) transiently distinguish "unconflicted" ids and "conflicted" ids and then assign final ids after all unconflicted ones have been assigned. While I proposed (1) above, maybe (2) is the better choice.
I see now that the residues are stored in a Dict in BioStructures, which makes things slightly harder. If it was an ordered Dict, the index of the residue in the ordered Dict would be the residue counter.
I've learned that there's a separate field that gives the order of the keys, and you can access the dict with integer indexing as dict[keyorder[i]]. So effectively there is an OrderedDict even if it isn't represented that way. I'm not sure I fully understand how disordered locations (when you have more than one location for a residue or atom) interact that way, though, which I think is the main purpose behind using strings as residue keys.
@timholy...yes, that case I think we can leave out.
👍
I think I need to give a step back and explain how we use the data here.
From our perspective, all fields of the PDB are just labels. Apart from the data of the fields, in the process of reading the file, one annotates, incrementally, an independent residue counter and an independent atom index counter.
At the end, we have a vector of atoms where each atom carries all that information:
julia> pdb = PDBTools.readPDB("/home/leandro/Downloads/6hn6.pdb")
Array{Atoms,1} with 2090 atoms with fields:
index name resname chain resnum residue x y z occup beta model segname index_pdb
1 N SER A 225 1 43.004 80.351 76.389 1.00 118.26 1 - 1
2 CA SER A 225 1 42.216 81.220 75.509 1.00 116.77 1 - 2
3 C SER A 225 1 42.952 81.518 74.177 1.00 118.83 1 - 3
⋮
2088 O HOH A 638 306 23.920 74.193 77.532 1.00 55.18 1 - 2089
2089 O HOH A 639 307 6.395 85.305 51.528 1.00 104.62 1 - 2090
2090 O HOH A 640 308 19.024 61.159 68.618 1.00 64.07 1 - 2091
There, for instance, index is the incremental atom counter, and residue is the incremental residue counter, while index_pdb is the content of the index field in the PDB file (rarely used) and resnum is the content of the residue number field in the PDB file.
Then, working with such a data structure consists in applying filters, which return vectors of atoms, or indexes:
julia> filter(sel"resnum < 300", pdb)
Array{Atoms,1} with 587 atoms with fields:
index name resname chain resnum residue x y z occup beta model segname index_pdb
1 N SER A 225 1 43.004 80.351 76.389 1.00 118.26 1 - 1
2 CA SER A 225 1 42.216 81.220 75.509 1.00 116.77 1 - 2
3 C SER A 225 1 42.952 81.518 74.177 1.00 118.83 1 - 3
⋮
585 CG1 ILE A 299 75 23.370 60.986 68.597 1.00 44.83 1 - 585
586 CG2 ILE A 299 75 23.842 62.784 70.389 1.00 46.63 1 - 586
587 CD1 ILE A 299 75 24.450 60.003 68.979 1.00 49.51 1 - 587
julia> findall(sel"residue = 1", pdb)
6-element Vector{Int64}:
1
2
3
4
5
6
We do not need these filtering operations to be particularly fast, or lazy. But we do need to be able to select subsets of the structure with great versatility, potentially using incremental or pdb-defined residue numbers (very common), or any other atom property.
That data structure doesn't prevent us from iterating over residues, with an appropriate iterator:
julia> for res in eachresidue(filter(sel"residue <= 2", pdb))
println(resname(res)," ",resnum(res))
end
SER 225
ALA 226
and we could (but didn´t yet) define similar iterators for models, chains, etc, of course. We just don't use those as often.
These way dealing with the data does not have restrictions about duplicate residues, residue numbers, etc. If two residues have the same data, they will just be filtered together, as we would expect them to. Still we can differentiate them by their incremental residue counter and incremental atom indices. There are no conflicts when reading the data, and no special issues associated to repeated fields.
These way of storing and using the data is convenient for us. And I think that we must understand why or how it is not convenient for other people, to justify having different data structures, and to which extent.
For us, using a syntax like pdb["A"].residues["247"].atoms[1] is not very useful nor practical, since most of the time we are selecting subsets of atoms that may belong to different chains, residues, or molecules, like element O and water, or protein and backbone. Having the underlying data in a tree does not really help much.
We've also gotten a lot of benefit from random-access indexing. For example we might pick out all the positively-charged residues and examine their spatial distribution. We also compute displacement vectors from the alpha carbon to the side chain center-of-mass to determine whether residues in a 7TM are "interior" or "exterior." And so on.
I don't think the pdb["A"] portion of this will get in our way, but indeed when it gets down to the residue level we need utilities that seem missing from BioStructures. I'm optimistic all that is resolvable but it may take some careful design. And I agree that it can't successfully be a platform for the whole community unless we find ways to make that kind of thing easy.
That said, I think BioStructures is doing something important: it focuses on structural representation rather than file format, and it can get multiple file formats into that same representation. It also seems to take the complexities of that process seriously. It's why I'm prepared to do a fair amount of work to port our own code over to BioStructures, if the technical issues (convenience, comprehensiveness, speed, heaviness of dependency, etc) can all be resolved. I for one am quite optimistic that this is the case. And if not, I think the goal is worthy even if we have to have a fresh start. Fundamentally, the current fragmentation of the representations for structures is very detrimental to the long-term health of the Julia ecosystem for questions that relate to structural biology.
I have a conference deadline in a few weeks, so I am not sure how much I can do between now and then, but this is something I'm willing to put some work into to help make it happen.