biojava Better support for symmetry in the Structure model

PDB and mmCIF files use symmetry operators to reduce the number of atoms which need to be specified. This is used with NCS to reconstruct the asymmetric unit (e.g. in viruses), as well as for specifying biological assemblies (BA).

BioJava is able to read and understand the symmetry operations required to generate the full structure, but the data model is far from ideal for this. Most analysis applications require all atom positions to be stored in an array, so we would like to be able to store a representation of the full biological assembly. The problem is that the current model assumes that chainIDs are unique within a particular model. Thus, dealing with BAs requires one of these work-arounds:

Use multiple Structure objects. Cons: Can't use methods requiring a single Structure object. Some methods which take Atom[] assume that all atoms share a structure through the getParent() hierarchy, e.g. for cloning atom arrays. Structure metadata (e.g. header data) is lost, duplicated, or inconsistent.
Rename chainIDs so that all are unique. Workable, now that mmCIF officially supports 4 character chains. Cons: Difficult to map back to original chainID and symmetry operations. No write support (pending #188).
Use multiple models within a single Structure. Current approach for RCSB-supplied BA files and for structure alignment files. Cons: Only intended for NMR structures. Difficult to map back to original symmetry operation. Other tools (specifically pymol, but also jmol to a lesser extent) expect models to have identical contents and be superimposed.

A long-term solution would be to associate chains with a particular NCS and CS operator. These could be additional objects in the Structure hierarchy, or could just be fields in Chain. For instance:

Structure
Model
Unit Cell
Asymmetric Unit
Chain
Group
Atom

Dec 12 '14 12:12 sbliven

I support a long-term solution. All the workarounds cause problems as you’ve mentioned below. -Peter

Dec 13 '14 00:12 pwrose

What about different implementations of Structure (and the other interfaces?). I never intended that there would be only one StructureImpl, ChainImpl, etc. and there could be alternative implementations that provide access to dynamically generated transformed objects.

Dec 13 '14 01:12 andreasprlic

@andreasprlic Alternative implementations could work, but we would still have to clarify how the Structure API should deal with redundant chains. We could have sparse and verbose representations of large structures in memory, but the interfaces still need a way to uniquely identify chains.

Dec 15 '14 09:12 sbliven

I discussed how Jmol handles multiple copies of chains with Bob Hanson, since it handles symmetry quite well. Basically, Jmol has properties for the symop (unique to each asymmetric unit) and molecule number (unique to each covalently connected molecule, I think). These are computed lazily, so they don't slow down parsing unless you actually reference them.

"molecules" aren't numbered in Jmol. When you ask for "molecule=1" Jmol will do a connectivity analysis for all the "molecules" and give you the first. If the molecules are from symmetry operations, the asymmetric unit molecules will be first (operator 1, which is always "x,y,z"), then operator 2, then 3, etc., including individual rotations of a C3 axis. Note that some operators may generate the same atom (those in "special positions"), and in that case, there will not be duplications. But I think that is not relevant to biomolecules. You can also access specific atoms by their symmetry operators: select symop=1; select symop=2; etc.

We could do a similar lazy expansion of structures when needed. Of course, it still means that the identifier for a particular chain needs to include the either the operators used to generate that chain or else some identifier similar to the Jmol molecule index.

BTW, I realized I didn't define the terms NCS and CS in the opening post. For future reference:

NCS = non-crystallographic symmetry, e.g. that needed to reconstruct the asymmetric unit in viruses (MTRIX* lines in PDB)
CS = crystallographic symmetry, which comes from the space group (REMARK 290).
BA = biological assembly, of which there may be many (REMARK 350). Can be generated from operators from either NCS or CS.

Dec 18 '14 14:12 sbliven

In Jmol you can access a particular chain by chainId and symop, for example :A&symop=3. I think this would fit quite well with BioJava which operates on chains, rather than molecules at this point.

I like the idea of lazy expansion. Taken a step further, under low memory conditions, symmetry copies that are not currently being used could be garbage collected, i.,e., by using a soft hash map.

-Peter

From: Spencer Bliven <[email protected]mailto:[email protected]> Reply-To: biojava/biojava <[email protected]mailto:[email protected]> Date: Thursday, December 18, 2014 at 6:07 AM To: biojava/biojava <[email protected]mailto:[email protected]> Cc: Peter Rose <[email protected]mailto:[email protected]> Subject: Re: [biojava] Better support for symmetry in the Structure model (#220) Resent-From: Peter Rose <[email protected]mailto:[email protected]>

I discussed how Jmol handles multiple copies of chains with Bob Hanson, since it handles symmetry quite well. Basically, Jmol has properties for the symop (unique to each asymmetric unit) and molecule number (unique to each covalently connected molecule, I think). These are computed lazily, so they don't slow down parsing unless you actually reference them.

"molecules" aren't numbered in Jmol. When you ask for "molecule=1" Jmol will do a connectivity analysis for all the "molecules" and give you the first. If the molecules are from symmetry operations, the asymmetric unit molecules will be first (operator 1, which is always "x,y,z"), then operator 2, then 3, etc., including individual rotations of a C3 axis. Note that some operators may generate the same atom (those in "special positions"), and in that case, there will not be duplications. But I think that is not relevant to biomolecules. You can also access specific atoms by their symmetry operators: select symop=1; select symop=2; etc.

We could do a similar lazy expansion of structures when needed. Of course, it still means that the identifier for a particular chain needs to include the either the operators used to generate that chain or else some identifier similar to the Jmol molecule index.

BTW, I realized I didn't define the terms NCS and CS in the opening post. For future reference:

NCS = non-crystallographic symmetry, e.g. that needed to reconstruct the asymmetric unit in viruses (MTRIX* lines in PDB)
CS = crystallographic symmetry, which comes from the space group (REMARK 290).
BA = biological assembly, of which there may be many (REMARK 350). Can be generated from operators from either NCS or CS.

— Reply to this email directly or view it on GitHubhttps://github.com/biojava/biojava/issues/220#issuecomment-67489789.

Dec 18 '14 17:12 pwrose

I think the solution should come through introducing a more general identifier for molecules within the Structure. Currently this is the situation:

the chain id is an identifier within the asymmetric unit, i.e. an asym id
the model id is an identifier for multiple conformations

Those 2 have well-defined uses, so we shouldn't touch them, we need to introduce a new identifier, let's call it biological assembly identifier (BA id) . The identifier would then be used in this hierarchy:

Structure (a biological assembly)
Model (a conformation)
Chain (identified by a biological assembly id)
Group
Atom

The hierarchy would roughly work like this:

Structure internally uses the BA id to keep track of all chains: like that it can contain both AU chains and other symmetry-expanded chains. The Chains contain both the BA id (the "primary key") and the asym id as a subpart of the BA id
The symmetry generated chains are not expanded unless needed explicitly (can be done lazily for instance), e.g. if we want the coordinates of the whole BA then we can expand the needed symmetry partners, indicated in the biounit transformations in the header
Thus by default the Chains in Structure are just those in the AU and then, with no expansion, the asym ids (chain ids) can be treated as usual
The BA id can be composed of: asym id, NCS op and CS op. It should really be its own class, something like:

public class BioAssemblyId {
   private String asymId;
   private Matrix4d ncsOp;
   private Matrix4d csOp;
}

If one wants to write a symmetry-expanded Structure out to file (PDB, mmCIF) then a convention has to be defined, be it using model ids or modified chain ids. This is in any case independent from the internal representation (see also related issue #226)

This is more or less putting together all of the ideas above. Does this sound reasonable or are there any obvious flaws?

Jan 08 '15 14:01 josemduarte

@josemduarte As I understand it, in your scheme it would not be possible to store multiple biological assemblies in a single structure. I would tend towards making a Structure object correspond roughly to an mmCIF file, meaning that it should include information about all ~~biological~~ assemblies and allow iterating through them. I cross out biological, because the assemblies could also include asymmetric unit assemblies (viruses) and unit cell assemblies.

I think you're right about chains needing asymId, ncsOp, and csOp to be uniquely identified. I would make the 'assembly' a virtual level of the hierarchy, like model is now. You have to specify some value for the assembly number whenever you iterate over the chains (maybe defaulting to the unique coordinates from the file). These would be implemented in a lazy manner as a reference to a ChainImpl object and the rotation matrices.

Jan 21 '15 14:01 sbliven

I would also like to give my opinion on the topic, because I encountered some problems while doing structural alignments with Biological Assemblies (BA). The structural alignment algorithms were designed for the comparison/analysis of single domains or chains, but there is a specific problem in the symmetry analysis (identify combined quaternary+internal point groups) that requires running the alignment on the full BA. Furthermore, going into the future we might want the StructureAlignment interface to support the comparison of full BA, in which case these problems will also come up.

The problems:

Get the representative Atom array: the input of the algorithms is an Atom array of the structure, containing the representative (backbone) Atoms. We need all the backbone Atoms in the BA, which includes the Atoms of the symmetry generated chains. The method StructureTools.getRepresentativeAtomArray(Structure s) returns only the Atoms in the first model, avoiding the symmetry generated chains.
Display the alignment in Jmol: currently in BioJava, each of the Structures aligned is stored as a new model in a PDB file, which is then sent to Jmol for display. We clearly need an extra level for displaying two aligned Biological Assemblies, because the models are used both for symmetry generated chains and for structures aligned.

The solutions:

The first problem is generated by the overuse of the model index. It means something different for NMR structures and XTAL structures, and as @sbliven said, models are expected to have identical contents to be superimposed (hence the name model). In my opinion, the current use of models for storing symmetry generated chains needs to be changed, but in the meanwhile the method StructureTools.getRepresentativeAtomArray(Structure s) could check if the Structure is NMR and use the first model only if true, or use all models to obtain representative Atoms if false (because the models contain the symmetry generated chains).
This is more of a communication problem between BioJava and Jmol, but also shows that using multiple models for representing the Biological Assembly is not a good solution. It also does not make sense to use different models to store the superimposed structures, since they should be ideally stored in different Structure objects with their coordinates changed to be superimposed (discussed in #201). There is no provisional solution to this problem with the current implementation, because one of the two (BA or structures superimposed) should not be stored as a model.

In conclusion, I agree with @josemduarte in that the long-term solution is to introduce an additional identifier for BA. This will solve problem 1. Because the file formats are fixed, the additional identifier will not solve problem 2, but I think this problem has to be fixed on the Jmol side (being able to display multiple Structures).

Sep 21 '15 13:09 lafita

Let's override getRepresentativeAtomArray with an additional parameter indicating whether all models should be included. Yes, atoms would no longer have unique ids, but they would still have non-identical Chain objects (with identical chain ids). We could also a method to rewrite chain ids if needed (e.g. A_0, A_1, etc).

I can't think of any cases off the top of my head where having multiple Chain objects with the same ID would break the current structure alignment framework, although it's possible there are some maps somewhere with String keys instead of Chains (e.g. in AlignmentTools).

Sep 22 '15 12:09 sbliven

Btw, do any of you RCSB guys know if there are plans for incorporating multiple structures into a single mmcif file in the future, through file format improvements?

Sep 22 '15 12:09 sbliven

I'm not aware of any plans for multi-structure files. Since each data section starts with data_xxx, it should be straight forward to have multiple structures in a single file. You may want to check the original cif papers to see if they mention multi-structure files.

On Tue, Sep 22, 2015 at 5:06 AM, Spencer Bliven [email protected] wrote:

Btw, do any of you RCSB guys know if there are plans for incorporating multiple structures into a single mmcif file in the future, through file format improvements?

— Reply to this email directly or view it on GitHub https://github.com/biojava/biojava/issues/220#issuecomment-142269223.

Peter Rose, Ph.D. Site Head, RCSB Protein Data Bank West (http://www.rcsb.org) San Diego Supercomputer Center (http://bioinformatics.sdsc.edu) University of California, San Diego +1-858-822-5497

Sep 22 '15 15:09 pwrose