biopython Flipping origin-spanning locations with strand=None in circular molecules

Setup

I am reporting a problem with Biopython version, Python version, and operating system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)

3.11.6 (main, Nov  2 2023, 04:51:19) [Clang 14.0.0 (clang-1400.0.29.202)]
CPython
macOS-12.6.7-x86_64-i386-64bit
1.81

In Biopython and GenBank files, locations that span the origin are represented by a compound location. For example, for a sequence of length 6 a feature that spans two residues before the origin and the first residue after the origin is represented as join{[4:6](+), [0:1](+)}. When we flip that feature, it becomes join{[0:2](-), [5:6](-)}. The order of the loc.parts and the strand is inverted, so the meaning of the flipping is clear.

However, when a feature without strand spans the origin in the same way (join{[4:6], [0:1]}) and gets flipped, the parts get inverted (join{[0:2], [5:6]}). It might be confusing to interpret that this is a feature that spans the origin.

The same applies to multi-part features, see example below:

from Bio.SeqFeature import SimpleLocation, SeqFeature
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.SeqIO import write
loc = SimpleLocation(4, 6, 1) + SimpleLocation(0, 1, 1)
print(loc) # join{[4:6](+), [0:1](+)}
print(loc._flip(6)) # join{[0:2](-), [5:6](-)}

loc = SimpleLocation(4, 6, None) + SimpleLocation(0, 1, None)
print(loc) # join{[4:6], [0:1]}
print(loc._flip(6)) # join{[0:2], [5:6]}


# Example of a multi-part location that spans the origin of a sequence with length 20
loc = SimpleLocation(15, 16, 1) + SimpleLocation(18, 20, 1) + SimpleLocation(0, 1, 1) + SimpleLocation(5, 10, 1)
print(loc) # join{[15:16](+), [18:20](+), [0:1](+), [5:10](+)}
print(loc._flip(20)) # join{[4:5](-), [0:2](-), [19:20](-), [10:15](-)}
seqr = SeqRecord(Seq('A'*20), features=[SeqFeature(loc, type='misc_feature')], annotations={'molecule_type': 'DNA'})
write([seqr], 'example_strand1_fwd.gb', 'genbank')
seqr_rvs = seqr.reverse_complement()
seqr_rvs.annotations['molecule_type'] = 'DNA'
write([seqr_rvs], 'example_strand1_rvs.gb', 'genbank')

loc = SimpleLocation(15, 16, None) + SimpleLocation(18, 20, None) + SimpleLocation(0, 1, None) + SimpleLocation(5, 10, None)
print(loc) # join{[15:16], [18:20], [0:1], [5:10]}
print(loc._flip(20)) # join{[4:5], [0:2], [19:20], [10:15]}
seqr = SeqRecord(Seq('A'*20), features=[SeqFeature(loc, type='misc_feature')], annotations={'molecule_type': 'DNA'})
seqr_rvs = seqr.reverse_complement()
seqr_rvs.annotations['molecule_type'] = 'DNA'
write([seqr], 'example_strandNone_fwd.gb', 'genbank')
write([seqr_rvs], 'example_strandNone_rvs.gb', 'genbank')

In the example, I save the sequences with multi-part features to gb files. As you can see, snapgene interprets properly all cases, except the inverted feature without strand. You can access gb files here.

Screenshot 2024-01-31 at 12 06 38

If you agree that this is something that should be changed, I can give it a go and propose a solution.

Jan 31 '24 12:01 manulera

This is handled by the private ._flip method of the CompoundLocation class:

https://github.com/biopython/biopython/blob/biopython-183/Bio/SeqFeature.py#L1641

As per the motivating example documented there, the behaviour assumes the original feature's part order is meaningful (which is true for a gene or CDS record, but perhaps not true for your misc_feature?) and therefore deliberately preserves the part order.

Jan 31 '24 12:01 peterjc

assumes the original feature's part order is meaningful and therefore deliberately preserves the part order

Hi @peterjc, I agree with this when the strand is indicated, but not sure if that should be the case for features without a strand if we consider the origin-spanning case. For instance, say we wanted to indicate a recombination hotspot, or something that does not have an orientation. If it would fall on the origin, then the example could be the same as above:

loc = SimpleLocation(4, 6, None) + SimpleLocation(0, 1, None)
print(loc) # join{[4:6], [0:1]}
print(loc._flip(6)) # join{[0:2], [5:6]}

The most important consequence (I think) is that the second feature will not be interpreted correctly by SnapGene or other sequence viewers. The solution for this would be to invert the parts when flipping a feature with strand = None. In a way this makes some sense because when Biopython writes features to GenBank, it treats strand=None features as plus-strand features, so to be plus-like in the reverse-complemented Seqrecord, the elements should be inverted.

Perhaps the multi-part origin-spanning example without strand is not very meaningful, I struggle to imagine a use-case for it.

I don't feel very strongly about this, but the sequence viewer case makes me prefer the alternative solution.

Jan 31 '24 15:01 manulera

So your suggestion would be reverse the parts order if all the parts have no strand? That seems OK to me - if any have a strand then the orientation and order is in someway meaningful.

[And yes, GenBank/EMBL locations sadly do not have an explicit way to say no strand]

Feb 01 '24 11:02 peterjc

Happy to contribute a pull request if you don't think this will be disruptive / cause problems

Feb 01 '24 11:02 manulera

I think this proposal is narrow enough not to cause many problems - and should fix your use case. I'd be happy to review a pull request on this.

Feb 01 '24 12:02 peterjc