FOGSAA PairwiseAligner implementation
-
[X] I hereby agree to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.
-
[X] I have read the
CONTRIBUTING.rstfile, have runpre-commitlocally, and understand that continuous integration checks will be used to confirm the Biopython unit tests and style checks pass with these changes. -
[X] I have added my name to the alphabetical contributors listings in the files
NEWS.rstandCONTRIB.rstas part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
Closes #4776
This PR adds an implementation of the Fast Optimal Global Alignment Algorithm (FOGSAA) to Bio.Align.PairwiseAligner. Right now this PR is still a work in progress and not ready to be merged. The implementation is still in its infancy and there are likely a number of bugs. I wanted to make a PR to give time for feedback on the code and discussion on whether or not adding FOGSAA to PairwiseAligner is even a good idea.
The majority of this PR is taken and adapted from Angana Chakraborty's C++ code. In the linked issue @CallMeMisterOwl has indicated the author has OK'd the code for use in BioPython, but we'll probably want to check in again if this PR will eventually be merged.
Right now only the DNA scoring algorithm has been added. Matrix scoring for proteins and alignment are still TODO. To make writing the code easier for me, I haven't put everything in macros like the other algorithms, but this will be done before the final version. There are some flaws in the DNA scoring implementation. Chakraborty's code only uses integers for scoring, but the rest of PairwiseAligner uses doubles. Before the final version this will be addressed (it will require replacing/changing the queue data structure used in the original). The code also creates a full 2D alignment matrix, but some sources (ex, this README) say the algorithm can be done without it. There are almost certainly more optimizations that can be done too.
PairwiseAligner selects the best algorithm by the parameters given. In the linked issue, it was suggested that FOGSAA be made the default algorithm, but IMO that won't be a good idea until after the final version has been battle-tested in production. For now, I've added a setter to the .algorithm property of PairwiseAligner. I think this setter is a good idea on its own right, but if others disagree it can be removed in the final version.
I'll update this description as progress continues.
Update August 9 (commit https://github.com/biopython/biopython/pull/4784/commits/793ff19af506c4efa52ff0aa2aad7ea9cec3adca): Support for scoring with a scoring matrix should now be implemented. The code has now been moved into macros similar to the other algorithm implementations.
Update August 14 (commit https://github.com/biopython/biopython/pull/4784/commits/3f398f57e5aa0c68c04448ecb6a9adc784514e98): Per mdehoon's suggestions, the .algorithm setter has been removed. The FOGSAA algorithm is now selected by setting the mode attribute of PairwiseAligner to "fogsaa".
Update August 22: Commits https://github.com/biopython/biopython/pull/4784/commits/b0c365f85a0ae9bc0c78ac3f2511cfe4105d7a0e and https://github.com/biopython/biopython/pull/4784/commits/98eb4f5355937c92f240322b2cefef5692a1afd5 should now add basic support for aligning two sequences with FOGSAA.
~~Note to self (and others following this PR, I guess): Tests are failing on macOS and Windows because of a bug in NumPy (see https://github.com/biopython/biopython/pull/4797). Passes on Linux and is working locally.~~ The problem has been fixed.
I haven't had much time to work on this over the past month, but I should have more time soon. Right now the only feature yet to be implemented is support for different gap scores on the ends of strands. ~~A bigger problem though seems to be performance. From my rudimentary benchmarking with a few random sequences from GenBank, it looks like FOGSAA is running at least 10x slower than Needleman-Wunsch. Right now I'm not sure where this is coming from (my current guesses are inefficiencies in the priority queue, incorrect bounds, or some other sort of bug); I'll need to do some more testing.~~ Nevermind about the performance; most of the slowness was from running Python with the debug allocator. It still is usually slower than the Needleman-Wunsch code, but runs faster than the original Chakraborty code.
I'll also do more extensive testing on the correctness of this code. My last few commits fixed a good number of bugs, but there very well could be more bugs I haven't caught yet.
Odd. Tests are failing on macOS with Python 3.12.6. My dev environment is macOS with Python 3.12.5 the tests pass fine.
The exception being raised should only be raised if the lower bound is still less than the upper bound when all feasible paths have been exhausted, but a debug message I added says the lower bound is 0.200000 and the upper bound is also 0.200000, so no exception should be raised. I don't really know what's going on but I suspect some kind of floating point comparison bug.
Edit: it was in fact a floating point comparison bug
I think this PR should be out of draft stage now. As long as I haven't missed anything, this FOGSAA implementation should give the same results as the Chakraborty code and implement all the features all the other PairwiseAligners have.
@mdehoon Do you want to take a look over this again? I know it's a lot of code, so no rush.
Odd, it looks like CircleCI failed to checkout the branch, but the rest of the tests work so it can't be an issue with the code. I'm guessing this is just a CircleCI thing or just bad luck.
Thank you!