Running large genomes
Hi,
I would like to compare multiple genomes of about 2 Gb each, but the run fails with 'The input sequence Zm-A188-REFERENCE-KSU-1.0 is too long. The technical limit is 1073741823.' Is there a good solution?
Thank you for your help!
Hi,
Thanks for your interest in phylonium. The reason for the length limit is the suffix array construction. We are using a third-party library, libdivsufsort. By default it only supports 32-bit indices. As we need to build a suffix array of the forward and reverse strings of the reference, the combined string is limited to 2^31 - 1 characters, which limits the reference itself to 2^30 - 1 = 1073741823.
Creating a 64-bit version should be possible but requires a bit of work. We would have to double the bit width of all our data structures and then call the correct version of libdivsufsort. I don't think I have the bandwidth to work on that any time soon, but pull requests are always appreciated.
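To make the arithmetic behind the limit concrete, here is a minimal sketch. It assumes the indexed text is roughly twice the reference length (forward plus reverse strand, ignoring any separator or terminator characters) and that indices are signed integers. The function names are hypothetical and are not part of phylonium or libdivsufsort.

```python
def max_reference_length(index_bits: int = 32) -> int:
    """Longest reference whose forward + reverse text fits in a signed index.

    Assumption: the suffix array covers about 2 * reference_length characters.
    """
    max_index = 2 ** (index_bits - 1) - 1  # largest signed index value
    return max_index // 2                  # half of it is left for the reference


def fits(reference_length: int, index_bits: int = 32) -> bool:
    """Return True if a reference of this length fits under the limit."""
    return reference_length <= max_reference_length(index_bits)


if __name__ == "__main__":
    print(max_reference_length(32))   # the limit quoted in the error message
    print(fits(2_000_000_000, 32))    # a ~2 Gb genome with 32-bit indices
    print(fits(2_000_000_000, 64))    # the same genome with 64-bit indices
```

Under these assumptions the 32-bit result matches the 1073741823 reported in the error message, and a 2 Gb reference only fits once the index width is 64 bits.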
Thank you for your timely reply, looking forward to the next version of phylonium!
I read a paper criticizing phylonium for not supporting >1Gbp genomes. @kloetzl, I know you have graduated, but it would be good to support 64-bit suffix arrays. There are other alignment-free tools but it is not always easy to run them.
Thanks for reminding me of this issue. I can give it a go and build a 64-bit suffix array version. That shouldn't be an issue as long as divsufsort64 is available. I just don't want to use 64-bit indices by default, as that comes with a memory and runtime overhead.
Will put it back on my todo list.
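For a rough sense of the memory overhead mentioned above, the following sketch estimates the size of the suffix array alone. It assumes one index per character of the forward plus reverse text and ignores every other data structure, so real usage will be higher. The function name is hypothetical.

```python
def suffix_array_bytes(reference_length: int, index_bits: int) -> int:
    """Estimate the bytes needed for the suffix array only.

    Assumption: one index per character of the forward + reverse text.
    """
    text_length = 2 * reference_length
    bytes_per_index = index_bits // 8
    return text_length * bytes_per_index


if __name__ == "__main__":
    n = 1_000_000_000  # a 1 Gbp reference
    for bits in (32, 64):
        gib = suffix_array_bytes(n, bits) / 2**30
        print(f"{bits}-bit indices: {gib:.1f} GiB")
```

With these assumptions, switching from 32-bit to 64-bit indices doubles the size of the array for the same input.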
I'm rather proud of my past self, who not only added a check for sequence length but also a useful error message. Furthermore, it was surprisingly easy to add support for longer sequences. I've created a separate branch for now and will do some more testing before releasing a new version.
I have just released phylonium 1.7 with 64-bit indexes by default. Have fun!