Run big size genome

Open MrbrilliantLL opened this issue 3 years ago • 2 comments

Hi,

I would like to compare multiple genomes of about 2 Gb each, but the run fails with the message 'The input sequence Zm-A188-REFERENCE-KSU-1.0 is too long. The technical limit is 1073741823.' Is there a good solution?

Thank you for your help!

MrbrilliantLL avatar Jul 13 '22 10:07 MrbrilliantLL

Hi,

Thanks for your interest in phylonium. The length limit comes from the suffix array construction. We use a third-party library, libdivsufsort, which by default supports only 32-bit indices. Because we build a suffix array over both the forward and the reverse string of the reference, the combined text is limited to 2^31 - 1 characters, which caps the reference itself at 2^30 - 1 = 1073741823.
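To make the arithmetic behind that limit concrete, here is a minimal sketch. It is not phylonium's actual code: the helper names are hypothetical, and it assumes the indexed text is laid out as forward strand, one separator character, then reverse strand. Under that assumption a signed 32-bit index yields exactly the 1073741823 reported in the error message.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, not taken from phylonium.
// Assumption: the indexed text is forward + 1 separator + reverse,
// so its length is 2 * reference_length + 1 and must fit in Index.
template <typename Index>
constexpr std::uint64_t max_reference_length() {
    return (static_cast<std::uint64_t>(std::numeric_limits<Index>::max()) - 1) / 2;
}

// True if a reference of the given length can be indexed with Index.
template <typename Index>
constexpr bool fits_in_index(std::uint64_t reference_length) {
    return reference_length <= max_reference_length<Index>();
}
```

With these definitions, `fits_in_index<std::int32_t>(2000000000)` is false for a genome of roughly 2 Gb, while `fits_in_index<std::int64_t>(2000000000)` is true.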

Creating a 64-bit version should be possible but requires a bit of work. We would have to double the bit width of all our data structures and then call the correct version of libdivsufsort. I don't think I have the bandwidth to work on that any time soon, but pull requests are always appreciated.
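One common way to widen every data structure at once is a single index alias chosen at compile time. The sketch below only illustrates that idea. The macro `USE_64BIT_INDEX`, the alias `sa_index`, and the struct `match_span` are all hypothetical names, and phylonium's real types may look different. The comments name the libdivsufsort entry point (`divsufsort` or `divsufsort64`) that each width would pair with.

```cpp
#include <cstdint>

// Hypothetical build switch; not a real phylonium option.
#ifdef USE_64BIT_INDEX
using sa_index = std::int64_t;  // would pair with divsufsort64()
#else
using sa_index = std::int32_t;  // would pair with divsufsort()
#endif

// Hypothetical record: every field that stores a text position or a
// length uses the alias, so its width follows the build switch.
struct match_span {
    sa_index position;
    sa_index length;
};
```

Keeping the width behind one alias means the switch to 64 bits touches a single definition rather than every struct that stores a position.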

kloetzl avatar Jul 13 '22 11:07 kloetzl

Thank you for your timely reply, looking forward to the next version of phylonium!

MrbrilliantLL avatar Jul 13 '22 11:07 MrbrilliantLL

I read a paper criticizing phylonium for not supporting >1Gbp genomes. @kloetzl, I know you have graduated, but it would be good to support 64-bit suffix arrays. There are other alignment-free tools but it is not always easy to run them.

lh3 avatar Mar 06 '23 00:03 lh3

Thanks for reminding me of this issue. I can give it a go and build a 64-bit suffix array version. That shouldn't be an issue as long as divsufsort64 is available. I just don't want to use 64-bit indices by default, as that comes with a memory and runtime overhead.
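As a rough illustration of the memory side of that overhead, the sketch below counts only the bytes of the suffix array itself, at one index per text position. The helper name is hypothetical, and the estimate ignores every other data structure as well as the runtime cost.

```cpp
#include <cstdint>

// Hypothetical helper: bytes used by a suffix array that stores
// one Index per position of the indexed text.
template <typename Index>
constexpr std::uint64_t suffix_array_bytes(std::uint64_t text_length) {
    return text_length * sizeof(Index);
}
```

For a text at the 32-bit maximum of 2^31 - 1 positions, this gives about 8 GiB with 32-bit indices and about 16 GiB with 64-bit indices, so widening the index doubles the array even for inputs that never needed it.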

Will put it back on my todo list.

kloetzl avatar Mar 06 '23 19:03 kloetzl

I'm rather proud of my past self, who not only added a check for sequence length but also wrote a useful error message. Furthermore, it was surprisingly easy to add support for longer sequences. I've created a separate branch for now. I will do some more testing before releasing a new version.

kloetzl avatar Mar 10 '23 20:03 kloetzl

I have just released phylonium 1.7 with 64-bit indexes by default. Have fun!

kloetzl avatar Apr 22 '23 12:04 kloetzl