Ideas for Assembling an Extremely Large Dataset
Hello, I have NovaSeq 150 bp PE data that was sequenced on two separate runs to obtain the quantity of data we needed. I want to co-assemble both runs, but my dilemma is that I can only allocate 996 GB of RAM. My job was killed because it ran out of memory, and the SPAdes log noted that approximately 1118 GB of RAM would be needed to assemble. Would it be advisable to run the error-correction-only step separately on each run and then co-assemble the output of both in assembler-only mode, along the lines of the sketch below? Is that possible? Do you have any ideas beyond normalizing the data? Thank you for your time.
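For concreteness, here is a minimal sketch of the workflow I have in mind, assuming metaSPAdes (--meta, which accepts only a single paired-end library, hence the pooling step). All file names, thread counts, and memory values are placeholders, not tested commands:

```python
#!/usr/bin/env python3
"""Sketch of a two-stage SPAdes workflow: error-correct each NovaSeq run
separately, then co-assemble the corrected reads with --only-assembler.
File names, thread/memory settings, and the --meta assumption are placeholders."""

import glob
import shutil
import subprocess

THREADS, MEM_GB = "32", "996"   # -m is the SPAdes RAM cap in GB; adjust to the node

runs = {
    "run1": ("run1_R1.fastq.gz", "run1_R2.fastq.gz"),   # hypothetical input names
    "run2": ("run2_R1.fastq.gz", "run2_R2.fastq.gz"),
}

# Stage 1: read error correction only (BayesHammer), one job per run.
for name, (r1, r2) in runs.items():
    subprocess.run(["spades.py", "--meta", "--only-error-correction",
                    "-1", r1, "-2", r2,
                    "-t", THREADS, "-m", MEM_GB, "-o", f"ec_{name}"],
                   check=True)

# metaSPAdes takes a single paired-end library, so pool the corrected reads from
# both runs. Corrected files land in <outdir>/corrected/ as *.cor.fastq.gz; check
# corrected/corrected.yaml for the exact names and keep any unpaired reads out of
# the R1/R2 pools (the glob below assumes names still containing _R1/_R2).
for mate in ("R1", "R2"):
    with open(f"pooled_{mate}.cor.fastq.gz", "wb") as out:
        for name in runs:
            for f in sorted(glob.glob(f"ec_{name}/corrected/*_{mate}*.cor.fastq.gz")):
                with open(f, "rb") as src:
                    shutil.copyfileobj(src, out)   # gzip members can be concatenated

# Stage 2: assembly only, skipping error correction, on the pooled corrected reads.
subprocess.run(["spades.py", "--meta", "--only-assembler",
                "-1", "pooled_R1.cor.fastq.gz", "-2", "pooled_R2.cor.fastq.gz",
                "-t", THREADS, "-m", MEM_GB, "-o", "coassembly"],
               check=True)
```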
One approach is to use longer k-mers. I have found that the default k-mer set for metagenomics is not enough; longer k-mers reduce RAM consumption during tandem-repeat resolution. If enough SSD space is available, I personally use 21,33,55,77 or 21,33,55,77,99,127.
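As a sketch of what I mean, building on the assembler-only co-assembly above (the input paths and resource settings are placeholders; the explicit -k list is the only point):

```python
#!/usr/bin/env python3
"""Sketch: passing an explicit, longer k-mer list to SPAdes via -k.
Input paths are hypothetical; only the -k setting matters here."""

import subprocess

subprocess.run(["spades.py", "--meta", "--only-assembler",
                "-k", "21,33,55,77,99,127",      # longer ladder; 127 is the SPAdes maximum k
                "-1", "pooled_R1.cor.fastq.gz",  # hypothetical pooled corrected reads
                "-2", "pooled_R2.cor.fastq.gz",
                "-t", "32", "-m", "996",
                "-o", "coassembly_k127"],
               check=True)
```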