"Fasta file contains a sequence identifier which is too long" error even though all headers are under 50 characters, and the run fails
I received this error when running RepeatMasker on a genome:
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 ) at /usr/local/RepeatMasker/RepeatMasker line 1541.
WARNING: Retrying batch ( 9151 ) [ 25,, 58446]...
The run then eventually fails.
However, I checked all of the headers in my fasta file and the longest is 46 characters, which is still below the max id length, so I am wondering why this error occurred.
Is the true maximum header length less than 50 characters?
I made a short test file with a 46-character header and it was fine. In fact, exactly 50 characters is fine and 51 fails.
Can you share a link to your exact fasta file if it's publicly available, and/or post the sequence headers (i.e. the output of grep -E '^>' genome.fa)? It would be helpful to debug if there is some subtle issue with the counting.
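If it helps, here is a minimal sketch for listing the longest headers. It is not part of RepeatMasker. It assumes the identifier is the text between `>` and the first whitespace, and it also reports the full header length in case the whole line is what gets counted. The function names and `genome.fa` are just placeholders.

```python
def header_lengths(lines):
    """Return (id_length, full_length, header) for every '>' line.

    Assumption: the identifier is the text between '>' and the first
    whitespace; full_length counts the whole header line without '>'.
    """
    rows = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            parts = header.split()
            seq_id = parts[0] if parts else ""
            rows.append((len(seq_id), len(header), header))
    return rows


def longest_headers(lines, top=5):
    """Return the `top` headers with the longest identifiers, longest first."""
    return sorted(header_lengths(lines), key=lambda r: r[0], reverse=True)[:top]


# Small in-memory demo; for a real file, pass open("genome.fa") instead.
demo = [">scaffold_1 length=100\n", "ACGT\n", ">chr2\n", "GGCC\n"]
for id_len, full_len, header in longest_headers(demo):
    print(id_len, full_len, header)
```

Any header that comes out close to 50 in either column would be worth a closer look.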
Unfortunately the fasta isn't publicly available, but the headers are here:
https://drive.google.com/file/d/15mNTeTSvi30BZTOdQ6tB9kpl-q1-ovFb/view?usp=sharing
Ah, I think I see what's happening now. Long sequences -- longer than I tested for my previous comment -- are split into multiple files, and the text frag-<number> is added to the sequence name to keep track of the pieces. That added text counts toward the 50-character limit, so the effective maximum header length ends up being closer to 40 or 45 for sequences that have to be split up.
Ah, okay! That makes sense. I have implemented a pre-processing step in my pipeline to rename the headers, so this should be fine for me now.
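For anyone who finds this later, a renaming step along these lines can be sketched as follows. This is only an illustration, not my exact pipeline code: the `seq` prefix, the 6-digit counter, and the function name are arbitrary choices. They give 9-character IDs, which leaves plenty of room below the 50-character limit for whatever fragment text gets appended. The mapping lets you restore the original names in the output afterwards.

```python
def rename_headers(lines, prefix="seq", width=6):
    """Replace every FASTA header with a short sequential ID.

    Returns (new_lines, mapping), where mapping goes from the new ID
    back to the original header text (without the leading '>').
    The prefix and width are arbitrary, hypothetical defaults.
    """
    new_lines = []
    mapping = {}
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            new_id = f"{prefix}{count:0{width}d}"
            mapping[new_id] = line[1:].rstrip("\r\n")
            new_lines.append(f">{new_id}\n")
        else:
            # Sequence lines are passed through unchanged.
            new_lines.append(line)
    return new_lines, mapping


# Small in-memory demo; a real run would read from and write to files.
demo = [">a_very_long_original_scaffold_name extra info\n", "ACGT\n"]
renamed, id_map = rename_headers(demo)
print("".join(renamed), end="")
print(id_map)
```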