Can I run ProcessRepeats in parallel?
What do you want to know?
Can I run ProcessRepeats in parallel?
Helpful context
- Is there a particular genome assembly or organism your question is about? If possible, please provide a link to a publicly available assembly and/or a species name.
  No
- Have you installed RepBase RepeatMasker Edition for RepeatMasker?
  No
Dear Robert,
I have a 1.7 GB bat genome and ran RepeatMasker on it using the mammalian repeat records (~295 MB) extracted from Dfam:
~/TOOLS/TETools/TETools.sif famdb.py -i ~/TOOLS/TETools/Libraries families --format fasta_name --include-class-in-name --ancestors --descendants 'Mammalia' > Dfam-Mammalia.fa
~/TOOLS/TETools/TETools.sif RepeatMasker -pa 96 -a -e ncbi -dir . -nolow -lib Dfam-Mammalia.fa -xsmall -gff genome 2>&1 | tee repeatMasker.log
However, the ProcessRepeats step is very slow: it reads all the cat files into memory and runs in a single thread, taking over 12 hours. To speed this up, I am considering splitting the genome by chromosome and running ProcessRepeats in parallel. Is this feasible?
Thank you so much.
You could split the *.cat file by sequence and run each piece independently through ProcessRepeats. We typically handle large genomes by splitting them into 50 MB chunks and running each chunk through a full RepeatMasker run on a cluster (see the RepeatMasker_Nextflow script here: https://github.com/Dfam-consortium/RepeatMasker_Nextflow).
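For anyone who wants to try the chunking idea by hand, here is a minimal Python sketch that groups the sequences of a genome FASTA into batches of roughly 50 MB each, so that every batch can go through its own RepeatMasker run. This is only an illustration under stated assumptions, not what RepeatMasker_Nextflow itself does: the function names (`read_fasta`, `split_fasta`), the `batch_NNNN.fa` file naming, and the choice to keep whole sequences together (so a chromosome larger than the target simply gets its own batch) are all hypothetical.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header_line, sequence_text) pairs, preserving original line breaks."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(genome: Path, out_dir: Path, target_bp: int = 50_000_000) -> List[Path]:
    """Write whole sequences into batch files of at most ~target_bp bases each.

    Hypothetical helper: sequences are never cut, so a single sequence
    longer than target_bp ends up alone in its own batch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    batches: List[Path] = []
    out = None
    current_bp = 0
    for header, seq in read_fasta(genome):
        seq_bp = sum(len(line.strip()) for line in seq.splitlines())
        # Start a new batch if adding this sequence would exceed the target.
        if out is None or (current_bp > 0 and current_bp + seq_bp > target_bp):
            if out is not None:
                out.close()
            batch_path = out_dir / f"batch_{len(batches) + 1:04d}.fa"
            out = open(batch_path, "w")
            batches.append(batch_path)
            current_bp = 0
        out.write(header)
        out.write(seq)
        current_bp += seq_bp
    if out is not None:
        out.close()
    return batches


if __name__ == "__main__":
    # "genome.fa" and "batches" are placeholder paths.
    for batch in split_fasta(Path("genome.fa"), Path("batches")):
        print(batch)
```

Each batch file could then be submitted as a separate RepeatMasker job with the same options as the original run, and the per-batch results combined afterwards. Because no sequence is cut, the annotation coordinates in each batch should not need adjusting.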
Thank you so much for your advice. I will try it.