Can I run ProcessRepeats in parallel?

Open life404 opened this issue 2 years ago • 2 comments

What do you want to know? Can I run ProcessRepeats in parallel?

Helpful context

  • Is there a particular genome assembly or organism your question is about? If possible, please provide a link to a publicly available assembly and/or a species name. No
  • Have you installed RepBase RepeatMasker Edition for RepeatMasker? No

Dear Robert,

I have a 1.7 GB bat genome and conducted RepeatMasker analysis using mammalian repeat records (~295 MB) from Dfam:

~/TOOLS/TETools/TETools.sif famdb.py -i ~/TOOLS/TETools/Libraries families --format fasta_name --include-class-in-name --ancestors --descendants 'Mammalia' > Dfam-Mammalia.fa

~/TOOLS/TETools/TETools.sif RepeatMasker -pa 96 -a -e ncbi -dir . -nolow -lib Dfam-Mammalia.fa -xsmall -gff genome 2>&1 | tee repeatMasker.log

However, the ProcessRepeats step is very slow: it reads all of the cat files into memory and runs in a single thread, and it took over 12 hours. To speed this up, I am considering splitting the genome by chromosome and running ProcessRepeats on the pieces in parallel. Is this feasible?

Thank you so much.

life404 • Jan 12 '24 00:01

You could split the *.cat file by sequences and run them independently through ProcessRepeats. We typically run large genomes by splitting them into 50 MB chunks and running them through full RepeatMasker runs on a cluster (see the RepeatMasker_Nextflow script: https://github.com/Dfam-consortium/RepeatMasker_Nextflow).
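
For anyone who wants to try the first option, here is a minimal Python sketch of splitting a *.cat file by query sequence. It is not part of RepeatMasker, and the names `iter_cat_records` and `split_cat` are hypothetical. It assumes each alignment record starts with a header line of the form `score div del ins query qstart qend (left) ...`, that the lines following a header belong to that record, and that trailing summary lines start with `##`. Check these assumptions against your own *.cat file first. The sketch drops the `##` summary lines, so whether ProcessRepeats needs them regenerated for each split file is something to verify separately.

```python
import re
from pathlib import Path

# Assumed header layout: "score div del ins query qstart qend (left) ..."
HEADER = re.compile(
    r"^\d+\s+\d+(?:\.\d+)?\s+\d+(?:\.\d+)?\s+\d+(?:\.\d+)?\s+(\S+)\s+\d+\s+\d+\s+\(\d+\)"
)


def iter_cat_records(lines):
    """Yield (query_name, record_lines) for each alignment record.

    Lines before the first header and "##" summary lines are skipped.
    """
    query, record = None, []
    for line in lines:
        match = HEADER.match(line)
        if match or line.startswith("##"):
            # A new header or a summary line closes the current record.
            if query is not None:
                yield query, record
            query, record = (match.group(1), [line]) if match else (None, [])
        elif query is not None:
            record.append(line)
    if query is not None:
        yield query, record


def split_cat(cat_path, out_dir):
    """Append each record to <out_dir>/<query>.cat (hypothetical naming).

    Streams the input, so the whole file is never held in memory.
    Use an empty output directory, since files are opened in append mode.
    Query names containing "/" would need sanitizing first.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(cat_path) as src:
        for query, record in iter_cat_records(src):
            with open(out / f"{query}.cat", "a") as dst:
                dst.writelines(record)
```

Calling `split_cat("genome.cat", "cat_by_seq")` (both paths are placeholders) would write one file per sequence, and each file could then be passed to its own ProcessRepeats job. Reopening the output file for every record keeps the code short but is slow for large inputs; caching open file handles would be the obvious refinement.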

rmhubley • Jan 18 '24 23:01

Thank you so much for your advice. I will try it.

life404 • Jan 19 '24 01:01