RepeatMaskerLib.embl not built (DateRepeats)
Describe the issue
RepeatMaskerLib.embl is not built while configuring RepeatMasker-4.1.2-p1, but it is required by DateRepeats:
rpm_maker:RepeatMasker/RepeatMasker-4.1.2-p1 > DateRepeats
Indicate directory with the RepeatMasker repeat libraries near line 136 of /opt/gensoft/exe/RepeatMasker/4.1.2-p1/bin/DateRepeats
Reproduction steps
wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.2-p1.tar.gz
tar xf RepeatMasker-4.1.2-p1.tar.gz
mv RepeatMasker RepeatMasker-4.1.2-p1 && cd RepeatMasker-4.1.2-p1
tar xf ${HOME}/RepBaseRepeatMaskerEdition-20181026.tar.gz
wget https://www.dfam.org/releases/Dfam_3.1/families/Dfam.embl.gz
gunzip -c Dfam.embl.gz > Libraries/Dfam.embl
module load rmblastn/2.10.0 \
phrap/1.090518 \
hmmer/3.2.1 \
trf/4.09
perl configure -rmblast_dir $(dirname $(command -v rmblastn)) \
-crossmatch_dir $(dirname $(command -v cross_match)) \
-hmmer_dir $(dirname $(command -v hmmconvert)) \
-trf_prgm $(command -v trf) \
-default_search_engine rmblast
Log output
-- Setting perl interpreter...
RepeatMasker Configuration Program
Checking for libraries...
Rebuilding RepeatMaskerLib.h5 master library
- Read in 49011 sequences from /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RMRBSeqs.embl
- Read in 49011 annotations from /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RMRBMeta.embl
Merging Dfam + RepBase into RepeatMaskerLib.h5 library..........................................
File: /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RepeatMaskerLib.h5
Database: Dfam withRBRM
Version: 3.3
Date: 2020-11-09
Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026
Total consensus sequences: 51780
Total HMMs: 6915
.
Building FASTA version of RepeatMasker.lib .......................
Building RMBlast frozen libraries..
The program is installed with a the following repeat libraries:
File: /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RepeatMaskerLib.h5
Database: Dfam withRBRM
Version: 3.3
Date: 2020-11-09
Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026
Total consensus sequences: 51780
Total HMMs: 6915
Further documentation on the program may be found here:
/opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/repeatmasker.help
However, RepeatMaskerLib.embl is missing from Libraries/:
ls Libraries/
Artefacts.embl RMRBSeqs.embl RepeatMasker.lib.nsq RepeatPeps.lib.pin
Dfam.embl RepeatAnnotationData.pm RepeatMasker.lib.ntf RepeatPeps.lib.pot
Dfam.h5 RepeatMasker.lib RepeatMasker.lib.nto RepeatPeps.lib.psq
README.RMRBSeqs RepeatMasker.lib.ndb RepeatMaskerLib.h5 RepeatPeps.lib.ptf
README.meta RepeatMasker.lib.nhr RepeatPeps.lib RepeatPeps.lib.pto
RMRB.embl RepeatMasker.lib.nin RepeatPeps.lib.pdb RepeatPeps.readme
RMRBMeta.embl RepeatMasker.lib.not RepeatPeps.lib.phr taxonomy.dat
and
./DateRepeats
Indicate directory with the RepeatMasker repeat libraries near line 135 of ./DateRepeats
There is no RepeatMaskerLib.embl, which DateRepeats requires.
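For anyone checking their own install, here is a minimal sketch that reports which master library formats are present after running configure. The two file names come from the listing and error above; the function name check_rm_libs is hypothetical and not part of RepeatMasker.

```shell
# Report which master library formats exist in a RepeatMasker Libraries directory.
# check_rm_libs is a hypothetical helper name, not something RepeatMasker ships.
check_rm_libs() {
    libdir=$1
    missing=0
    for f in RepeatMaskerLib.h5 RepeatMaskerLib.embl; do
        if [ -f "$libdir/$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Non-zero exit status if at least one of the two files is absent.
    return $missing
}
```

Called as check_rm_libs Libraries from the install directory, it should report RepeatMaskerLib.h5 as found and RepeatMaskerLib.embl as missing for the install described here, matching the listing above.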
Environment (please include as much of the following information as you can find out):
perl: 5.30.1
Python: version 3.8.1 (h5py 3.6.0)
rmblastn: version 2.10.0
phrap: version 1.090518
hmmer: version 3.2.1
trf: version 4.09
- How did you install RepeatMasker? Manual installation from repeatmasker.org, from the tar.gz archive.
- Which version of RepeatMasker do you have?
./RepeatMasker -v
RepeatMasker version 4.1.2-p1
- Operating system and version. The output of uname -a and lsb_release -a can be used to find this.
uname -a
Linux 1b305326d2fe 4.18.0-240.22.1.el8_3.x86_64 #1 SMP Thu Apr 8 19:01:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Additional context: version 4.1.0, previously installed, works as expected.
This is indeed a problem. DateRepeats is quite an old tool and may need some modifications in order to make it work with the new *.h5 database format. I will let you know if I can find a quick workaround.
DateRepeats 4.1.2 is also failing at the UCSC Genome Browser while building our hg38 patch 14. We use it to strip out the human-specific repeats.
I added the famdbfile setting to DateRepeats so that it no longer complains that the famdbfile path is not found:
my $tax = Taxonomy->new( taxonomyDataFile => $taxFile, famdbfile => "$dir/RepeatMaskerLib.h5");
However, it then ran for more than 27 hours, using CPU the whole time, until I killed it.
With RM version 4.1.0, all the small patch chromosomes finished in just about one minute each.
Please let me know if it would be helpful for me to supply the command line and input file for testing.
The hanging command is: DateRepeats chr5_MU273352v1_fix.txt -query human -comp 'mus musculus'
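Until this is fixed, a build script could cap the run time so that a hang like this fails fast instead of burning CPU for a day. This is a minimal sketch, assuming GNU coreutils timeout is available. The helper name run_with_limit and the 10m limit are illustrative choices, not part of RepeatMasker.

```shell
# Run a command under a wall-clock limit using GNU coreutils `timeout`.
# run_with_limit is a hypothetical helper name.
run_with_limit() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    # GNU timeout exits with status 124 when the limit is reached.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after $limit: $*" >&2
    fi
    return "$rc"
}
```

For example: run_with_limit 10m DateRepeats chr5_MU273352v1_fix.txt -query human -comp 'mus musculus'. A limit of a few minutes leaves plenty of headroom over the roughly one minute per patch chromosome seen with 4.1.0.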
Thanks Galt. I removed DateRepeats in the latest version (4.1.4) as it needs refactoring. I will make sure this is a high priority for the next release.