RepeatMaskerLib.embl not built (DateRepeats)

Open EricDeveaud opened this issue 4 years ago • 4 comments

Describe the issue

RepeatMaskerLib.embl is not built when configuring RepeatMasker-4.1.2-p1, but it is required by DateRepeats.

rpm_maker:RepeatMasker/RepeatMasker-4.1.2-p1 > DateRepeats 
Indicate directory with the RepeatMasker repeat libraries near line 136 of /opt/gensoft/exe/RepeatMasker/4.1.2-p1/bin/DateRepeats

Reproduction steps

wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.2-p1.tar.gz
tar xf RepeatMasker-4.1.2-p1.tar.gz
mv RepeatMasker RepeatMasker-4.1.2-p1 && cd RepeatMasker-4.1.2-p1
tar xf ${HOME}/RepBaseRepeatMaskerEdition-20181026.tar.gz
wget https://www.dfam.org/releases/Dfam_3.1/families/Dfam.embl.gz
gunzip  -c Dfam.embl.gz > Libraries/Dfam.embl
module load rmblastn/2.10.0 \
            phrap/1.090518 \
            hmmer/3.2.1 \
            trf/4.09
perl configure -rmblast_dir $(dirname $(command -v rmblastn)) \
               -crossmatch_dir $(dirname $(command -v  cross_match)) \
               -hmmer_dir $(dirname $(command -v hmmconvert)) \
               -trf_prgm $(command -v trf) \
               -default_search_engine rmblast

Log output

 -- Setting perl interpreter...
RepeatMasker Configuration Program


Checking for libraries...

Rebuilding RepeatMaskerLib.h5 master library
  - Read in 49011 sequences from /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RMRBSeqs.embl
  - Read in 49011 annotations from /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RMRBMeta.embl
  Merging Dfam + RepBase into RepeatMaskerLib.h5 library..........................................

File: /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RepeatMaskerLib.h5
Database: Dfam withRBRM
Version: 3.3
Date: 2020-11-09

Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026

Total consensus sequences: 51780
Total HMMs: 6915

.
Building FASTA version of RepeatMasker.lib .......................
Building RMBlast frozen libraries..
The program is installed with a the following repeat libraries:
File: /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RepeatMaskerLib.h5
Database: Dfam withRBRM
Version: 3.3
Date: 2020-11-09

Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026

Total consensus sequences: 51780
Total HMMs: 6915


Further documentation on the program may be found here:
  /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/repeatmasker.help

BUT !

ls Libraries/
Artefacts.embl   RMRBSeqs.embl            RepeatMasker.lib.nsq  RepeatPeps.lib.pin
Dfam.embl        RepeatAnnotationData.pm  RepeatMasker.lib.ntf  RepeatPeps.lib.pot
Dfam.h5          RepeatMasker.lib         RepeatMasker.lib.nto  RepeatPeps.lib.psq
README.RMRBSeqs  RepeatMasker.lib.ndb     RepeatMaskerLib.h5    RepeatPeps.lib.ptf
README.meta      RepeatMasker.lib.nhr     RepeatPeps.lib        RepeatPeps.lib.pto
RMRB.embl        RepeatMasker.lib.nin     RepeatPeps.lib.pdb    RepeatPeps.readme
RMRBMeta.embl    RepeatMasker.lib.not     RepeatPeps.lib.phr    taxonomy.dat

and

./DateRepeats
Indicate directory with the RepeatMasker repeat libraries near line 135 of ./DateRepeats

There is no RepeatMaskerLib.embl in Libraries/, and DateRepeats requires it.
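For anyone who wants to confirm the same state on their own install, here is a minimal sketch of the check done by eye with ls above. It assumes only the two file names mentioned in this report (RepeatMaskerLib.h5 and RepeatMaskerLib.embl); check_rm_libs is a hypothetical helper name, not something shipped with RepeatMasker.

```shell
# check_rm_libs DIR: report which RepeatMasker master library files exist in DIR.
# Hypothetical helper; the file names are the ones discussed in this issue.
check_rm_libs() {
    dir=$1
    missing=0
    for f in RepeatMaskerLib.h5 RepeatMaskerLib.embl; do
        if [ -f "$dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Non-zero exit status when at least one file is absent.
    return "$missing"
}
```

Run from the RepeatMasker directory as `check_rm_libs Libraries`. With the listing above it should report the .h5 file as found and the .embl file as missing.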

Environment (please include as much of the following information as you can find out):

perl: 5.30.1
Python: version 3.8.1 (h5py 3.6.0)
rmblastn: version 2.10.0
phrap: version 1.090518
hmmer: version 3.2.1
trf: version 4.09
  • How did you install RepeatMasker? manual installation from repeatmasker.org from tar.gz archive

  • Which version of RepeatMasker do you have?

./RepeatMasker -v        
RepeatMasker version 4.1.2-p1
  • Operating system and version. The output of uname -a and lsb_release -a can be used to find this.
 uname -a
Linux 1b305326d2fe 4.18.0-240.22.1.el8_3.x86_64 #1 SMP Thu Apr 8 19:01:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Additional context

Version 4.1.0, previously installed, works as expected.

EricDeveaud, Jan 31 '22 14:01

This is indeed a problem. DateRepeats is quite an old tool and may need some modifications in order to make it work with the new *.h5 database format. I will let you know if I can find a quick workaround.

rmhubley, Jul 29 '22 19:07

DateRepeats 4.1.2 is also failing at the UCSC Genome Browser while building our hg38 patch 14. We use it to strip out the human-specific repeats.

I added the famdbfile setting to DateRepeats so that it no longer complains that the famdbfile path is not found:

my $tax = Taxonomy->new( taxonomyDataFile => $taxFile, famdbfile => "$dir/RepeatMaskerLib.h5");

However, it then ran for more than 27 hours, using CPU the whole time, until I killed it.

With RM version 4.1.0, all the small patch chromosomes finished in just about one minute each.

Please let me know if it would be helpful for me to supply the command line and input file for testing.

galt, Oct 12 '22 22:10

Hanging command is: DateRepeats chr5_MU273352v1_fix.txt -query human -comp 'mus musculus'

chr5_MU273352v1_fix.txt
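Since the run never finishes on its own, one way to keep it from tying up a build is to put a wall-clock limit on it. This is only a sketch: it assumes GNU coreutils timeout is available (it exits with status 124 when the limit is hit), and run_with_limit is a hypothetical wrapper name, not part of RepeatMasker.

```shell
# run_with_limit SECONDS CMD [ARGS...]: run CMD, stopping it after SECONDS.
# Hypothetical wrapper around timeout(1) from GNU coreutils.
run_with_limit() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # 124 is timeout's own exit status for "limit reached".
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

For example, `run_with_limit 600 DateRepeats chr5_MU273352v1_fix.txt -query human -comp 'mus musculus'` would give up after ten minutes. The 600-second figure is an arbitrary choice, picked only because 4.1.0 finished each small patch chromosome in about a minute.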

galt, Oct 12 '22 23:10

Thanks Galt. I removed DateRepeats in the latest version (4.1.4) as it needs refactoring. I will make sure this is a high priority for the next release.

rmhubley, Nov 09 '22 17:11