RepeatMasker running in Docker container fails on MacOS but not on Linux

Open reslp opened this issue 5 years ago • 6 comments

Hi,

I came across a weird problem with RepeatMasker running inside a Docker container. I have been using the https://github.com/Dfam-consortium/TETools container, but the problem also occurs in my own containers.

The problem occurs only when the container is run on MacOS and not on Linux (Ubuntu 18.04). On both systems I use the same container and run the same command to start RepeatMasker:

$ docker run -it --rm -v $(pwd):/data dfam/tetools:1.2
(dfam-tetools /)# RepeatMasker Agyrium_rufum_sorted.fas

On a MacOS host RepeatMasker fails with this output:

RepeatMasker version 4.1.1
Search Engine: NCBI/RMBLAST [ 2.10.0+ ]

Using Master RepeatMasker Database: /opt/RepeatMasker/Libraries/RepeatMaskerLib.h5
  Title    : Dfam
  Version  : 3.2
  Date     : 2020-07-02
  Families : 6,953

Species/Taxa Search:
  Homo sapiens [NCBI Taxonomy ID: 9606]
  Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa;
           Eumetazoa;Bilateria;Deuterostomia;Chordata;
           Craniata <chordates>;Vertebrata <vertebrates>;
           Gnathostomata <vertebrates>;Teleostomi;Euteleostomi;
           Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;
           Mammalia;Theria <mammals>;Eutheria;Boreoeutheria;
           Euarchontoglires;Primates;Haplorrhini;Simiiformes
  1337 families in ancestor taxa; 8 lineage-specific families

Building general libraries in: /opt/RepeatMasker/Libraries/CONS-Dfam_3.2/general
Building species libraries in: /opt/RepeatMasker/Libraries/CONS-Dfam_3.2/homo_sapiens

analyzing file Agyrium_rufum_sorted.fas
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 5
identifying full-length ALUs in batch 1 of 5
identifying full-length interspersed repeats in batch 1 of 5
identifying remaining ALUs in batch 1 of 5
identifying most interspersed repeats in batch 1 of 5
identifying long interspersed repeats in batch 1 of 5
identifying ancient repeats in batch 1 of 5
identifying retrovirus-like sequences in batch 1 of 5
identifying Simple Repeats in batch 1 of 5
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!

Checking for E. coli insertion elements
WARNING: The search engine returned an error (3, status = 3 )
Engine parameters: /opt/rmblast/bin/rmblastn  -num_alignments 9999999 -db /opt/RepeatMasker/Libraries/CONS-Dfam_3.2/general/is.lib -query /data/RM_11.FriNov60803462020/Agyrium_rufum_sorted.fas_batch-2.masked -gapopen 12 -gapextend 2 -complexity_adjust  -word_size 15 -xdrop_ungap 34 -xdrop_gap_final 17 -xdrop_gap 8  -min_raw_gapped_score 17 -dust no  -num_threads 4  -matrix identity.matrix
A search phase could not complete on this batch.
The batch file will be re-run and if possible the
program will resume.
WARNING: Retrying batch ( 2 ) [ 255,, 18]...
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!

Checking for E. coli insertion elements
WARNING: The search engine returned an error (3, status = 3 )
Engine parameters: /opt/rmblast/bin/rmblastn  -num_alignments 9999999 -db /opt/RepeatMasker/Libraries/CONS-Dfam_3.2/general/is.lib -query /data/RM_11.FriNov60803462020/Agyrium_rufum_sorted.fas_batch-2.masked -gapopen 12 -gapextend 2 -complexity_adjust  -word_size 15 -xdrop_ungap 34 -xdrop_gap_final 17 -xdrop_gap 8  -min_raw_gapped_score 17 -dust no  -num_threads 4  -matrix identity.matrix
A search phase could not complete on this batch.
The batch file will be re-run and if possible the
program will resume.
WARNING: Retrying batch ( 2 ) [ 255,, 18]...
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!

Checking for E. coli insertion elements
WARNING: The search engine returned an error (3, status = 3 )
Engine parameters: /opt/rmblast/bin/rmblastn  -num_alignments 9999999 -db /opt/RepeatMasker/Libraries/CONS-Dfam_3.2/general/is.lib -query /data/RM_11.FriNov60803462020/Agyrium_rufum_sorted.fas_batch-2.masked -gapopen 12 -gapextend 2 -complexity_adjust  -word_size 15 -xdrop_ungap 34 -xdrop_gap_final 17 -xdrop_gap 8  -min_raw_gapped_score 17 -dust no  -num_threads 4  -matrix identity.matrix
A search phase could not complete on this batch.
The batch file will be re-run and if possible the
program will resume.


FATAL ERROR: RepeatMasker giving up. One or more
batches failed!  Unfortunately this type of error
cannot be recovered from. Please submit the following
details to the feedback page at the repeatmasker
website:

       http://www.repeatmasker.org

RepeatMasker Version: 4.1.1
Library Version: CONS-Dfam_3.2
Search Engine: ncbi [ 2.10.0+ ]
Command Line: /opt/RepeatMasker/RepeatMaskerAgyrium_rufum_sorted.fas
Batch Number: 2
Disk Space:
Filesystem      1K-blocks      Used  Available Use% Mounted on
grpcfuse       1761719272 663393948 1085442344  38% /data

System Memory:
MemTotal:        8156960 kB
MemFree:          918436 kB
MemAvailable:    7567300 kB
Cached:          6145984 kB
SwapCached:           16 kB
SwapTotal:       4194300 kB
SwapFree:        4193264 kB
Further details about this problem may be found in
the directory: /data/RM_11.FriNov60803462020

On Linux RepeatMasker appears to run normally. Here is part of the output on Linux:

RepeatMasker version 4.1.1
Search Engine: NCBI/RMBLAST [ 2.10.0+ ]

Using Master RepeatMasker Database: /opt/RepeatMasker/Libraries/RepeatMaskerLib.h5
  Title    : Dfam
  Version  : 3.2
  Date     : 2020-07-02
  Families : 6,953

Species/Taxa Search:
  Homo sapiens [NCBI Taxonomy ID: 9606]
  Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa;
           Eumetazoa;Bilateria;Deuterostomia;Chordata;
           Craniata <chordates>;Vertebrata <vertebrates>;
           Gnathostomata <vertebrates>;Teleostomi;Euteleostomi;
           Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;
           Mammalia;Theria <mammals>;Eutheria;Boreoeutheria;
           Euarchontoglires;Primates;Haplorrhini;Simiiformes
  1337 families in ancestor taxa; 8 lineage-specific families

Building general libraries in: /opt/RepeatMasker/Libraries/CONS-Dfam_3.2/general
Building species libraries in: /opt/RepeatMasker/Libraries/CONS-Dfam_3.2/homo_sapiens

analyzing file Agyrium_rufum_sorted.fas

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 558
identifying full-length ALUs in batch 1 of 558
identifying full-length interspersed repeats in batch 1 of 558
identifying remaining ALUs in batch 1 of 558
identifying most interspersed repeats in batch 1 of 558
identifying long interspersed repeats in batch 1 of 558
identifying ancient repeats in batch 1 of 558
identifying retrovirus-like sequences in batch 1 of 558
identifying Simple Repeats in batch 1 of 558

Checking for E. coli insertion elements
identifying Simple Repeats in batch 2 of 558
identifying full-length ALUs in batch 2 of 558
identifying full-length interspersed repeats in batch 2 of 558
identifying remaining ALUs in batch 2 of 558
identifying most interspersed repeats in batch 2 of 558
identifying long interspersed repeats in batch 2 of 558
identifying ancient repeats in batch 2 of 558
identifying retrovirus-like sequences in batch 2 of 558
identifying Simple Repeats in batch 2 of 558

I am a bit confused about what is happening here; as far as I am aware, there should be no differences when running this on different systems. I would be grateful for any hints on how to resolve this problem or information about what might be going on.

Many thanks!

best,

Philipp

reslp, Nov 06 '20 08:11

Something looks very wrong around here:

analyzing file Agyrium_rufum_sorted.fas
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )! Check data before proceeding!

This error message is supposed to indicate an invalid nucleotide or IUB code (one of ACGTBDHVRYKMSWNX), but those all look fine.

Since you're running the same container on two hosts, a difference in versions or operating-system-specific code seems unlikely to me, unless there is a very low-level bug here. Can you provide the output of locale and of printenv in both places where you run the RepeatMasker command (with usernames or sensitive paths redacted if necessary)?

jebrosen, Nov 06 '20 22:11

Hi,

thank you for your quick reply! There is no difference in the output of the two commands on the different hosts. I should have mentioned that already in my original post. Here is the output from locale and printenv inside the container on a MacOS host:

(dfam-tetools /)# locale
LANG=C
LANGUAGE=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
(dfam-tetools /)# printenv
LANG=C
HOSTNAME=1a4e3ef436ad
PYTHONIOENCODING=utf8
PWD=/
HOME=/root
TERM=xterm
SHLVL=1
PATH=/opt/RepeatMasker:/opt/RepeatMasker/util:/opt/RepeatModeler:/opt/RepeatModeler/util:/opt/coseg:/opt:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
_=/usr/bin/printenv

This is the output on Linux:

(dfam-tetools /)# locale
LANG=C
LANGUAGE=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
(dfam-tetools /)# printenv
LANG=C
HOSTNAME=60bcf7f5f726
PYTHONIOENCODING=utf8
PWD=/
HOME=/root
TERM=xterm
SHLVL=1
PATH=/opt/RepeatMasker:/opt/RepeatMasker/util:/opt/RepeatModeler:/opt/RepeatModeler/util:/opt/coseg:/opt:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
_=/usr/bin/printenv

best,

Philipp

reslp, Nov 07 '20 13:11

Oh, of course the environment is the same as well -- I forgot that Docker isolates that. I was thinking of Singularity, which preserves a lot more by default.

Well, ouch. This will be pretty hard for me to reproduce or debug since I don't currently have easy access to a macOS system. That said, I will try to minimize to only the code that's failing, so that running all of RepeatMasker isn't necessary to test this issue. It's very odd to me that such a simple-looking regex match in the code is going wrong.

Alternatively, maybe there is something interesting about this specific input file:

  • Does the error show up with other files, or only this one?
  • Is the file publicly available (or could it be shared publicly or with us by email)?
    • If it is not, could you redirect RepeatMasker's output and error to a file, and then attach the file with the error message? Specifically, I'm curious if there are any nonprinting characters somewhere in the log line ( TGGAACGGCAACGAGATGGATGGCAACAGGTGTCTGAACGGTC )!, which might have been lost between terminal/copy+paste/comment.
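One way to check for nonprinting characters without relying on the terminal or copy+paste is to look at the raw bytes of the FASTA file or of the redirected log. The following is a minimal Python sketch, not part of RepeatMasker; the function name find_suspicious_bytes is made up for illustration. It assumes the file should contain only printable ASCII plus LF line endings, so it also flags carriage returns and tabs.

```python
import sys


def find_suspicious_bytes(data: bytes):
    """Return (line_number, column, byte_value) for every byte that is
    not printable ASCII. Lines are split on LF, so a CR shows up as a hit."""
    allowed = set(range(0x20, 0x7F))  # space through tilde
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, value in enumerate(line, start=1):
            if value not in allowed:
                hits.append((line_no, col, value))
    return hits


if __name__ == "__main__":
    # Read each file named on the command line in binary mode,
    # so no decoding step can hide or alter the bytes.
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for line_no, col, value in find_suspicious_bytes(handle.read()):
                print(f"{path}:{line_no}:{col}: byte 0x{value:02x}")
```

Run inside the container against the same path that RepeatMasker reads (for example a file under /data), an empty result would suggest that the bytes themselves are clean as seen from there.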

jebrosen, Nov 09 '20 19:11

Hi,

This is getting more and more mysterious. I just tried it with several files and they all failed on my Mac. Then I downloaded an Aspergillus genome from NCBI and it worked. I can't find any difference between the files except one: in the NCBI file, sequence lines are wrapped at 80 bp, and in my files at 60 bp. But that should not make a difference. This is the genome which ran successfully: https://www.ncbi.nlm.nih.gov/genome/12515

The file I mentioned in my original post is from an unpublished genome which I can't share, but I also have several published genomes which have been preprocessed the same way and fail with the same error. The files (including the one from NCBI that does work) have the same Unix line feeds and UTF-8 encoding. I am happy to send you one. The thing is, I copied the exact same files to my Linux box, and there they work without problems.
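To compare a file as it is actually seen from inside the container on each host, a checksum plus a short summary of line endings and line widths can be recorded in both places. The macOS log above reports /data as a grpcfuse mount, so the container reads the file through a different layer there than on Linux; that observation alone does not show that the mount is the cause. This is a minimal Python sketch with a hypothetical helper name (file_fingerprint), assuming the file is small enough to scan line by line.

```python
import hashlib
from collections import Counter


def file_fingerprint(path):
    """Summarize a FASTA file from its raw bytes: SHA-256 digest, size,
    number of CRLF-terminated lines, and the most common sequence-line width."""
    digest = hashlib.sha256()
    size = 0
    crlf_lines = 0
    widths = Counter()
    with open(path, "rb") as handle:
        for line in handle:  # binary mode splits on b"\n" only
            digest.update(line)
            size += len(line)
            if line.endswith(b"\r\n"):
                crlf_lines += 1
            if not line.startswith(b">"):
                # Width of a sequence line without its line ending.
                widths[len(line.rstrip(b"\r\n"))] += 1
    return {
        "sha256": digest.hexdigest(),
        "bytes": size,
        "crlf_lines": crlf_lines,
        "most_common_width": widths.most_common(1),
    }
```

If the digest and size agree between the macOS and Linux runs, the file content reaching the container is the same on both hosts, and the difference would have to lie in how it is read or processed afterwards.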

Here is the output of a file containing only 1 scaffold of a publicly available genome that also fails. Maybe this would be a good test candidate:

(dfam-tetools /data/assemblies)# RepeatMasker Tri_scaffold1.fas
RepeatMasker version 4.1.1
Search Engine: NCBI/RMBLAST [ 2.10.0+ ]

Using Master RepeatMasker Database: /opt/RepeatMasker/Libraries/RepeatMaskerLib.h5
  Title    : Dfam
  Version  : 3.2
  Date     : 2020-07-02
  Families : 6,953

Species/Taxa Search:
  Homo sapiens [NCBI Taxonomy ID: 9606]
  Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa;
           Eumetazoa;Bilateria;Deuterostomia;Chordata;
           Craniata <chordates>;Vertebrata <vertebrates>;
           Gnathostomata <vertebrates>;Teleostomi;Euteleostomi;
           Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;
           Mammalia;Theria <mammals>;Eutheria;Boreoeutheria;
           Euarchontoglires;Primates;Haplorrhini;Simiiformes
  1337 families in ancestor taxa; 8 lineage-specific families


analyzing file Tri_scaffold1.fas
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( AGTCCATTGTGATTGTTTTGCTCCGC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( AGTCCATTGTGATTGTTTTGCTCCGC )! Check data before proceeding!
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( AGTCCATTGTGATTGTTTTGCTCCGC )! Check data before proceeding!

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 2
identifying full-length ALUs in batch 1 of 2
identifying full-length interspersed repeats in batch 1 of 2
identifying remaining ALUs in batch 1 of 2
identifying most interspersed repeats in batch 1 of 2
identifying long interspersed repeats in batch 1 of 2
identifying ancient repeats in batch 1 of 2
identifying retrovirus-like sequences in batch 1 of 2
identifying Simple Repeats in batch 1 of 2
FastaDB::_getFastaRecords: Error could not interpret fasta line correctly ( AGTCCATTGTGATTGTTTTGCTCCGC )! Check data before proceeding!

Checking for E. coli insertion elements
identifying Simple Repeats in batch 2 of 2
identifying full-length ALUs in batch 2 of 2
identifying full-length interspersed repeats in batch 2 of 2
identifying remaining ALUs in batch 2 of 2
identifying most interspersed repeats in batch 2 of 2
identifying long interspersed repeats in batch 2 of 2
identifying ancient repeats in batch 2 of 2
identifying retrovirus-like sequences in batch 2 of 2
identifying Simple Repeats in batch 2 of 2
processing output:
cycle 1
cycle 2
cycle 3
cycle 4
cycle 5
cycle 6
cycle 7
cycle 8
cycle 9
cycle 10
Generating output...
FastaDB::_indexOnly - WARNING: This line doesn't appear to be in a format I recognize ( AGTCCATTGTGATTGTTTTGCTCCGC
 ). Sequence is being ignored.
Illegal division by zero at /opt/RepeatMasker/ProcessRepeats line 9256.

best,

Philipp

reslp, Nov 10 '20 15:11

If you would be able to send one or two failing files to me ([email protected]) or attach them here, that would be super helpful as a test case.

Here is a short Perl script that contains the same regexes as FastaDB. It prints only the "bad" lines found in any files named on the command line. Hopefully this code flags the same "bad" files/lines as the real FastaDB.

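#!/usr/bin/env perl
# Print every line that matches neither the FASTA header pattern
# nor the nucleotide/IUB sequence pattern.
use strict;
use warnings;

while (<>) {
  s/[\n\r]//g;        # strip line endings (LF and CR)
  next unless /\S/;   # skip blank lines

  if ( /^\s*\>\s*(\S+)\s*(.*)/ ) {
    # header line (">id description"): ok
  }
  elsif ( /^([ACGTBDHVRYKMSWNXacgtbdhvrykmswnx]+)$/ ) {
    # sequence line made up only of valid nucleotide/IUB codes: ok
  }
  else {
    print "$_\n";     # anything else is a "bad" line
  }
}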
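For cross-checking, the same two patterns can be applied to the raw bytes in Python, reporting the line number of each rejected line. This is a port of the script above for illustration, not code from FastaDB; the name bad_lines is made up, and the port assumes that matching on bytes behaves like the Perl version for ASCII input.

```python
import re
import sys

# Same patterns as the Perl script, compiled for bytes.
HEADER = re.compile(rb"^\s*>\s*(\S+)\s*(.*)")
SEQUENCE = re.compile(rb"[ACGTBDHVRYKMSWNXacgtbdhvrykmswnx]+")


def bad_lines(data: bytes):
    """Return (line_number, line) for lines that are neither a FASTA
    header nor a run of valid nucleotide/IUB codes."""
    bad = []
    for line_no, raw in enumerate(data.split(b"\n"), start=1):
        line = raw.replace(b"\r", b"")  # mirrors s/[\n\r]//g
        if not line.strip():            # mirrors: next unless /\S/
            continue
        if HEADER.match(line) or SEQUENCE.fullmatch(line):
            continue
        bad.append((line_no, line))
    return bad


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for line_no, line in bad_lines(handle.read()):
                # Printing the bytes repr makes hidden characters visible.
                print(f"{path}:{line_no}: {line!r}")
```

Printing the bytes repr rather than the decoded text means any hidden byte in a rejected line appears as an escape sequence instead of being lost in the terminal.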

jebrosen, Nov 10 '20 23:11

Thank you. I just tried your script and it did not produce any output, so it does not seem to find any problematic lines. I uploaded the problematic file I tested: Tri_scaffold1.log. I renamed it to .log because GitHub does not let me upload .fas files.

reslp, Nov 11 '20 09:11