Easy-taxonomy and floating point exception

Open genomewalker opened this issue 4 years ago • 1 comments

I am seeing strange behavior when running easy-taxonomy on a query that contains only one contig. Here you can find the contig and the log.

I don't think it is related to https://github.com/soedinglab/MMseqs2/issues/31 or https://github.com/soedinglab/MMseqs2/issues/447

The DB seems fine. I processed hundreds of samples with the same MMseqs2 command and only had problems with files containing a single contig. The samples are difficult to assemble, and sometimes I can recover only one contig.

Current Behavior

When running

mmseqs easy-taxonomy /vol/cloud/geogenetics/KapK/results/assembly-refined/477fb4bafa.assm.refined.fasta /vol/cloud/geogenetics/DBs/tax/GTDB /vol/cloud/geogenetics/KapK/results/contig-taxonomy/477fb4bafa.GTDB /dev/shm/tmp/contig-taxonomy/477fb4bafa.tax.GTDB --tax-lineage 2 --majority 0.8 --vote-mode 1 --lca-mode 3 --orf-filter 1 --lca-ranks superkingdom,phylum,class,order,family,genus --threads 32 >> /vol/cloud/geogenetics/KapK/results/logs/contig-taxonomy/477fb4bafa.contig-taxonomy.GTDB.log

it produces:

prefilter /dev/shm/tmp/contig-taxonomy/477fb4bafa.tax.GTDB/4763407151393146292/taxonomy_tmp/18161437552067976221/orfs_filter /vol/cloud/geogenetics/DBs/tax/GTDB /dev/shm/tmp/contig-taxonomy/477fb4bafa.tax.GTDB/4763407151393146292/taxonomy_tmp/18161437552067976221/tmp_taxonomy/11319500873502674595/tmp_hsp1/8343158458908834442/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 32 --compressed 0 -v 3 -s 2.0

Query database size: 0 type: Aminoacid
Target split mode. Searching through 4 splits
Estimated memory consumption: 149G
Target database size: 152631149 type: Aminoacid
Process prefiltering step 1 of 4

Index table k-mer threshold: 163 at k-mer size 7
Index table: counting k-mers
[=================================================================] 38.15M 2m 48s 107ms
Index table: Masked residues: 98253761
Index table: fill
[=================================================================Floating point exception (core dumped)
Error: Prefilter died
Error: First search died
Error: taxonomy died
Error: Search died

Your Environment


Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 24f6b52a38cd8cf66d10ce00bf37dc815fef986e
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): Self-compiled
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: GCC 7.5.0, cmake 3.10.2
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): AVX2/SSE supported, 512G
Operating system and version: Ubuntu 18.04

Jul 15 '21 08:07 genomewalker

The filtering step removes all 15 fragments extracted from this contig and passes an empty database to the normal prefiltering step, which matches the "Query database size: 0" line in the log.

We need to handle empty input more carefully. Fixing this will require a bit of refactoring.

Jul 15 '21 21:07 milot-mirdita
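
For readers wondering how an empty database can end in "Floating point exception": on Linux/x86 that message is most often reported for an integer division by zero (SIGFPE), although the thread does not say which operation inside the prefilter actually traps. The sketch below is not MMseqs2 code. `SequenceDb`, `averageLength`, and `runSearchStage` are hypothetical names used only to illustrate the kind of early-return guard that "handling empty input" usually means.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sequence database; not an MMseqs2 type.
struct SequenceDb {
    std::vector<std::size_t> lengths;  // residue count of each entry
};

enum class StageResult { Searched, SkippedEmptyQuery };

// Integer average of the entry lengths.
// Precondition: the database is not empty. With zero entries the
// division below would be an integer division by zero.
std::size_t averageLength(const SequenceDb& db) {
    assert(!db.lengths.empty());
    std::size_t total = 0;
    for (std::size_t len : db.lengths) {
        total += len;
    }
    return total / db.lengths.size();
}

// Hypothetical search stage: return early when the filter left nothing
// to search, instead of letting size-dependent arithmetic run on zero.
StageResult runSearchStage(const SequenceDb& query, std::size_t& avgLenOut) {
    if (query.lengths.empty()) {
        return StageResult::SkippedEmptyQuery;  // avgLenOut is left untouched
    }
    avgLenOut = averageLength(query);
    return StageResult::Searched;
}
```

With a guard like this, a caller can treat `SkippedEmptyQuery` as "no hits" and write empty result files rather than aborting the whole workflow. Whether the eventual MMseqs2 fix takes this shape is an assumption.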