LCA fails with segmentation fault
Expected Behavior
Taxonomy assignment of viral OTU sequences (nucleotide) using the 2bLCA method against a custom-formatted amino acid database from IMG/VR
Current Behavior
The LCA step dies with a segmentation fault on a small test dataset that I have previously run successfully against Antônio Camargo's ICTV MMseqs2 protein database (https://github.com/apcamargo/ictv-mmseqs2-protein-database).
For reference, I allocated 40 cores and 700 GB of RAM to this job; it fails after consuming only 178 GB of memory.
Steps to Reproduce (for bugs)
Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
I formatted the IMG/VR v4 7.1 amino acid database as recommended (https://github.com/soedinglab/MMseqs2/wiki#create-a-seqtaxdb-by-manual-annotation-of-a-sequence-database) and created a custom taxdump using taxonkit (a sketch of that step follows the createtaxdb log below). The custom taxdb was created without issue:
mmseqs createdb --dbtype 1 IMGVR_all_proteins-high_confidence.faa.gz IMG_tax_db/IMG_tax_db
createdb --dbtype 1 IMGVR_all_proteins-high_confidence.faa.gz IMG_tax_db/IMG_tax_db
MMseqs Version: 14.7e284
Database type 1
Shuffle input database true
Createdb mode 0
Write lookup file 1
Offset of numeric ids 0
Compressed 0
Verbosity 3
Converting sequences
[112567430] 8m 8s 166ms
Time for merging to IMG_tax_db_h: 0h 0m 39s 840ms
Time for merging to IMG_tax_db: 0h 1m 54s 537ms
Database type: Aminoacid
Time for processing: 0h 14m 27s 634ms
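As a sanity check (a sketch; the row shown is a placeholder, not real output), the .lookup file written by createdb shows which sequence identifiers the taxid mapping has to match:
# column 2 of the .lookup file holds the FASTA accession; it must match
# column 1 of the file passed later via --tax-mapping-file
head -n 1 IMG_tax_db/IMG_tax_db.lookup
# expected shape (placeholder row): 0<TAB>IMGVR_UViG_...<TAB>0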
#integrate all into a complete mmseqs2 taxdb
mmseqs createtaxdb IMG_tax_db/IMG_tax_db /home/bbrow6/tmp --ncbi-tax-dump IMG_taxdump --tax-mapping-file UVIG_taxid_mapping_cleaned
createtaxdb IMG_tax_db/IMG_tax_db /home/bbrow6/tmp --ncbi-tax-dump IMG_taxdump --tax-mapping-file UVIG_taxid_mapping_cleaned
MMseqs Version: 14.7e284
NCBI tax dump directory IMG_taxdump
Taxonomy mapping file UVIG_taxid_mapping_cleaned
Taxonomy mapping mode 0
Taxonomy db mode 1
Threads 28
Verbosity 3
Loading nodes file ... Done, got 6986 nodes
Loading merged file ... Done, added 0 merged nodes.
Loading names file ... Done
Init RMQ ...Done
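For completeness, here is roughly how the custom taxdump and mapping file were put together. This is a sketch rather than the exact commands I ran: UVIG_lineages.tsv is an illustrative file name, and the create-taxdump flags should be checked against the installed taxonkit version. As far as I understand, taxonkit create-taxdump generates its own (often very large) taxids by hashing node names.
# lineage table: one row per UViG, accession in column 1, then one
# tab-separated column per taxonomic rank (file name is hypothetical)
taxonkit create-taxdump -A 1 --out-dir IMG_taxdump UVIG_lineages.tsv
# the mapping file handed to createtaxdb is two tab-separated columns:
# <sequence accession><TAB><taxid>
head -n 1 UVIG_taxid_mapping_cleaned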
The job was submitted with the following batch script, including params:
#PBS -M [email protected]
#PBS -m a
#PBS -l mem=700gb
#PBS -l nodes=1:ppn=40
#PBS -P a675a67f-9204-4f66-9785-891b95c7d3da
#PBS -q paidq
#PBS -o /home/bbrow6/script_output/job-mmseqs_easytax_050523.out
#PBS -e /home/bbrow6/script_error/job-mmseqs_easytax_050523.err
cd /home/bbrow6/taxonomy_stuffs
export DBs=/home/bbrow6/JGI/IMG_VR_2022_12_19_7.1/IMG_tax_db
export OTU_dir=/home/bbrow6/vaginal_virome/Run_021723/identified_viral_sequences/OTUs/geNomad/genomad_output_1000bps/clustered_spades_cross_assembly_contigs_gt1000bps_summary/
source activate mmseqs2
module load OpenMPI
mmseqs easy-taxonomy $OTU_dir/clustered_spades_cross_assembly_contigs_gt1000bps_virus.fna $DBs/IMG_tax_db vag_taxonomy_results_IMG tmp -e 1e-5 -s 6 --blacklist "" --tax-lineage 1
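One variation I have not yet tried is capping the prefilter memory so the target database gets split instead of loaded whole. An untested sketch of that call (the 600G value is a guess, sized below the node's 700 GB allocation):
# untested: --split-memory-limit makes MMseqs2 process the target DB in
# splits rather than loading it all at once
mmseqs easy-taxonomy $OTU_dir/clustered_spades_cross_assembly_contigs_gt1000bps_virus.fna \
    $DBs/IMG_tax_db vag_taxonomy_results_IMG tmp \
    -e 1e-5 -s 6 --blacklist "" --tax-lineage 1 --split-memory-limit 600G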
MMseqs Output (for bugs)
Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.
Full output and error attached below
tmp/10336174962539687461/taxonomy_tmp/11653652317365833767/tmp_taxonomy/6923600097584969791/taxonomy.sh: line 58: 78000 Segmentation fault (core dumped) "$MMSEQS" lca "${TARGET}" "${LCAIN}" "${RESULTS}" ${LCA_PAR}
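To confirm the crash is isolated to the lca module, the step can be re-invoked by hand on the intermediate databases left in tmp. The placeholders below stand in for the expanded ${LCAIN} and ${RESULTS} variables from taxonomy.sh; I have not listed the actual file names:
# <lca_input_db> is the intermediate result DB the workflow feeds to lca;
# <lca_result_db> is a fresh output prefix
mmseqs lca $DBs/IMG_tax_db <lca_input_db> <lca_result_db>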
Context
Providing context helps us come up with a solution and improve our documentation for the future.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
- Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 14.7e284
- Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): bioconda
- For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory): 40 cpus, 700gb RAM
- Operating system and version:
- Operating System: CentOS Linux 7 (Core)
- Kernel: Linux 3.10.0-1160.62.1.el7.x86_64
Attachments: job-mmseqs_easytax_050523_error.txt, job-mmseqs_easytax_050523_out.txt
How large is the database you created? Would it be possible to share it?
What does your tax mapping file (UVIG_taxid_mapping_cleaned) look like? It seems to contain some very large taxid values (1446979566). Maybe I didn't correctly account for taxids being that large.
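Could you also check the largest taxid in the mapping? Assuming the file is two tab-separated columns with the taxid in column 2, something like this would show whether any value exceeds the signed 32-bit range (2147483647):
# prints the maximum taxid found in the mapping file
awk -F'\t' '$2 > max { max = $2 } END { print max }' UVIG_taxid_mapping_cleaned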
I got the same error as itsmisterbrown: the LCA step dies due to a segmentation fault. Here is my command line, and I have also attached my log and error files: out.txt err.txt
mmseqs easy-taxonomy \
test.fasta nr.smag.mmetsp.gvog.faaDB \
DB_NR.SMAG.DB_tax_result_test \
tmp \
--orf-filter 0 \
--threads 16 \
--lca-ranks superkingdom,phylum,class,order,family,genus \
--split-memory-limit 500G
Please help me figure out what is wrong with my command.