LCA fails with segmentation fault
Expected Behavior
Taxonomy assignment of viral OTU sequences (nucleotide) using the 2bLCA method against a custom-formatted amino acid database from IMG/VR
Current Behavior
The LCA step dies with a segmentation fault on a small test dataset that I have previously run successfully against Antônio Camargo's ICTV MMseqs2 protein database (https://github.com/apcamargo/ictv-mmseqs2-protein-database).
For reference, I allocated 40 cores and 700 GB of RAM to this job; it fails after consuming only 178 GB of memory.
Steps to Reproduce (for bugs)
Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
I formatted the IMG/VR v4 7.1 amino acid database as recommended (https://github.com/soedinglab/MMseqs2/wiki#create-a-seqtaxdb-by-manual-annotation-of-a-sequence-database) and created a custom taxdump using taxonkit (a sketch of that step follows the createtaxdb log below). The custom taxdb was created without issue:
mmseqs createdb --dbtype 1 IMGVR_all_proteins-high_confidence.faa.gz IMG_tax_db/IMG_tax_db
createdb --dbtype 1 IMGVR_all_proteins-high_confidence.faa.gz IMG_tax_db/IMG_tax_db
MMseqs Version: 14.7e284
Database type 1
Shuffle input database true
Createdb mode 0
Write lookup file 1
Offset of numeric ids 0
Compressed 0
Verbosity 3
Converting sequences
[112567430] 8m 8s 166ms
Time for merging to IMG_tax_db_h: 0h 0m 39s 840ms
Time for merging to IMG_tax_db: 0h 1m 54s 537ms
Database type: Aminoacid
Time for processing: 0h 14m 27s 634ms
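As a sanity check (a sketch; the row shown is a placeholder, not real output), the .lookup file written by createdb shows which sequence identifiers the taxid mapping has to match:
# column 2 of the .lookup file holds the FASTA accession; it must match
# column 1 of the file passed later via --tax-mapping-file
head -n 1 IMG_tax_db/IMG_tax_db.lookup
# expected shape (placeholder row): 0<TAB>IMGVR_UViG_...<TAB>0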
#integrate all into a complete mmseqs2 taxdb
mmseqs createtaxdb IMG_tax_db/IMG_tax_db /home/bbrow6/tmp --ncbi-tax-dump IMG_taxdump --tax-mapping-file UVIG_taxid_mapping_cleaned
createtaxdb IMG_tax_db/IMG_tax_db /home/bbrow6/tmp --ncbi-tax-dump IMG_taxdump --tax-mapping-file UVIG_taxid_mapping_cleaned
MMseqs Version: 14.7e284
NCBI tax dump directory IMG_taxdump
Taxonomy mapping file UVIG_taxid_mapping_cleaned
Taxonomy mapping mode 0
Taxonomy db mode 1
Threads 28
Verbosity 3
Loading nodes file ... Done, got 6986 nodes
Loading merged file ... Done, added 0 merged nodes.
Loading names file ... Done
Init RMQ ...Done
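For completeness, here is roughly how the custom taxdump and mapping file were put together. This is a sketch rather than the exact commands I ran: UVIG_lineages.tsv is an illustrative file name, and the create-taxdump flags should be checked against the installed taxonkit version. As far as I understand, taxonkit create-taxdump generates its own (often very large) taxids by hashing node names.
# lineage table: one row per UViG, accession in column 1, then one
# tab-separated column per taxonomic rank (file name is hypothetical)
taxonkit create-taxdump -A 1 --out-dir IMG_taxdump UVIG_lineages.tsv
# the mapping file handed to createtaxdb is two tab-separated columns:
# <sequence accession><TAB><taxid>
head -n 1 UVIG_taxid_mapping_cleaned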
The job was submitted with the following batch script, including params:
#PBS -M [email protected]
#PBS -m a
#PBS -l mem=700gb
#PBS -l nodes=1:ppn=40
#PBS -P a675a67f-9204-4f66-9785-891b95c7d3da
#PBS -q paidq
#PBS -o /home/bbrow6/script_output/job-mmseqs_easytax_050523.out
#PBS -e /home/bbrow6/script_error/job-mmseqs_easytax_050523.err
cd /home/bbrow6/taxonomy_stuffs
export DBs=/home/bbrow6/JGI/IMG_VR_2022_12_19_7.1/IMG_tax_db
export OTU_dir=/home/bbrow6/vaginal_virome/Run_021723/identified_viral_sequences/OTUs/geNomad/genomad_output_1000bps/clustered_spades_cross_assembly_contigs_gt1000bps_summary/
source activate mmseqs2
module load OpenMPI
mmseqs easy-taxonomy $OTU_dir/clustered_spades_cross_assembly_contigs_gt1000bps_virus.fna $DBs/IMG_tax_db vag_taxonomy_results_IMG tmp -e 1e-5 -s 6 --blacklist "" --tax-lineage 1
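One variation I have not yet tried is capping the prefilter memory so the target database gets split instead of loaded whole. An untested sketch of that call (the 600G value is a guess, sized below the node's 700 GB allocation):
# untested: --split-memory-limit makes MMseqs2 process the target DB in
# splits rather than loading it all at once
mmseqs easy-taxonomy $OTU_dir/clustered_spades_cross_assembly_contigs_gt1000bps_virus.fna \
    $DBs/IMG_tax_db vag_taxonomy_results_IMG tmp \
    -e 1e-5 -s 6 --blacklist "" --tax-lineage 1 --split-memory-limit 600G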
MMseqs Output (for bugs)
Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.
Full output and error attached below
tmp/10336174962539687461/taxonomy_tmp/11653652317365833767/tmp_taxonomy/6923600097584969791/taxonomy.sh: line 58: 78000 Segmentation fault (core dumped) "$MMSEQS" lca "${TARGET}" "${LCAIN}" "${RESULTS}" ${LCA_PAR}
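To confirm the crash is isolated to the lca module, the step can be re-invoked by hand on the intermediate databases left in tmp. The placeholders below stand in for the expanded ${LCAIN} and ${RESULTS} variables from taxonomy.sh; I have not listed the actual file names:
# <lca_input_db> is the intermediate result DB the workflow feeds to lca;
# <lca_result_db> is a fresh output prefix
mmseqs lca $DBs/IMG_tax_db <lca_input_db> <lca_result_db>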
Context
Providing context helps us come up with a solution and improve our documentation for the future.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
- Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 14.7e284
- Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): bioconda
- For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory): 40 cpus, 700gb RAM
- Operating system and version:
- Operating System: CentOS Linux 7 (Core)
- Kernel: Linux 3.10.0-1160.62.1.el7.x86_64
Attachments: job-mmseqs_easytax_050523_error.txt, job-mmseqs_easytax_050523_out.txt
How large is the database you created? Would it be possible to share it?
What does your tax mapping file (UVIG_taxid_mapping_cleaned) look like? It seems to contain some very large taxid values (1446979566). Maybe I didn't correctly account for taxids being that large.
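Could you also check the largest taxid in the mapping? Assuming the file is two tab-separated columns with the taxid in column 2, something like this would show whether any value exceeds the signed 32-bit range (2147483647):
# prints the maximum taxid found in the mapping file
awk -F'\t' '$2 > max { max = $2 } END { print max }' UVIG_taxid_mapping_cleaned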
I got the same error as itsmisterbrown: the LCA step dies due to a segmentation fault. Here is my command line, and I have also attached my log and error files: out.txt err.txt
mmseqs easy-taxonomy \
test.fasta nr.smag.mmetsp.gvog.faaDB \
DB_NR.SMAG.DB_tax_result_test \
tmp \
--orf-filter 0 \
--threads 16 \
--lca-ranks superkingdom,phylum,class,order,family,genus \
--split-memory-limit 500G
Please help me figure out what is wrong with my command.