`mmseqs expandaln` error: Invalid database read | getData: local id >= db size
Expected Behavior
mmseqs expandaln should complete successfully.
Current Behavior
mmseqs expandaln throws this error:
Invalid database read for database data file=/home/user/project/target_DB/target_DB.idx, database index=/home/user/project/target_DB/target_DB.idx.index
getData: local id (4294967295) >= db size (22)
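A note on the number in the message: 4294967295 is 2^32 - 1, the largest unsigned 32-bit integer, which is also what -1 becomes when stored as a uint32. Values like this are often used as a "not found" marker. Whether that is what happens inside expandaln here is an assumption on my part, not something the log confirms. A minimal shell check of the arithmetic (the variable names are only illustrative):

```shell
# The local id reported in the error message above.
local_id=4294967295

# Largest value an unsigned 32-bit integer can hold (2^32 - 1).
uint32_max=$(( (1 << 32) - 1 ))

if [ "$local_id" -eq "$uint32_max" ]; then
    # Assumption: this looks like a sentinel value, not a real position in the database.
    echo "local id equals UINT32_MAX"
fi
```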
Steps to Reproduce (for bugs)
All these commands are executed when I run colabfold_search, which fails at expandaln.
createdb result_20230419_115721/query.fas result_20230419_115721/qdb --shuffle 0
search result_20230419_115721/qdb /home/user/project/target_DB/target_DB result_20230419_115721/res result_20230419_115721/tmp --threads 96 --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000
prefilter result_20230419_115721/qdb /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3
align result_20230419_115721/qdb /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_0 result_20230419_115721/tmp/16464230693756166324/aln_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 1 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3
result2profile result_20230419_115721/qdb /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/aln_0 result_20230419_115721/tmp/16464230693756166324/profile_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.1 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --db-load-mode 2 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --gap-pc 10 --threads 96 --compressed 0 -v 3
prefilter result_20230419_115721/tmp/16464230693756166324/profile_0 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_tmp_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3
subtractdbs result_20230419_115721/tmp/16464230693756166324/pref_tmp_1 result_20230419_115721/tmp/16464230693756166324/aln_0 result_20230419_115721/tmp/16464230693756166324/pref_1 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3
rmdb result_20230419_115721/tmp/16464230693756166324/pref_tmp_1
align result_20230419_115721/tmp/16464230693756166324/profile_0 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_1 result_20230419_115721/tmp/16464230693756166324/aln_tmp_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3
mergedbs result_20230419_115721/tmp/16464230693756166324/profile_0 result_20230419_115721/tmp/16464230693756166324/aln_1 result_20230419_115721/tmp/16464230693756166324/aln_0 result_20230419_115721/tmp/16464230693756166324/aln_tmp_1
rmdb result_20230419_115721/tmp/16464230693756166324/aln_0
rmdb result_20230419_115721/tmp/16464230693756166324/aln_tmp_1
result2profile result_20230419_115721/tmp/16464230693756166324/profile_0 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/aln_1 result_20230419_115721/tmp/16464230693756166324/profile_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.1 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --db-load-mode 2 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --gap-pc 10 --threads 96 --compressed 0 -v 3
prefilter result_20230419_115721/tmp/16464230693756166324/profile_1 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_tmp_2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3
subtractdbs result_20230419_115721/tmp/16464230693756166324/pref_tmp_2 result_20230419_115721/tmp/16464230693756166324/aln_1 result_20230419_115721/tmp/16464230693756166324/pref_2 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3
rmdb result_20230419_115721/tmp/16464230693756166324/pref_tmp_2
align result_20230419_115721/tmp/16464230693756166324/profile_1 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_2 result_20230419_115721/tmp/16464230693756166324/aln_tmp_2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3
mergedbs result_20230419_115721/tmp/16464230693756166324/profile_1 result_20230419_115721/res result_20230419_115721/tmp/16464230693756166324/aln_1 result_20230419_115721/tmp/16464230693756166324/aln_tmp_2
rmdb result_20230419_115721/tmp/16464230693756166324/aln_tmp_2
expandaln result_20230419_115721/qdb /home/user/project/target_DB/target_DB.idx result_20230419_115721/res /home/user/project/target_DB/target_DB.idx result_20230419_115721/res_exp --db-load-mode 2 --threads 96 --expansion-mode 0 -e 1.7976931348623157e+308 --expand-filter-clusters 1 --max-seq-id 0.95
Invalid database read for database data file=/home/user/project/target_DB/target_DB.idx, database index=/home/user/project/target_DB/target_DB.idx.index
getData: local id (4294967295) >= db size (22)
MMseqs Output (for bugs)
Context
I wish to run colabfold_search on my own database via --db1 'target_DB'. colabfold_search works fine with --db1 'uniref30_2103_db'.
Number of sequences in query.fasta: 1
egrep -c '^>' query.fasta
1
wc -l result_20230419_115721/qdb
1 result_20230419_115721/qdb
Number of sequences in target_DB.fasta: 104664
egrep -c '^>' target_DB.fasta
104664
wc -l target_DB
104664 target_DB
Number of sequences in resulting database res: 1011
wc -l result_20230419_115721/res
1011 result_20230419_115721/res
Number of sequences in intermediate databases:
wc -l result_20230419_115721/tmp/latest/pref_0
2455 result_20230419_115721/tmp/latest/pref_0
wc -l result_20230419_115721/tmp/latest/profile_0
28 result_20230419_115721/tmp/latest/profile_0
wc -l result_20230419_115721/tmp/latest/profile_1
34 result_20230419_115721/tmp/latest/profile_1
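One caveat about the counts above: wc -l on a data file counts newline characters in that file, which is not necessarily the number of database entries. As far as I understand the MMseqs2 database layout, the accompanying .index file holds one line per entry (key, offset, length), so counting its lines should give the entry count. A small sketch, where db_entries is a hypothetical helper name and not an MMseqs2 command:

```shell
# Hypothetical helper: number of entries in an MMseqs2 database,
# taken as the number of lines in its .index file.
# Usage: db_entries <database-path-without-.index>
db_entries() {
    # Read via stdin so wc prints only the number; strip padding some wc versions add.
    wc -l < "$1.index" | tr -d ' '
}
```

For example, db_entries result_20230419_115721/res would report how many queries have a result entry, while wc -l on the data file for a result database is probably closer to the number of hit lines.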
Another issue asked what these awk commands return for the databases involved, so here are the results:
awk 'BEGIN { min = 2^32; } $3 < min { min = $3 }; $3 > max { max = $3 } { sum = sum + $3; n = n + 1; } END { print sum/n,min,max; }' $out_DB
awk 'BEGIN { min = 2^32; } $2 < min { min = $2 }; $2 > max { max = $2 } { sum = sum + $2; n = n + 1; } END { print sum/n,min,max; }' $out_DB
out_DB | col $3 | col $2
-----------------------------------+--------------------------+-----------------------
target_DB/target_DB.index | 412.665 2 8110 | 2.15005e+07 0 43190597
target_DB/target_DB.idx.index | 6.04213e+07 1 512000009 | 5.54188e+08 0 1261572096
result_20230419_115721/qdb.index | 114 114 114 | 0 0
result_20230419_115721/qdb_h.index | 190 190 190 | 0 0
result_20230419_115721/res.index | 58682 58682 58682 | 0 0
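The col $2 cells for the last three rows show only two numbers. That is most likely because the awk one-liners never initialize max, so it stays empty when every value in the column is 0. A sketch of the same summary that seeds min and max from the first row instead (index_stats is a hypothetical helper name):

```shell
# Hypothetical helper: print mean, min and max of one column of a .index file.
# Usage: index_stats <column-number> <index-file>
index_stats() {
    awk -v col="$1" '
        # Seed min and max from the first row so a column of zeros is reported correctly.
        NR == 1 { min = $col + 0; max = $col + 0 }
        {
            v = $col + 0            # force numeric comparison
            if (v < min) min = v
            if (v > max) max = v
            sum += v
            n++
        }
        # Skip output for an empty file to avoid dividing by zero.
        END { if (n > 0) print sum / n, min, max }
    ' "$2"
}
```

Usage would be index_stats 3 "$out_DB" for the length column and index_stats 2 "$out_DB" for the offset column.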
I can run these sequences via mmseqs easy-search (which does not call expandaln):
easy-search query.fasta /home/user/project/target_DB/target_DB result_DB tmp_easy_search --db-output 1 --max-seqs 10000
wc -l result_DB
606 result_DB
# awk sum/n,min,max
out_DB | col $3 | col $2
-----------------------------------+--------------------------+-----------------------
result_DB.index | 104112 104112 104112 | 0 0
Your Environment
- MMseqs2 Version: 67949d702dbfc6e5d54fdd0f14a9ab6740f11c32
- self-compiled
- cmake version 3.16.3
  - The CXX compiler identification is GNU 9.4.0
  - The C compiler identification is GNU 9.4.0
  - cmake.000.log created Sep 1 2022
  - make.000.log created Sep 1 2022
  - make_install.000.log created Sep 1 2022
- uname -a: Linux lambda-name 5.4.0-144-generic #161-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Please let me know if there is any other information I can share to help debug this.
Kind regards.
same problem.