Custom database creation error at the cstranslate step
Expected Behavior
Custom database created for dbCAN v8.
Current Behavior
Error during the cstranslate step.
Steps to Reproduce (for bugs)
# creating custom dbCAN hhsuite database
## download MSA from http://bcb.unl.edu/dbCAN2/download/ (and uncompress)
wget http://bcb.unl.edu/dbCAN2/download/dbCAN-fam-aln-V8.tar.gz
tar -pzxvf dbCAN-fam-aln-V8.tar.gz
## build from MSAs
cd dbCAN-fam-aln-V8
ffindex_build -s ../dbCAN-fam-aln-V8.ff{data,index} .
cd ../
## consensus
ffindex_apply dbCAN-fam-aln-V8.ffdata dbCAN-fam-aln-V8.ffindex -i dbCAN-fam-aln-V8_a3m.ffindex -d dbCAN-fam-aln-V8_a3m.ffdata -- hhconsensus -M 50 -maxres 65535 -i stdin -oa3m stdout -v 0
## hmm
ffindex_apply dbCAN-fam-aln-V8_a3m.ff{data,index} -i dbCAN-fam-aln-V8_hhm.ffindex -d dbCAN-fam-aln-V8_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
## context states
cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m -o dbCAN-fam-aln-V8_cs219
HH-suite Output (for bugs)
If using cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m -o dbCAN-fam-aln-V8_cs219:
Reading context library for pseudocounts from internal ...
Reading abstract state alphabet from internal ...
ERROR: Unable to read input file 'dbCAN-fam-aln-V8_a3m'!
If using cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m.ffdata -o dbCAN-fam-aln-V8_cs219:
Reading context library for pseudocounts from internal ...
Reading abstract state alphabet from internal ...
ERROR: Sequence 468 has 181 match columns but should have 613!
Your Environment
Ubuntu 18.04.4
# conda env
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 0_gnu conda-forge
bzip2 1.0.8 h516909a_2 conda-forge
ca-certificates 2020.6.20 hecda079_0 conda-forge
certifi 2020.6.20 py37hc8dfbb8_0 conda-forge
curl 7.69.1 h33f0ec9_0 conda-forge
fqtools 2.0 hc0aa232_5 bioconda
hhsuite 3.2.0 py37pl526h3340039_1 bioconda
htslib 1.9 h4da6232_3 bioconda
krb5 1.17.1 h2fd8d38_0 conda-forge
ld_impl_linux-64 2.34 h53a641e_5 conda-forge
libcurl 7.69.1 hf7181ac_0 conda-forge
libdeflate 1.6 h516909a_0 conda-forge
libedit 3.1.20191231 h46ee950_0 conda-forge
libffi 3.2.1 he1b5a44_1007 conda-forge
libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
libgomp 9.2.0 h24d8f2e_2 conda-forge
libssh2 1.9.0 hab1572f_2 conda-forge
libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
llvm-openmp 8.0.1 hc9558a2_0 conda-forge
ncurses 6.1 hf484d3e_1002 conda-forge
openmp 8.0.1 0 conda-forge
openssl 1.1.1g h516909a_0 conda-forge
perl 5.26.2 h516909a_1006 conda-forge
pip 20.1.1 py_1 conda-forge
python 3.7.6 cpython_h8356626_6 conda-forge
python_abi 3.7 1_cp37m conda-forge
readline 8.0 hf8c457e_0 conda-forge
seqkit 0.12.1 0 bioconda
setuptools 47.3.1 py37hc8dfbb8_0 conda-forge
sqlite 3.30.1 hcee41ef_0 conda-forge
taxonkit 0.5.0 0 bioconda
tk 8.6.10 hed695b0_0 conda-forge
wheel 0.34.2 py_1 conda-forge
xz 5.2.5 h516909a_0 conda-forge
zlib 1.2.11 h516909a_1006 conda-forge
Ah, I've been meaning to build a database from dbCAN for a while, thanks for the reminder.
I tried to reproduce building the database and it works correctly with the *_mpi binaries.
Something like this works for me:
DB=dbCAN-fam-V8
wget http://bcb.unl.edu/dbCAN2/download/dbCAN-fam-aln-V8.tar.gz
tar xzvf dbCAN-fam-aln-V8.tar.gz
cd dbCAN-fam-aln;
ffindex_build -s ../${DB}_msa.ff{data,index} .
cd ..
sed 's|\.aln||g' ${DB}_msa.ffindex > ${DB}_msa_renamed.ffindex
mv ${DB}_msa_renamed.ffindex ${DB}_msa.ffindex
mpirun -np 16 ffindex_apply_mpi ${DB}_msa.ffdata ${DB}_msa.ffindex -i ${DB}_a3m.ffindex -d ${DB}_a3m.ffdata -- hhconsensus -M 50 -maxres 65535 -i stdin -oa3m stdout -v 0
mpirun -np 16 ffindex_apply_mpi ${DB}_a3m.ff{data,index} -i ${DB}_hhm.ffindex -d ${DB}_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
mpirun -np 16 cstranslate_mpi -x 0.3 -c 4 -I a3m -i ${DB}_a3m -o ${DB}_cs219
# reorder according to cs219 for better access patterns
sort -k 3 -n ${DB}_cs219.ffindex | cut -f1 > ${DB}.list
for type in a3m hhm; do
ffindex_order ${DB}.list ${DB}_${type}.ffdata ${DB}_${type}.ffindex ${DB}_${type}_opt.ffdata ${DB}_${type}_opt.ffindex
mv -f ${DB}_${type}_opt.ffdata ${DB}_${type}.ffdata
mv -f ${DB}_${type}_opt.ffindex ${DB}_${type}.ffindex
done
md5deep ${DB}_{a3m,hhm,cs219}.ff{data,index} > ${DB}.md5sum
tar czvf ${DB}.tar.gz ${DB}_{a3m,hhm,cs219}.ff{data,index} ${DB}.md5sum
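As a quick sanity check after a build like the one above, the three index files should list the same number of entries, since an ffindex index holds one line per entry. This is only a minimal sketch, assuming the `${DB}_{a3m,hhm,cs219}.ffindex` naming used above; `count_entries` and `check_db` are hypothetical helper names, not part of HH-suite:

```shell
# Count entries in an ffindex index file (one line per entry).
count_entries() {
    wc -l < "$1" | tr -d ' '
}

# Compare entry counts across the a3m, hhm and cs219 indexes of one database prefix.
# Returns 0 if all three agree, 1 otherwise.
check_db() {
    db="$1"
    n_a3m=$(count_entries "${db}_a3m.ffindex")
    n_hhm=$(count_entries "${db}_hhm.ffindex")
    n_cs219=$(count_entries "${db}_cs219.ffindex")
    echo "a3m=${n_a3m} hhm=${n_hhm} cs219=${n_cs219}"
    [ "$n_a3m" = "$n_hhm" ] && [ "$n_hhm" = "$n_cs219" ]
}

# Only run when the (assumed) database files are present in the current directory.
DB=dbCAN-fam-V8
if [ -f "${DB}_a3m.ffindex" ] && [ -f "${DB}_hhm.ffindex" ] && [ -f "${DB}_cs219.ffindex" ]; then
    check_db "$DB" || echo "Entry counts differ; the build may be incomplete." >&2
fi
```

Matching counts do not prove the database is correct, but a mismatch is a cheap early signal that one of the steps failed partway.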
I took the liberty of building this database and putting it on our file server: http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V8.tar.gz
I would recommend searching through it with HHsearch instead of HHblits, though. Due to its small size, HHsearch can still easily handle it and will be more sensitive.
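A search against the downloaded database with HHsearch might look like the sketch below. It assumes the tarball was unpacked in the current directory, so that `dbCAN-fam-V8` is the database prefix, and that `query.a3m` is a hypothetical query alignment; `search_dbcan` is an illustrative wrapper, not an HH-suite command:

```shell
# Illustrative wrapper: search one query against a database prefix with hhsearch.
# $1 = query file (e.g. an A3M alignment)
# $2 = database prefix (without the _a3m/_hhm/_cs219 suffix)
# The result is written next to the query with an .hhr extension.
search_dbcan() {
    hhsearch -i "$1" -d "$2" -o "${1%.*}.hhr"
}

QUERY=query.a3m        # hypothetical query file
DB=dbCAN-fam-V8        # assumed prefix of the unpacked database files
if command -v hhsearch >/dev/null 2>&1 && [ -f "$QUERY" ]; then
    search_dbcan "$QUERY" "$DB"
else
    echo "hhsearch or $QUERY not found; skipping search" >&2
fi
```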
Hello, how do you get the *_mpi binaries? The documentation doesn't describe how to install HH-suite with MPI support. Could you please tell me how to do it? Thanks! I also ran into this problem:
Reading context library for pseudocounts from context_data.lib ...
Reading abstract state alphabet from cs219.lib ...
ERROR: Sequence 1 has 764 match columns but should have 2021!
I added a section to the wiki: https://github.com/soedinglab/hh-suite/wiki#mpi-support
I think you were missing cstranslate's -f (--ffindex) flag, which switches it from single-file mode to reading an ffindex database.
That might be what was causing the error message.
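For the non-MPI binary, the call with that flag could be sketched as follows. The parameters are the ones from the original report with `-f` added; `run_cstranslate` is a hypothetical wrapper name, and the `dbCAN-fam-aln-V8` prefix is assumed from the commands earlier in this thread:

```shell
# Hypothetical wrapper: run cstranslate in ffindex (database) mode.
# $1 = database prefix; with -f, -i and -o are treated as ffindex prefixes,
# so ${1}_a3m.ffdata and ${1}_a3m.ffindex are read as the input database.
run_cstranslate() {
    cstranslate -f -x 0.3 -c 4 -I a3m -i "${1}_a3m" -o "${1}_cs219"
}

DB=dbCAN-fam-aln-V8    # assumed prefix, taken from the original report
if command -v cstranslate >/dev/null 2>&1 \
        && [ -f "${DB}_a3m.ffdata" ] && [ -f "${DB}_a3m.ffindex" ]; then
    run_cstranslate "$DB"
else
    echo "cstranslate or ${DB}_a3m.ff{data,index} not found; skipping" >&2
fi
```

Note that `-i` takes the prefix `${DB}_a3m`, not the `.ffdata` file itself; passing the `.ffdata` file without `-f` is what produced the "match columns" error in the original report.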
I made a new DB for V9: http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V9.tar.gz
The dbCAN team thankfully provided the raw alignments for the new release.