Clustering using a batch system
Hi,
I am trying to use MMseqs2 to cluster a large protein database by splitting the work into batch jobs. Following the search example from https://github.com/soedinglab/mmseqs2/wiki#how-to-run-mmseqs2-on-multiple-servers-using-batch-systems, I search each batch of the database against the whole database, then try to compute clusters from the search results with the clust subcommand. Here is my script:
$MMSEQS createdb $INFASTA $DB
$MMSEQS splitdb $DB ${DB}_split --split $NUM_SPLITS
for i in "${DB}"_split_*_"${NUM_SPLITS}" ; do
    $MMSEQS search "$i" "$DB" "${i}_search" tmp
done
$MMSEQS mergedbs ${DB}_split_0_${NUM_SPLITS}_search ${DB}_search $(awk 'BEGIN {for (i=1;i < '$NUM_SPLITS';i++) printf("'$DB'_split_%d_'$NUM_SPLITS'_search ", i);}')
$MMSEQS clust ${DB} ${DB}_search ${DB}_clust
mmseqs clust fails with the error "Sequence db size != result db size".
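The size mismatch can be checked by comparing the number of entries in the two databases. The following is a minimal sketch, assuming each MMseqs2 database has a `.index` file with one line per entry; `count_entries` is a hypothetical helper written for illustration, not an MMseqs2 command, and `DB` is the database prefix from the script above.

```shell
# Hypothetical helper (not an MMseqs2 command): count the entries of a
# database by counting the lines of its .index file (assumed: one line per entry).
count_entries() {
    wc -l < "$1.index" | tr -d ' '
}

# Compare the sequence database with the merged search results.
# The check is skipped if DB is unset or either index file is missing.
if [ -n "${DB:-}" ] && [ -f "$DB.index" ] && [ -f "${DB}_search.index" ]; then
    echo "sequence entries: $(count_entries "$DB")"
    echo "result entries:   $(count_entries "${DB}_search")"
fi
```

If the two counts differ, the merged result database does not have one entry per sequence, which would be consistent with the error reported by clust.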
Is there a way to combine the search results into one results database, or to compute clusters for each database batch and merge them, or any other way to do clustering on a batch system (without MPI)?
Your Environment
Linux CentOS. MMseqs2 Release 14-7e284: https://github.com/soedinglab/MMseqs2/releases/download/14-7e284/mmseqs-linux-avx2.tar.gz