Clustering using a batch system

Open boratyng opened this issue 2 years ago • 0 comments

Hi,

I am trying to use MMseqs2 to cluster a large protein database by splitting the work into batch jobs. I followed the search example from https://github.com/soedinglab/mmseqs2/wiki#how-to-run-mmseqs2-on-multiple-servers-using-batch-systems, searching each batch of the database against the whole database. I then try to compute clusters from the merged search results with the clust subcommand. Here is my script:

$MMSEQS createdb $INFASTA $DB
$MMSEQS splitdb $DB ${DB}_split --split $NUM_SPLITS

for i in $(ls ${DB}_split_*_$NUM_SPLITS) ; do
    $MMSEQS search $i $DB ${i}_search tmp
done

$MMSEQS mergedbs ${DB}_split_0_${NUM_SPLITS}_search ${DB}_search $(awk 'BEGIN {for (i=1;i < '$NUM_SPLITS';i++) printf("'$DB'_split_%d_'$NUM_SPLITS'_search ", i);}')

$MMSEQS clust ${DB} ${DB}_search ${DB}_clust

The final mmseqs clust step fails with the error "Sequence db size != result db size".
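To illustrate the mismatch the error message points at, here is a minimal sketch that compares the number of entries in two databases. It assumes that each MMseqs2 database has a companion .index file with one line per entry. The helper names count_entries and compare_db_sizes are hypothetical and not part of MMseqs2.

```shell
#!/bin/sh
# Hypothetical helpers, not part of MMseqs2.
# Assumption: a database named X has an index file X.index
# containing one line per entry.

# Print the number of entries in the database given by its path prefix.
count_entries() {
    wc -l < "$1.index" | tr -d ' '
}

# Print both entry counts; return 0 if they match, non-zero otherwise.
compare_db_sizes() {
    seq_n=$(count_entries "$1")
    res_n=$(count_entries "$2")
    echo "sequence db: $seq_n entries, result db: $res_n entries"
    [ "$seq_n" -eq "$res_n" ]
}
```

With the variables from the script above, this could be called as compare_db_sizes "$DB" "${DB}_search" right before the clust step. A non-zero exit status would indicate that the merged result database does not have the same number of entries as the sequence database.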

Is there a way to combine the search results into one result database, or to compute clusters for each of my database batches and merge them, or any other way to do clustering on a batch system (without MPI)?

Your Environment

Linux CentOS. MMseqs2 Release 14-7e284: https://github.com/soedinglab/MMseqs2/releases/download/14-7e284/mmseqs-linux-avx2.tar.gz

Oct 12 '23 14:10 boratyng