Missing genes in output after adding alignments to clustering.

Open rotheconrad opened this issue 3 years ago • 2 comments

Expected Behavior

All genes from the input file should be in the output.

Current Behavior

Without the alignment step, the output is as expected. When the alignment step is included, some genes are missing from the output file.

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

mmseqs createdb $infile $DB/$n --dbtype 2
mmseqs cluster $DB/$n $C90/${n}_C90_cluster ${n}_tmp --min-seq-id 0.90 --cov-mode 1 -c 0.5 --cluster-mode 2 --cluster-reassign --threads 5
mmseqs align $DB/$n $DB/$n $C90/${n}_C90_cluster $C90/${n}_C90_align --alignment-mode 2 --threads 5
mmseqs createtsv $DB/$n $DB/$n $C90/${n}_C90_cluster ${odir}/${n}_C90_cluster.tsv --threads 5
mmseqs createtsv $DB/$n $DB/$n $C90/${n}_C90_align ${odir}/${n}_C90_cluster_align.tsv --threads 5

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.

wc -l 09d_mmseqs_cluster_tsv/*
344349 09d_mmseqs_cluster_tsv/Vibrio_cholerae_C90_cluster_align.tsv
358964 09d_mmseqs_cluster_tsv/Vibrio_cholerae_C90_cluster.tsv
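The two counts above differ by 14,615 lines. To see which cluster members have no alignment row, a small helper along these lines may be useful. It is only a sketch: `missing_pairs` is a hypothetical name, and it assumes both TSV files are tab-separated with the representative ID in column 1 and the member ID in column 2.

```shell
# Hypothetical helper: print (representative, member) pairs that appear in the
# cluster TSV but have no row in the alignment TSV.
# Assumption: both files are tab-separated, with the two IDs in columns 1 and 2.
missing_pairs() {
    cluster_tsv=$1
    align_tsv=$2
    awk -F'\t' '
        # First file (alignment TSV): remember every aligned pair.
        FILENAME == ARGV[1] { seen[$1 FS $2] = 1; next }
        # Second file (cluster TSV): report pairs that were never aligned.
        !(($1 FS $2) in seen) { print $1 FS $2 }
    ' "$align_tsv" "$cluster_tsv"
}
```

Calling it with the cluster TSV first and the alignment TSV second, then piping the result to `wc -l`, should give the number of missing pairs if the column assumption holds.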

Context

Providing context helps us come up with a solution and improve our documentation for the future.

I'm trying to use MMseqs2 instead of CD-HIT, but I need the alignment of each gene in a cluster to the representative gene.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.):
  • For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
  • Operating system and version:

rotheconrad, Aug 26 '22 21:08

The alignments are probably falling above the default alignment E-value threshold. Just give a large value to the -e parameter and it should not reject any alignments anymore.
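As a rough illustration of this suggestion, the align step from the report could be rerun with an explicit `-e` value. The wrapper below is a sketch, not an official recipe: `align_with_evalue` and the `MMSEQS` override variable are hypothetical names introduced here, the remaining flags are copied from the original command, and the thread does not say which E-value was actually used.

```shell
# Hypothetical wrapper: rerun the align step with a caller-chosen E-value cutoff.
# MMSEQS is an assumed override for the binary (useful for a dry run with echo);
# it defaults to the mmseqs executable on PATH.
align_with_evalue() {
    evalue=$1      # E-value threshold passed to -e
    query_db=$2    # query sequence database
    target_db=$3   # target sequence database
    cluster_db=$4  # clustering result used as input to align
    out_db=$5      # alignment result database to create
    "${MMSEQS:-mmseqs}" align "$query_db" "$target_db" "$cluster_db" "$out_db" \
        --alignment-mode 2 -e "$evalue" --threads 5
}

# Example call (1e6 is only a placeholder for "a large value"):
# align_with_evalue 1e6 "$DB/$n" "$DB/$n" "$C90/${n}_C90_cluster" "$C90/${n}_C90_align"
```

Taking the threshold as an argument keeps the sketch from committing to a specific cutoff, which has to be chosen per dataset.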

milot-mirdita, Aug 27 '22 06:08

Ok. That totally works. Thank you!!

I had to give it a very large value. Should I think about adjusting some of the clustering thresholds? Does this mean that some of the sequences that end up in various clusters aren't a great fit?

rotheconrad, Aug 29 '22 20:08