Missing genes in output after adding alignments to clustering.
Expected Behavior
All genes from the input file should be in the output.
Current Behavior
Without the alignment step, the behavior is as expected. When the alignment step is included, some genes are missing from the output file.
Steps to Reproduce (for bugs)
mmseqs createdb $infile $DB/$n --dbtype 2
mmseqs cluster $DB/$n $C90/${n}_C90_cluster ${n}_tmp --min-seq-id 0.90 --cov-mode 1 -c 0.5 --cluster-mode 2 --cluster-reassign --threads 5
mmseqs align $DB/$n $DB/$n $C90/${n}_C90_cluster $C90/${n}_C90_align --alignment-mode 2 --threads 5
mmseqs createtsv $DB/$n $DB/$n $C90/${n}_C90_cluster ${odir}/${n}_C90_cluster.tsv --threads 5
mmseqs createtsv $DB/$n $DB/$n $C90/${n}_C90_align ${odir}/${n}_C90_cluster_align.tsv --threads 5
MMseqs Output (for bugs)
wc -l 09d_mmseqs_cluster_tsv/*
344349 09d_mmseqs_cluster_tsv/Vibrio_cholerae_C90_cluster_align.tsv
358964 09d_mmseqs_cluster_tsv/Vibrio_cholerae_C90_cluster.tsv
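The two line counts differ by 14,615 rows. To see which genes those are rather than only how many, the two TSV files can be compared directly. The following is a minimal sketch, not part of MMseqs2: `missing_pairs` is a hypothetical helper name, and it assumes both files are tab-separated with the representative ID in column 1 and the member ID in column 2.

```shell
# List (representative, member) pairs that appear in the cluster TSV
# but not in the alignment TSV.
# Assumption: both files are tab-separated, with the representative ID
# in column 1 and the member ID in column 2.
missing_pairs() {
    cluster_tsv=$1
    align_tsv=$2
    tmp_dir=$(mktemp -d) || return 1
    # Keep only the ID columns and sort them so comm can compare the files.
    cut -f1,2 "$cluster_tsv" | sort -u > "$tmp_dir/cluster_pairs"
    cut -f1,2 "$align_tsv" | sort -u > "$tmp_dir/align_pairs"
    # -23 prints lines that occur only in the first file.
    comm -23 "$tmp_dir/cluster_pairs" "$tmp_dir/align_pairs"
    rm -rf "$tmp_dir"
}
```

Calling `missing_pairs ${odir}/${n}_C90_cluster.tsv ${odir}/${n}_C90_cluster_align.tsv` would then print one missing pair per line.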
Context
I'm trying to use MMseqs2 instead of CD-HIT, but I need the alignment of each gene in a cluster to its representative gene.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
- Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
- Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.):
- For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
- Operating system and version:
The alignments are probably falling above the default alignment E-value threshold. Just give a large value to the -e parameter and it should not reject any alignments anymore.
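As a sketch of that suggestion, the align step from the report can be wrapped so the E-value cutoff is passed explicitly. `align_with_evalue` is a hypothetical helper name, and the cutoff is left as an argument because the thread does not say which value was used in the end; the other options are copied from the original command.

```shell
# Rerun the alignment step with an explicit E-value cutoff (-e).
# Arguments: sequence DB, cluster result DB, output alignment DB, E-value.
# The helper name and argument order are illustrative assumptions.
align_with_evalue() {
    seq_db=$1
    cluster_db=$2
    align_db=$3
    evalue=$4
    mmseqs align "$seq_db" "$seq_db" "$cluster_db" "$align_db" \
        --alignment-mode 2 -e "$evalue" --threads 5
}
```

For example, `align_with_evalue $DB/$n $C90/${n}_C90_cluster $C90/${n}_C90_align 1000000` uses an arbitrary large placeholder value. The output database has to be new or removed first, and createtsv has to be rerun on the result.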
Ok. That totally works. Thank you!!
I had to give it a very large value. Should I think about adjusting some of the clustering thresholds? Does this mean that some of the sequences that end up in various clusters aren't a great fit?