Simple peptide sequence comparison
I am trying to use your tool to compare the short peptide sequences of 25 peptides with each other. The results.m8.gz file should have 625 pairs of peptides outputted however it only has 30 pairs. This is either a default setting that I have yet to uncover or a bug. Here is my code:
#!/bin/bash #SBATCH --job-name=mmseq_mini #SBATCH --partition=devel #SBATCH --nodes=1 #SBATCH --cpus-per-task=4 #SBATCH --mem=32G #SBATCH --time=1:00:00 #SBATCH --output=mmseqs_allvsall_%j.out #SBATCH --error=mmseqs_allvsall_%j.err
-------------------------------------------------------------------
Parameters that adjust sensitivity. These are 'maxed out' for maximum
sensitivity (and maximum output file size). Tweaking can reduce the outputfile
size if needed:
MIN_SEQ_ID (minimum sequence identity) is between 0 and 100;
SENSITIVITY is between 0 and 10;
MAX_SEQS (max # sequences to match per input sequence) Between 1 and the number of input sequences
MIN_SEQ_ID=0.0 MAX_SEQS=3000000 SENSITIVITY=10
-------------------------------------------------------------------
---- Load environment ----
module purge module load MMseqs2
---- Define paths ----
WORKROOT=$SLURM_SUBMIT_DIR INPUT_TABLE="$WORKROOT/file_v1mini.txt" TMPDIR_LOCAL="/tmp/$USER/mmseqs_${SLURM_JOB_ID}" OUTDIR="$WORKROOT/mmseqs_allvsall_out"
mkdir -p "$OUTDIR" mkdir -p "$TMPDIR_LOCAL"
---- Step 1. Convert input table (i.e., text metadata file containing SeqID and peptide sequence) to FASTA (unique headers) ----
PEPTIDE_FASTA="$TMPDIR_LOCAL/peptides_unique.fasta" awk -F'\t' 'NR>1 && $10!="" {print ">"$1"__"NR"\n"$10}' "$INPUT_TABLE" > "$PEPTIDE_FASTA"
---- Step 2. Create MMseqs2 database ----
PEPTIDE_DB="$TMPDIR_LOCAL/peptidesDB" mmseqs createdb "$PEPTIDE_FASTA" "$PEPTIDE_DB"
---- Step 3. Prepare header file for .m8 output ----
HEADER_FILE="$OUTDIR/out.header" echo -e "query\ttarget\tpident\talnlen\tmismatch\tgapopen\tqstart\tqend\ttstart\ttend\tevalue\tbitscore" > "$HEADER_FILE"
---- Step 4. Run all-vs-all search ----
On HPC 'mmseqs search' command needs to be wrapped by a call to srun to avoid a parallization failure (MPI)
RESULT_DB="$TMPDIR_LOCAL/resultDB"
srun mmseqs search "$PEPTIDE_DB" "$PEPTIDE_DB" "$RESULT_DB" "$TMPDIR_LOCAL"
--threads $SLURM_CPUS_PER_TASK
--min-seq-id $MIN_SEQ_ID
--max-seqs $MAX_SEQS
-s $SENSITIVITY
---- Step 5. Convert to BLAST-like tabular .m8 (compressed) ----
OUT_M8="$OUTDIR/out.m8.gz"
TMP_M8="$TMPDIR_LOCAL/out.tmp.m8"
mmseqs convertalis "$PEPTIDE_DB" "$PEPTIDE_DB" "$RESULT_DB" "$TMP_M8"
--format-output "query,target,pident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits"
gzip -c "$TMP_M8" > "$OUT_M8"
rm -f "$TMP_M8"
---- Step 6. Filter self-hits and short/low-identity alignments ----
FILTERED_M8="$OUTDIR/out.filtered.m8.gz" zcat "$OUT_M8" | awk -F'\t' '$1 != $2 && $3 >= 30 && $4 >= 6' | gzip > "$FILTERED_M8"
---- Step 7. Optional: global alignment on filtered hits only ----
GLOBAL_RESULT_DB="$TMPDIR_LOCAL/resultDB_global" GLOBAL_M8="$OUTDIR/out.filtered_global.m8.gz"
Extract filtered IDs
zcat "$FILTERED_M8" | cut -f1,2 | tr '\t' '\n' | sort -u > "$TMPDIR_LOCAL/filtered_ids.txt"
---- Step 8. Cleanup ----
rm -rf "$TMPDIR_LOCAL"
echo "[$(date)] All-vs-all MMseqs2 workflow completed successfully."
Any thoughts or suggestions would be greatly appreciated.
You can try the parameters we proposed in SpacePHARER (https://academic.oup.com/bioinformatics/article/37/19/3364/6207963):
--seed-sub-mat VTML40.out --gap-open 16 --gap-extend 2 --spaced-kmer-pattern 10111011 -k 6
Or switch to a gapless alignment --prefilter-mode 1 (or --gpu 1 with MMseqs2-GPU).
@milot-mirdita thank you for the suggestion. It seems that the issue is irrespective of alignment. Regardless of whether or not the alignment is correct, the result.m8 file that does not contain any filtering should contain the correct number of pairs that it attempted to align if I am understanding how the MMseqs2 tool works. If there is no alignment/sequence similarity than the pident = 0 and the pair comparison should be still part of the output document.
No, MMseqs2 does not exhaustively align everything. Hits that are rejected by the filtering steps are not reported later.
@milot-mirdita Thank you. As I mentioned there is purposefully no filtering for the result.m8 file (i.e. out.m8 file in my code) because I want to see all of the pair comparisons before they are passed to the filtering.result.m8 file (i.e. out.filtered.m8 file in my code). So, because I am not adding any filtering for the result.m8 file, I am trying to find out if there is some level of default filtering that going on under the "hood" that I am not aware of that is causing the behavior that I am experiencing. That is, 25 peptide sequences are input which should yield 625 peptide pairs but it only produces 30 peptide pairs in the out.m8 file. Otherwise, the behavior suggests that there may be a bug.