Help to troubleshoot a clustering-profile-search workflow

Open sivico26 opened this issue 1 year ago • 1 comments

Hello there,

Thanks for developing and maintaining mmseqs2. I have a slightly specific use case and I think I might be overcomplicating it, so I would like to ask for your advice on how to proceed.

For context, I am starting from a file with my target probes.fasta, which has N genes each with different versions of the gene (let's say an average of M versions). So the total number of sequences there is NxM.

Now, separately I have a set of samples and after assembling them I want to filter the assembly contigs according to the match to the genes. The current simple approach is to do a mmseqs search with probes.fasta and the assembly but I feel I can do better than that.

My idea is to cluster the genes and make profiles out of them and use those for profiles to do a more sensitive search. Now, the proper way to achieve this is what I am unsure about.

Firstly I split probes.fasta per gene, so I now have a split_probes folder where I have N fasta files. The reason to do this is that I do not want to risk the clustering to mix genes. So to start clustering:

mkdir -p seqDBs clustersDB
parallel mkdir -p tmp/tmp_{/.} ::: split_probes/*.fasta
parallel mmseqs createdb {} seqDBs/{/.} --dbtype 1 ::: split_probes/*.fasta
parallel mmseqs cluster seqDBs/{/.} clustersDB/{/.} tmp/tmp_{/.} --min-seq-id 0.8 --threads 1 ::: split_probes/*.fasta

Now this is where it starts to get confusing because for searching I want to have a single DB with all the profiles. Should I merge the clustering results and then make a profile? or should I make N profiles and then merge them? In the end, I did the latter:

mkdir -p repsDB profilesDB
parallel mmseqs createsubdb clustersDB/{/.} seqDBs/{/.} repsDB/{/.} ::: split_probes/*.fasta
parallel mmseqs createsubdb clustersDB/{/.} seqDBs/{/.}_h repsDB/{/.}_h ::: split_probes/*.fasta
parallel mmseqs result2profile repsDB/{/.} seqDBs/{/.} clustersDB/{/.} profilesDB/{/.} --threads 1 --mask-profile 0 ::: split_probes/*.fasta

A small parenthesis, I found the CLI of mmseqs mergedbs quite odd. Why is the output the second positional argument? Why not the first or the last? It would help to play more nicely with wildcard use. Anyway, that's why these

parallel echo profilesDB/{/.} ::: split_probes/*.fasta > profsDBs
mmseqs mergedbs $(head -n1 profsDBs) merged_profsDB $(tail -n +2 profsDBs)
parallel echo profilesDB/{/.}_h ::: split_probes/*.fasta > profsDBs_h
mmseqs mergedbs $(head -n1 profsDBs_h) merged_profsDB $(tail -n +2 profsDBs_h)

Finally, I got a single DB with what I thought should be a all the profiles. So I proceeded to do the mmseq search something like this:

mmseqs search ../cleaning_test/assembled/filtering/dbs/samples/Andryala_integrifolia_ERR7618428 merged_profsDB andryala_results tmp_andryala --threads 4 -s 7.5 -e 1.00E-6 --min-length 15 --remove-tmp-files -a
mmseqs convertalis ../cleaning_test/assembled/filtering/dbs/samples/Andryala_integrifolia_ERR7618428 merged_profsDB andryala_results my_prof_andryala_table.tsv --format-mode 4 --format-output "query,evalue,qstart,qend,qlen,tstart,tend,tlen,theader,gapopen,nident,mismatch" --threads 4

However, the table result I got seems malformed: my_prof_andryala_table.tsv.txt

To give you an idea of what I expect here is what the current approach produces: Andryala_integrifolia_ERR7618428.tsv.txt

This is it, I would really appreciate your advice to troubleshoot this. Thank you in advance

Jan 18 '25 13:01 sivico26

@martin-steinegger, do you know about this application of MMseqs?

Feb 18 '25 09:02 sivico26