
Extracting alignment position information

Open OllieGrint opened this issue 5 years ago • 6 comments

First time using this package. I have run an alignment with linclust and am wondering if there is a way of extracting the positional information of the clusters and the distance between them, with the goal of using this information to plot the clusters visually.

OllieGrint avatar Nov 27 '20 13:11 OllieGrint

After running linclust, you can run the align module on the clustering result, e.g.:

mmseqs createdb input.fasta input
mmseqs linclust input clu tmp
mmseqs align input input clu aln -a
mmseqs convertalis input input aln result.m8

result.m8 will then contain the pairwise alignments between each cluster representative and its members, including the positional information.

This will not work as easily for nucleotide input though. What input are you using?

milot-mirdita avatar Dec 01 '20 10:12 milot-mirdita

I'm using protein inputs. Is there any way of extracting the identity matrix used by the program to assign sequences to clusters? Ideally looking to be able to analyze the distance within and between clusters.

OllieGrint avatar Dec 01 '20 16:12 OllieGrint

The procedure I described basically gives you an adjacency list in result.m8. You can build a matrix from that.
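A minimal Python sketch of building such a matrix, assuming the default tab-separated result.m8 with query, target and sequence identity in the first three columns. The function name `m8_to_matrix` and the example rows are made up for illustration; they are not real MMseqs2 output. Because the alignments are between representatives and their members, any pair that was never aligned simply keeps the fill value.

```python
def m8_to_matrix(lines, value_col=2, fill=0.0):
    """Build a symmetric matrix from tab-separated m8 lines.

    value_col=2 assumes the third column holds the sequence identity.
    Pairs without a reported alignment keep the `fill` value.
    """
    ids = []      # sequence names in order of first appearance
    seen = set()
    edges = {}    # unordered pair of names -> value
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cols = line.split("\t")
        query, target = cols[0], cols[1]
        value = float(cols[value_col])
        for name in (query, target):
            if name not in seen:
                seen.add(name)
                ids.append(name)
        key = frozenset((query, target))
        # Keep the highest value if a pair is reported more than once.
        edges[key] = max(value, edges.get(key, value))

    index = {name: i for i, name in enumerate(ids)}
    matrix = [[fill] * len(ids) for _ in ids]
    for key, value in edges.items():
        pair = list(key)
        a, b = index[pair[0]], index[pair[-1]]  # a == b for self-hits
        matrix[a][b] = matrix[b][a] = value
    return ids, matrix


# Made-up rows for illustration only (just the first three columns).
example = [
    "repA\trepA\t1.000",
    "repA\tmem1\t0.850",
    "repB\tmem2\t0.900",
]
ids, matrix = m8_to_matrix(example)
```

For a real run, an open file handle can be passed instead of the list, e.g. `with open("result.m8") as fh: ids, matrix = m8_to_matrix(fh)`.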

milot-mirdita avatar Dec 01 '20 16:12 milot-mirdita

Hi, I'm also trying to visualize a clustering result as a network graph. After running the commands above, I got a 'result.m8' file containing 10 different values. Which of these values should I use when building an adjacency matrix for the network graph? Do you have any suggestions?

Thanks

mshrngci118 avatar Apr 04 '22 09:04 mshrngci118

@mshrngci118 This really depends on your use case. You could use the sequence identity or the score to define the strength of the connection between two proteins. The output fields are query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits. Score = bits, fident = sequence identity.
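To make the column choice concrete, here is a hedged Python sketch that names the twelve fields listed above and turns each line into a weighted edge. The names `Hit`, `parse_hit` and `weighted_edges`, as well as the example row, are hypothetical and only for illustration.

```python
from collections import namedtuple

# Field order as listed in the thread for the default output.
FIELDS = ("query", "target", "fident", "alnlen", "mismatch", "gapopen",
          "qstart", "qend", "tstart", "tend", "evalue", "bits")
INT_FIELDS = {"alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend"}
FLOAT_FIELDS = {"fident", "evalue", "bits"}

Hit = namedtuple("Hit", FIELDS)


def parse_hit(line):
    """Parse one tab-separated line into a Hit with typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    values = []
    for name, raw in zip(FIELDS, cols):
        if name in INT_FIELDS:
            values.append(int(raw))
        elif name in FLOAT_FIELDS:
            values.append(float(raw))
        else:
            values.append(raw)
    return Hit(*values)


def weighted_edges(lines, weight="fident"):
    """Return (query, target, weight) tuples, skipping self-hits."""
    edges = []
    for line in lines:
        if not line.strip():
            continue
        hit = parse_hit(line)
        if hit.query == hit.target:
            continue
        edges.append((hit.query, hit.target, getattr(hit, weight)))
    return edges


# Made-up row for illustration only, not real MMseqs2 output.
example_row = "repA\tmem1\t0.850\t120\t18\t0\t1\t120\t3\t122\t1.5e-60\t210"
hit = parse_hit(example_row)
edges_by_identity = weighted_edges([example_row])
edges_by_bits = weighted_edges([example_row], weight="bits")
```

The positional fields (`qstart`, `qend`, `tstart`, `tend`) are available on each `Hit` as well, which may help when plotting where the aligned regions fall.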

martin-steinegger avatar Apr 04 '22 09:04 martin-steinegger

@martin-steinegger Thanks! I couldn't find the header information in the manual, so this is very helpful. I'd like the visualization to reflect structural differences, especially in their motifs. It seems that 'fident' or '-log(evalue)' would be suitable indicators for this.
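A small Python sketch of the '-log(evalue)' transform, assuming base 10 and a lower bound so that an E-value reported as 0 does not cause a math error. The function name `neg_log10_evalue` and the `floor` default are illustrative choices, not something MMseqs2 prescribes.

```python
import math


def neg_log10_evalue(evalue, floor=1e-300):
    """Return -log10(evalue), clamping tiny or zero values to `floor`."""
    return -math.log10(max(evalue, floor))


strong = neg_log10_evalue(1e-10)   # larger value for a more significant hit
clamped = neg_log10_evalue(0.0)    # uses the floor instead of failing
```

Note that E-values depend on the alignment length and the search space, so this weight is not directly comparable to 'fident', which ranges from 0 to 1.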

mshrngci118 avatar Apr 05 '22 09:04 mshrngci118