An observed discrepancy between alignment-type 3Di+AA and 3Di

Open Wangchentong opened this issue 1 year ago • 2 comments

Expected Behavior

Thanks for your amazing tool! I am clustering a high-confidence subset of AFDB with two alignment types, 3Di+AA and 3Di. My intuition is that 3Di should give more non-singleton clusters than 3Di+AA, because very diverse sequences that share the same structure will be assigned to the same cluster in 3Di mode and to different clusters in 3Di+AA mode.

Current Behavior

I tested the cluster command with the two alignment types on the same database (a subset containing 4 million AFDB structures): --alignment-type 0 (3Di) gives me 470715 singletons; --alignment-type 1 (3Di) gives me 759500 singletons.

This is the cluster command I used: foldseek cluster afdb50_new afdb50_new_clust_v2 tmp --remove-tmp-files --alignment-type 0 (or --alignment-type 2)
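Singleton counts like the ones reported above can be derived from a cluster table. The following is a minimal sketch, assuming the clustering has been exported as a two-column tab-separated file (representative, member), for example with foldseek createtsv; the function name and the toy input are hypothetical and only illustrate the counting.

```python
from collections import Counter
import io


def count_singletons(lines):
    """Return (number of singleton clusters, total number of clusters).

    Assumes each non-empty line is "representative<TAB>member".
    """
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, _member = line.split("\t")[:2]
        sizes[representative] += 1
    singletons = sum(1 for size in sizes.values() if size == 1)
    return singletons, len(sizes)


# Toy input (hypothetical identifiers): cluster A has 2 members,
# cluster C has 1 member, cluster D has 3 members.
example = io.StringIO("A\tA\nA\tB\nC\tC\nD\tD\nD\tE\nD\tF\n")
singletons, clusters = count_singletons(example)
print(singletons, clusters)
```

Running the same count on the output of each alignment type makes the two runs directly comparable.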

Is this the expected result? It looks like clustering based solely on 3Di tokens works worse than 3Di+AA. What would you suggest if I want to cluster on structure without the AA tokens?

Any help would be appreciated!

Wangchentong avatar Jun 13 '24 13:06 Wangchentong

Not using the amino-acid information will likely give a less biologically meaningful result. Foldseek was optimized for remote homology detection, and for this I would recommend sticking to 3Di+AA. We were thinking of dropping the 3Di-only mode completely, as we don't think there are many applications where it's really meaningful.

I don't know to what end you are clustering. I would recommend focusing more on your clustering criteria, such as the coverage, sequence identity and E-value at which you still accept cluster members.
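As a rough illustration of those criteria, here is a minimal sketch that assembles a foldseek cluster command line with explicit coverage (-c), sequence identity (--min-seq-id) and E-value (-e) thresholds. The helper name and the threshold values are placeholders chosen for the example, not recommendations, and the command is only printed, not executed.

```python
import shlex


def build_cluster_cmd(db, out, tmp, alignment_type=2,
                      coverage=0.8, min_seq_id=0.0, evalue=1e-3):
    """Build an argument list for `foldseek cluster`.

    All threshold defaults here are illustrative assumptions.
    """
    return [
        "foldseek", "cluster", db, out, tmp,
        "--alignment-type", str(alignment_type),  # 2 = 3Di+AA
        "-c", str(coverage),                      # minimum alignment coverage
        "--min-seq-id", str(min_seq_id),          # minimum sequence identity
        "-e", str(evalue),                        # maximum E-value
    ]


cmd = build_cluster_cmd("afdb50_new", "afdb50_new_clust_v2", "tmp")
print(shlex.join(cmd))
```

Tightening or loosening these three values changes which members a cluster accepts, which in turn changes the number of singletons.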

milot-mirdita avatar Jun 14 '24 09:06 milot-mirdita

Hi Milot @milot-mirdita, thanks for your quick response.

My purpose in clustering is to collect a dataset of highly diverse structures to train deep learning models, so I would like to consider only structural similarity rather than sequence identity.

I will follow your advice and use the 3Di+AA alignment type, but I would also like to know whether there is any other option that would better serve this requirement.

Wangchentong avatar Jun 14 '24 10:06 Wangchentong

The default clustering will yield significantly divergent clusters. However, it is essential to also check the local similarity across structures to avoid homology leakage, by comparing the representatives against each other using foldseek search. Doing both should give a meaningful split.
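The representative-versus-representative check could be post-processed along these lines. This is a minimal sketch, assuming the search result has been converted to the default 12-column tab-separated format of foldseek convertalis (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits); the function name, the E-value cutoff and the toy rows are hypothetical.

```python
import csv
import io


def leaking_pairs(rows, max_evalue=1e-3):
    """Return sorted unique non-self pairs with E-value <= max_evalue.

    Each row is a sequence of 12 fields; the E-value is field 11 (index 10).
    The cutoff is an illustrative assumption.
    """
    pairs = set()
    for row in rows:
        query, target, evalue = row[0], row[1], float(row[10])
        if query != target and evalue <= max_evalue:
            # Store pairs in a canonical order so (a, b) and (b, a) collapse.
            pairs.add(tuple(sorted((query, target))))
    return sorted(pairs)


# Toy alignment table (hypothetical identifiers and scores).
example = io.StringIO(
    "rep1\trep1\t1.0\t100\t0\t0\t1\t100\t1\t100\t1e-50\t500\n"
    "rep1\trep2\t0.3\t80\t50\t2\t1\t80\t5\t84\t1e-8\t90\n"
    "rep2\trep1\t0.3\t80\t50\t2\t5\t84\t1\t80\t2e-8\t89\n"
    "rep1\trep3\t0.1\t30\t25\t1\t1\t30\t1\t30\t5.0\t12\n"
)
rows = csv.reader(example, delimiter="\t")
print(leaking_pairs(rows))
```

Pairs reported this way are candidates to keep on the same side of a train/test split, or to remove from one side.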

martin-steinegger avatar Jan 19 '25 11:01 martin-steinegger