foldseek

Cluster based on e-value/TM-score?

Open Wangchentong opened this issue 1 year ago • 3 comments

Expected Behavior

When I run easy-cluster on a set of RFdiffusion-generated structures, I observe that Foldseek clustering based on e-value gives the opposite trend compared to clustering based on a TM-score threshold (TM-score cutoff 0.6).

Current Behavior

[image] The light blue is the total count of scaffolds at each length; the dark blue is the count of clusters. Why are there fewer clusters as length increases when using e-values, while the trend is the opposite when using TM-score? What clustering criterion do you recommend for calculating the structural diversity of a structure generation model?

Wangchentong avatar Oct 11 '24 04:10 Wangchentong

Cluster by e-value: foldseek easy-cluster pdb/ merge tmp/ -c 0.8
Cluster by TM-score: foldseek easy-cluster pdb/ merge tmp/ -c 0.8 --tmscore-threshold 0.6
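For readers who want to compare the two runs numerically, here is a minimal Python sketch of how the cluster count and a simple diversity ratio (clusters divided by total structures) could be computed. It assumes the clustering result is a two-column, tab-separated file (representative, member), which with the prefix `merge` used above would be named `merge_cluster.tsv`; check this against your own output. The function names `read_clusters` and `diversity` are hypothetical helpers, and the example data is made up.

```python
import csv
import io
from collections import defaultdict


def read_clusters(tsv_file):
    """Map each cluster representative to the list of its members.

    Assumes two tab-separated columns per line: representative, member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


def diversity(clusters):
    """Number of clusters divided by the total number of clustered structures."""
    total = sum(len(members) for members in clusters.values())
    return len(clusters) / total if total else 0.0


# Made-up example data standing in for a real cluster file.
example = io.StringIO(
    "a.pdb\ta.pdb\n"
    "a.pdb\tb.pdb\n"
    "c.pdb\tc.pdb\n"
    "c.pdb\td.pdb\n"
    "e.pdb\te.pdb\n"
)
clusters = read_clusters(example)
print(len(clusters), diversity(clusters))
```

To use it on a real run, open the cluster file (for example `open("merge_cluster.tsv")`, if that is what your run produced) and pass the file handle to `read_clusters`. To reproduce a per-length plot, the structures would first need to be grouped by length, which this sketch does not do.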

Wangchentong avatar Oct 11 '24 04:10 Wangchentong

I think it might be based on TM-score, because the clusters I get with the default settings (foldseek easy-cluster pdb/ result tmp) are the same as the clusters I get when setting --alignment-type 1.

 --alignment-type INT             How to compute the alignment:
                                  0: 3di alignment
                                  1: TM alignment
                                  2: 3Di+AA [2]

However, the default alignment type is 2: 3Di+AA. I'm not sure whether a TM-score is also calculated with this alignment type.

Huilin-Li avatar Oct 14 '24 14:10 Huilin-Li

The threshold you set is crucial. Here are some parameters to consider:

  • -c controls the alignment coverage (default = 0.8); I recommend increasing it to 0.9.
  • --cluster-reassign 1 addresses issues caused by transitive clustering, where coverage violations can occur.
  • -e sets the e-value threshold; making it more stringent should improve accuracy.
  • --tmscore-threshold sets the TM-score threshold for the alignment (compatible with all alignment types). However, this only works well for superposable structures, which many multi-domain proteins are not.
  • --lddt-threshold sets the LDDT score threshold for the alignment (compatible with all alignment types). It also works for multi-domain proteins.

martin-steinegger avatar Oct 14 '24 15:10 martin-steinegger
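As an illustration of how the options listed above could be combined, here is a minimal Python sketch that assembles `foldseek easy-cluster` command lines. The flags are the ones named in this thread; the helper `build_cluster_cmd`, the output prefix `merge_lddt`, and the LDDT threshold of 0.5 are assumptions chosen for the example, not recommended values. The sketch only builds and prints the commands; it does not run Foldseek.

```python
import shlex


def build_cluster_cmd(input_dir, prefix, tmp_dir, coverage=0.9, evalue=None,
                      tmscore=None, lddt=None, reassign=True):
    """Assemble a `foldseek easy-cluster` argument list (hypothetical helper)."""
    cmd = ["foldseek", "easy-cluster", input_dir, prefix, tmp_dir,
           "-c", str(coverage)]
    if reassign:
        # Reassign members to address coverage violations from transitive clustering.
        cmd += ["--cluster-reassign", "1"]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    if tmscore is not None:
        cmd += ["--tmscore-threshold", str(tmscore)]
    if lddt is not None:
        cmd += ["--lddt-threshold", str(lddt)]
    return cmd


# TM-score cutoff 0.6 as in the original question; coverage raised to 0.9.
tm_cmd = build_cluster_cmd("pdb/", "merge", "tmp/", tmscore=0.6)
# LDDT threshold of 0.5 is an illustrative value only.
lddt_cmd = build_cluster_cmd("pdb/", "merge_lddt", "tmp/", lddt=0.5)

print(shlex.join(tm_cmd))
print(shlex.join(lddt_cmd))
```

If `foldseek` is on your PATH, either list can be passed to `subprocess.run(cmd, check=True)`. Keeping the command as a list avoids shell quoting problems when paths contain spaces.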