Cluster based on e-value or TM-score?
Expected Behavior
When I run easy-cluster on a set of RFdiffusion-generated structures, I observe that Foldseek's default clustering, which is based on e-value, gives the opposite trend to clustering with a TM-score threshold (TM-score cutoff 0.6).
Current Behavior
Light blue is the total number of scaffolds at each length, and dark blue is the number of clusters. Why does clustering by e-value give fewer clusters as length increases, while clustering by TM-score shows the opposite trend?
Which clustering criterion do you recommend for calculating the structural diversity of a structure generation model?
Cluster by e-value: foldseek easy-cluster pdb/ merge tmp/ -c 0.8
Cluster by TM-score: foldseek easy-cluster pdb/ merge tmp/ -c 0.8 --tmscore-threshold 0.6
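To compare the two runs on equal footing, the per-length counts described above can be computed directly from the cluster table. The following is a minimal sketch, not part of Foldseek itself. It assumes that easy-cluster writes a two-column, tab-separated file (representative, member), expected here as merge_cluster.tsv for the commands above, and that `lengths` is a user-supplied mapping from structure name to scaffold length (a hypothetical input). Each cluster is attributed to the length of its representative, which is also an assumption.

```python
import csv
from collections import defaultdict


def read_clusters(tsv_path):
    """Read a two-column TSV (representative, member) into {rep: {members}}."""
    clusters = defaultdict(set)
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            clusters[row[0]].add(row[1])
    return dict(clusters)


def diversity_by_length(clusters, lengths):
    """Return {length: (n_structures, n_clusters, n_clusters / n_structures)}.

    `lengths` maps structure name -> scaffold length (user-supplied).
    A cluster is counted under the length of its representative.
    """
    totals = defaultdict(int)
    n_clusters = defaultdict(int)
    for rep, members in clusters.items():
        n_clusters[lengths[rep]] += 1
        for member in members:
            totals[lengths[member]] += 1
    return {
        length: (totals[length], n_clusters[length], n_clusters[length] / totals[length])
        for length in sorted(totals)
    }
```

Running this once per clustering criterion and comparing the ratio column shows the trend with length without relying on the plot.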
I think it might be based on TM-score, because the groups I get with the default settings (foldseek easy-cluster pdb/ result tmp) are the same as the groups I get when setting --alignment-type 1.
--alignment-type INT How to compute the alignment:
0: 3di alignment
1: TM alignment
2: 3Di+AA [2]
However, the default alignment type is 2 (3Di+AA). I'm not sure whether a TM-score is also calculated with this alignment type.
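One way to check whether two runs really produce the same groups is to compare them as partitions, ignoring which member was chosen as representative. This is a small sketch under the assumption that each run wrote a two-column, tab-separated cluster file (representative, member); the file paths passed in are up to the user. Identical groupings on one dataset show only that the two settings agree there, not that the same score is used internally.

```python
import csv
from collections import defaultdict


def load_partition(tsv_path):
    """Load a (representative, member) TSV as a set of frozensets of names."""
    groups = defaultdict(set)
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            groups[row[0]].add(row[0])  # the representative belongs to its own cluster
            groups[row[0]].add(row[1])
    return {frozenset(members) for members in groups.values()}


def same_grouping(tsv_a, tsv_b):
    """True if both files describe the same clusters, regardless of representative."""
    return load_partition(tsv_a) == load_partition(tsv_b)
```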
The threshold you set is crucial. Here are some parameters to consider:
- -c controls the alignment coverage (default = 0.8); I recommend increasing it to 0.9.
- --cluster-reassign 1 addresses issues caused by transitive clustering, where coverage violations can occur.
- -e adjusts the e-value; setting it to a more stringent level should improve accuracy.
- --tmscore-threshold sets the TM-score threshold for the alignment (compatible with all alignment types). However, this only works well for superposable structures, which many multi-domain proteins are not.
- --lddt-threshold sets the LDDT score threshold for the alignment (compatible with all alignment types). It also works for multi-domain proteins.
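The parameters above can be combined in a single easy-cluster call. The sketch below only assembles the command line from the flags named in this thread; the helper name `build_easy_cluster_cmd` and its keyword arguments are hypothetical, and the threshold values are illustrative (0.9 coverage from the recommendation above, 0.6 TM-score from the original question). It does not run Foldseek.

```python
import shlex


def build_easy_cluster_cmd(input_dir, out_prefix, tmp_dir, coverage=0.9,
                           cluster_reassign=True, evalue=None,
                           tmscore_threshold=None, lddt_threshold=None):
    """Assemble a foldseek easy-cluster argument list; optional flags are skipped when None."""
    cmd = ["foldseek", "easy-cluster", input_dir, out_prefix, tmp_dir,
           "-c", str(coverage)]
    if cluster_reassign:
        cmd += ["--cluster-reassign", "1"]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    if tmscore_threshold is not None:
        cmd += ["--tmscore-threshold", str(tmscore_threshold)]
    if lddt_threshold is not None:
        cmd += ["--lddt-threshold", str(lddt_threshold)]
    return cmd


cmd = build_easy_cluster_cmd("pdb/", "merge", "tmp/", tmscore_threshold=0.6)
print(shlex.join(cmd))
# foldseek easy-cluster pdb/ merge tmp/ -c 0.9 --cluster-reassign 1 --tmscore-threshold 0.6
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Foldseek is installed. For multi-domain designs, swapping `tmscore_threshold` for `lddt_threshold` follows the advice in the last bullet.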