vclust icon indicating copy to clipboard operation
vclust copied to clipboard

Allow bidirectional coverage filtering

Open apcamargo opened this issue 1 year ago • 1 comments

When dereplicating sequences, it is crucial to cluster genomes that significantly overlap each other (i.e., high bidirectional coverage) to avoid clustering sequences where one is entirely contained within the other. For example:

idx1   idx2   id1     id2     tani       gani       ani        cov        num_alns   len_ratio
----   ----   -----   -----   --------   --------   --------   --------   --------   ---------
0      1      seq_1   seq_2   0.581627   0.412891   0.991158   0.416575   3          2.479452 
1      0      seq_2   seq_1   0.581627   1.000000   1.000000   1.000000   2          0.403315 

In the example above, it is there's no combination of --cov and --ani that prevent these two sequences from being clustered together. Although setting a high --tani cutoff could resolve this specific case, it would not be effective in more complex scenarios. For instance, if the goal is to connect pairs of genomes with a minimum bidirectional coverage of 75% and no minimum ANI, using --tani wouldn't work.

A related question: how does Clusty handle multiple edges between node pairs? For instance, with --metric ani --ani 0.1, both edges in the example would pass the filters. In such cases, how does Clusty determine the clustering weight? Does it use the maximum ANI value, the mean, the first/last value in the file, something else?

apcamargo avatar Aug 25 '24 06:08 apcamargo

Thank you for the great suggestion! In Vclust v1.1.1, we've introduced query and reference coverage (qcov and rcov) in the output, allowing for bidirectional coverage filtering to handle cases like the one you described.

When it comes to clustering with multiple edges between node pairs, Clusty selects the maximum value for clustering (e.g., ANI if --metric ani). All clustering algorithms in Clusty, except for the Leiden algorithm, are threshold-based and do not rely on weights (i.e., a genome is clustered if it meets the similarity threshold with a centroid, closest, or furthest member).

For the Leiden algorithm, if you want to include coverage information in the edge weight, you can use the gani metric. This is calculated as the number of identical nucleotides divided by the length of the query sequence (i.e., ANI multiplied by query coverage), which aligns with the approach used by IMG/PR.

aziele avatar Sep 23 '24 13:09 aziele

Hi @aziele, Going back to this issue, would it be possible to set a ani cutoff while using gani as edge weights? For instance:

vclust cluster -i vclust/ani.tsv -o vclust/clusters.tsv --ids vclust/ani.ids.tsv \
  --algorithm leiden --leiden-resolution 1.0 --qcov 0.95 --rcov 0.95 --ani 0.95 --metric gani

I expected that this command would first filter the edges based on --qcov, --rcov, and --ani and then cluster the graph using gani as edge weights. However, I get the following error when running it:

vclust: error: gani threshold must be above 0. Specify the option: --gani

Which implies that --gani would be used for edge filtering (not sure if --ani would be ignored).

apcamargo avatar Nov 21 '25 11:11 apcamargo

Hi @apcamargo,

Yes, you can do this. The only thing you need to specify in your command is the --gani threshold. For example, if you want to keep all edges that passed the ani & qcov & rcov filters and only use gani as the edge weight, set --gani to a very small value greater than 0 (for example --gani 0.0001) and keep it as a metric (--metric gani). In that setup, --ani, --qcov, and --rcov will perform the filtering, and the graph will then be built using gani as the metric.

vclust cluster -i vclust/ani.tsv -o vclust/clusters.tsv --ids vclust/ani.ids.tsv \
  --algorithm leiden --leiden-resolution 1.0 --qcov 0.95 --rcov 0.95 --ani 0.95 --metric gani --gani 0.0001

aziele avatar Nov 21 '25 11:11 aziele

Thanks, @aziele!

Is there a specific reason to require --gani in this case? If the parameter is not actually needed, the current error message may be confusing for users.

apcamargo avatar Nov 21 '25 11:11 apcamargo

There's no real reason for this - it's just how the current argument parser is set up. Right now it always expects a threshold for the metric used in clustering, even when it's not actually needed (like in your case). We'll fix this in version 2: the interface will be simpler, and we'll finally add a query-vs-database search mode (similar to BLAST).

Thanks for catching this, @apcamargo!

aziele avatar Nov 21 '25 13:11 aziele

Thanks, @aziele!

apcamargo avatar Nov 22 '25 02:11 apcamargo