gffcompare

Stringent comparison of CDS using --strict-match

Open etvedte opened this issue 1 year ago • 0 comments

Greetings,

I am interested in computing accuracy metrics for a query GFF against a reference. The reference/query files have both CDS and exon features. I want to perform accuracy calculations using strict terminal boundaries, operating on CDS specifically.

I did some testing and made the following observations:

  1. Exon features are prioritized for accuracy metrics, but CDS can still be used. That is, starting from a file with both CDS and exon rows, removing the exon rows changes the accuracy values, but removing the CDS rows does not.
  2. The -e parameter reads "max. distance (range) allowed from free ends of terminal exons of reference transcripts." Setting -e 0 on a CDS file changes only the exon-level accuracy metrics; the transcript- and locus-level metrics are unchanged. Unsurprisingly, sensitivity and precision dip slightly with -e 0.
  3. The documentation mentions the following under the transcript-level description, but not in the parameter list: "Using --strict-match option can make the accuracy estimation at this level much more stringent by only allowing a limited variation of the outer coordinates of the terminal exons (by at most 100 bases by default, but this value can be changed with the -e option)." When I set --strict-match -e 0, the exon- and intron-level metrics are the same as with -e 0 alone, but the intron-chain, transcript, and locus-level metrics all decrease.
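
For anyone who wants to reproduce these tests, here is a minimal Python sketch of the setup. The helper names `filter_features` and `gffcompare_args` are hypothetical, and the feature types to keep (for example whether parent `mRNA` rows must be retained) are an assumption that depends on the input files. The only gffcompare options used are `-r`, `-o`, `-e` and `--strict-match`.

```python
def filter_features(gff_lines, keep):
    """Yield comment/blank lines and rows whose feature type (column 3) is in `keep`.

    Assumption: tab-separated GFF/GTF rows; `keep` is chosen by the caller,
    e.g. {"mRNA", "CDS"} to drop exon rows but retain parent records.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] in keep:
            yield line


def gffcompare_args(reference, query, prefix, max_dist=None, strict=False):
    """Build (but do not run) a gffcompare command line as an argument list."""
    args = ["gffcompare", "-r", reference, "-o", prefix]
    if max_dist is not None:
        args += ["-e", str(max_dist)]   # max distance for terminal exon ends
    if strict:
        args.append("--strict-match")   # apply the -e range at transcript level
    args.append(query)
    return args


# The three configurations compared above (file names are placeholders):
runs = [
    gffcompare_args("ref.gff", "query.gff", "default"),
    gffcompare_args("ref.gff", "query.gff", "e0", max_dist=0),
    gffcompare_args("ref.gff", "query.gff", "strict_e0", max_dist=0, strict=True),
]
```

Each list in `runs` can be passed to `subprocess.run` once gffcompare is on the PATH; the filtered lines can be written to a temporary file and used as the reference or query.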

Given the observations above, I think --strict-match -e 0 is the correct way to stringently compare CDS. Do you agree, or do you have a different suggestion? The --strict-match parameter isn't clearly described in the documentation. Does "only allowing a limited variation of the outer coordinates of the terminal exons" mean that with default gffcompare (--strict-match not specified), transcripts can have very different terminal exon boundaries and still count as matching, so long as their intron chains match?
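
To make the question concrete, here is a small Python sketch of the two matching rules as I understand them from the documentation text quoted above. It is an illustration of my reading, not gffcompare's actual implementation: the function names are hypothetical, exons are plain `(start, end)` tuples, and only multi-exon transcripts are handled.

```python
def intron_chain(exons):
    """Return introns as (end of exon i, start of exon i+1) pairs."""
    exons = sorted(exons)
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def chains_match(ref, query):
    """True if both transcripts are multi-exon and share the same intron chain."""
    chain = intron_chain(ref)
    return bool(chain) and chain == intron_chain(query)


def strict_match(ref, query, max_dist=100):
    """Intron-chain match plus outer ends within `max_dist` bases (my reading of -e)."""
    if not chains_match(ref, query):
        return False
    ref, query = sorted(ref), sorted(query)
    start_diff = abs(ref[0][0] - query[0][0])    # outer start of first exon
    end_diff = abs(ref[-1][1] - query[-1][1])    # outer end of last exon
    return start_diff <= max_dist and end_diff <= max_dist


# Same introns, but both outer ends differ by 50 bases.
ref = [(100, 200), (300, 400), (500, 650)]
query = [(150, 200), (300, 400), (500, 600)]
print(chains_match(ref, query))               # intron chains are identical
print(strict_match(ref, query, max_dist=100)) # within the default 100-base range
print(strict_match(ref, query, max_dist=0))   # fails when no variation is allowed
```

Under this reading, the example pair would count as a match by intron chain alone and under a 100-base range, but not with a range of 0, which is the behaviour I am asking you to confirm or correct.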

As an aside, I am not sure why one would want to calculate accuracy using -e 0 alone, which allows some fuzziness in the parent-level features but is strict at exon/intron level. Have you observed any specific use cases for this?

Thanks,

Eric

etvedte, Jul 29 '24 12:07