SINA icon indicating copy to clipboard operation
SINA copied to clipboard

Problems with LCA quorum

Open jgerken opened this issue 5 years ago • 7 comments

Hi Elmar,

we stumbled across a weird problem when classifying sequences with SINA. One of our users searched for a partial sequences which as classified as Archaea;Halobacterota; despite eight of the ten nearest neighbours having a classification of Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus;.

I created a minimal working example with a SILVA dataset reduced to the ten nearest neighbours. When you execute

./sina -i aligned.query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.95 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:26:42 [log] lca_tax_slv: Archaea;Halobacterota;

I then searched for a very similar full-length sequence in SILVA an ran the following query:

./sina -i aligned.full-length-query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.5 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:26:42 [log] lca_tax_slv: Archaea;Halobacterota;

But when I change the the --lcq-quorum to 0.79 I get the expected result:

./sina -i aligned.full-length-query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.5 --lca-quorum 0.79 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:30:17 [log] lca_tax_slv: Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus;

Between the 2nd and the 3rd example, there is a floating point precision problem. I think the ratio that is compared with the value of lca_quorum needs to be rounded according to the precision of the floating point type used.

The difference between the first and the other two examples is that the first run returns one of the two sequences with a deviating classification of Archaea;Halobacterota;Methanosarcinia;Methanosarciniales;Methanosarcinaceae;Methanosarcina; as most similar neighbour whereas in the other two runs a sequence of the majority is selected. Thus, it seems that the LCA quorum always tries to use the taxonomy of the most similar neighbour instead of trying find the most common classification that fulfils the quorum.

Could you have a look?

Thanks Jan

minimal-example.tar.gz

jgerken avatar Aug 17 '20 19:08 jgerken

I'll have a look at it. Note on the side, you can write -vvv instead of -v -v -v. I assume you were using the 1.7.0?

epruesse avatar Aug 17 '20 21:08 epruesse

Thanks :) Yes, I used 1.7.0 but also tried 1.6.1 and 1.2.11 and the classification result was the same with all versions (did not check the nearest neighbors in details with the old versions, though).

jgerken avatar Aug 17 '20 21:08 jgerken

mind if I add the example to the unit tests?

epruesse avatar Aug 18 '20 23:08 epruesse

Simple enough. Classic int/float truncation issue. Should work now. I'm pushing 1.7.1, should be in bioconda within a day or so.

epruesse avatar Aug 19 '20 00:08 epruesse

mind if I add the example to the unit tests?

No, go ahead.

The submitted fix only fixes the difference between the second and third execution. The problem that the first test case

./sina -i aligned.query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.95 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -vvv

returns Archaea;Halobacterota; instead of Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus; despite 8 of the 10 top hits having the latter classification is still present in 1.7.1.

jgerken avatar Aug 19 '20 13:08 jgerken

Darn - didn't see that part. Case of TL'DR I suppose.

epruesse avatar Aug 19 '20 16:08 epruesse

You are correct in your observation. The algorithm does an LCA between the best match and the others. Guess that was usually so close to what it's supposed to do that no-one notices. This will take me a day or two to fix. Don't have enough time in the evenings.

epruesse avatar Aug 21 '20 05:08 epruesse