NanoPlot icon indicating copy to clipboard operation
NanoPlot copied to clipboard

Mean read quality calculation

Open alvanuffelen opened this issue 1 year ago • 5 comments

The mean read quality is calculated by:

  1. Converting the mean phred quality score of each read to the base-calling error probability
  2. Calculating the mean of all the mean error probabilities from step 1
  3. Converting the total mean error probability from step 2 back to a phred quality score

In Nanomath, for step 1, the mean phred score of each read is interpreted as an int instead of a float.

Does this conversion not introduce an error in the mean read quality? If I have a read with a mean quality of 16.8, the script converts it to 16 and the probability becomes 0.02512. But according to the formula $P = 10^{-\frac{10}{Q}}$, it should be 0.02089.

Using this test file (columns are read id, length and mean quality) and without converting to int:

import pandas as pd
import numpy as np
from math import log 

df = pd.read_table("test.txt",
                   header=None,
                   names=['id', 'length', 'quality'])
                   
convert_to_probs = lambda q: 10 ** (-q/10)
vfunc = np.vectorize(convert_to_probs)
probs = vfunc(df['quality'])
-10 * log(probs.sum() / len(probs), 10)

The mean read quality 13.22437. If I convert the scores to int:

probs = vfunc(df['quality'].astype(int))
-10 * log(probs.sum() / len(probs), 10)

The mean read quality is 12.71993, the same as Nanoplot reports (see attached file). NanoStats.txt

Is there a reason to convert the mean score of each read to int before calculating the probabilities?

alvanuffelen avatar Jul 10 '24 10:07 alvanuffelen

I think you are right and this is an error! I'll do my best to fix this soon, or you are welcome to open a pull request :)

wdecoster avatar Jul 10 '24 13:07 wdecoster

I think you are right and this is an error! I'll do my best to fix this soon, or you are welcome to open a pull request :)

I notice the latest release is in Oct, 2023. Have the error been fixed now?

GZZHY79 avatar Aug 08 '24 11:08 GZZHY79

Wouter, This comment is being added here as another item to possibly consider if/when revisiting the mean qscore.

Not exactly this issue with the mean qscore, but having read your informative blog posts about them, I found an Issue on the dorado repo that might explain the qscore difference from what the basecaller reports and NanoMath/NanoPlot report.

Issue is https://github.com/nanoporetech/dorado/issues/937

It explains that, for whatever reason, the first 60 bases are skipped for the mean qual calc. Doing this does gives a qscore in agreement with basecaller values in my modest testing. This default of 60 is set in function get_mean_qscore_start_pos_by_model_name in file dorado/basecall/CRFModelConfig.cpp of the repo. Apparently models could set this as a value but usually do not, so 60 bases are typically skipped.

I use your method -- thanks! -- to reattach a mean qscore after trimming to guide user filtering of reads and will continue to take all bases into account. These differences would seem to argue more for cropping low-qual beginning bases than ignoring them in a calculation.

In any event, I thought I'd pass this along since I'd found the difference a puzzler.

best, jim

jbh-cas avatar Sep 16 '24 00:09 jbh-cas

Hi Jim,

Thanks for letting me know! Skipping some bases sounds entirely reasonable to me, dealing with issues because of adapter sequences and the start of the read. I might adopt that :-)

Best, Wouter

wdecoster avatar Sep 16 '24 12:09 wdecoster

Wouter,

Yes, adapters had occurred to me as well. Though, after trimming, ignoring those bases seems something of a fiction doesn't it (and I wonder how low qual prefixes affect OLC style algorithms).

Perhaps an option (options, I know, ugh) for nanopore or complete qscore. With the popularity of your programs it would bring visibility to this issue, which I think would be of value to the community.

best, Jim

On 09/16/2024 5:04 AM PDT Wouter De Coster @.***> wrote:

Hi Jim,

Thanks for letting me know! Skipping some bases sounds entirely reasonable to me, dealing with issues because of adapter sequences and the start of the read. I might adopt that :-)

Best, Wouter

— Reply to this email directly, view it on GitHub https://github.com/wdecoster/NanoPlot/issues/376#issuecomment-2352726662, or unsubscribe https://github.com/notifications/unsubscribe-auth/AELSO4JFD7LQVVYN4LTXPJLZW3COBAVCNFSM6AAAAABKUSPVVCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJSG4ZDMNRWGI. You are receiving this because you commented.Message ID: @.***>

jbh-cas avatar Sep 16 '24 18:09 jbh-cas