HLA typing quality score evaluation

Open gunyorka opened this issue 7 months ago • 3 comments

Dear SpecHLA Team,

I have successfully run SpecHLA on hundreds of WXS samples, but I am now struggling to evaluate the results. The attached picture shows the quality scores for one sample, imported from hla_detailed_results.txt. Could you please help me interpret these numbers? I haven't found a clear guide for them. I was expecting the scores to range from 0 to 100 and represent accuracy, but the high values at the end of the data make me question that assumption.

I checked the original .txt file manually, and the same number appears there: DRB1*03:01;1371.000;0.077;0.064;0.047

Image

Thank you very much for the help,

Benjamin

gunyorka, Jun 10 '25 05:06

Dear Benjamin,

Thank you for your question.

For SpecHLA results:

  • For the first few genes, the score is capped at 100 and represents sequence similarity.

  • For the remaining genes (typically class II, such as DRB1), the score is calculated as sequence length × sequence similarity, so the values can exceed 100.

This is expected behavior and not an error. Let us know if you have further questions!
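
As a rough illustration of the two score scales, here is a minimal Python sketch that splits one semicolon-separated entry of the kind quoted in this thread and labels its score. The names `AlleleEntry`, `parse_entry` and `score_kind` are hypothetical and not part of SpecHLA. The meaning of the fields after the score is not described in this thread, so they are kept as unnamed extras, and the "above 100" check is only a heuristic derived from the explanation above.

```python
from typing import NamedTuple, Tuple


class AlleleEntry(NamedTuple):
    allele: str
    score: float
    extra: Tuple[float, ...]  # trailing fields; their meaning is not stated in this thread


def parse_entry(entry: str) -> AlleleEntry:
    """Split one 'allele;score;...' entry into its parts."""
    allele, score, *rest = entry.strip().strip('"').split(";")
    return AlleleEntry(allele, float(score), tuple(float(x) for x in rest))


def score_kind(score: float) -> str:
    """Heuristic (an assumption): a score above 100 cannot be a capped similarity."""
    return "length_scaled" if score > 100 else "similarity_capped_at_100"


entry = parse_entry("DRB1*03:01;1371.000;0.077;0.064;0.047")
print(entry.allele, entry.score, score_kind(entry.score))
```

Scores on the two scales are not directly comparable, so any filtering would need to treat the two groups of genes separately.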

wshuai294, Jun 10 '25 06:06

Dear Mr Wang,

Thank you for the fast response. What cutoffs should I use for the different genes to decide whether the HLA genotyping was accurate enough to use the result in downstream analyses? And is this the right score to base such a decision on?

In another sample I was left with ambiguous results for DPB1: "DPB1*03:01;99.868;0.111;0.042;0.029" "DPB1*29:01;99.868;0.000;0.001;0.001"

How should I interpret such a result, where the score is the same for two different alleles?

Thank you, Benjamin

gunyorka, Jun 10 '25 06:06

For the first question, it is hard to say what cutoff to use. Some manual inspection may help. Also, a higher depth (i.e. more supporting reads) leads to a more accurate result. For the second question, if the similarity scores are nearly identical, population-specific allele frequencies are useful. In your case, as DPB1*03:01 has a higher frequency than DPB1*29:01, it is better to select DPB1*03:01.
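
The tie-breaking idea can be sketched in a few lines of Python. This is only an illustration, not SpecHLA functionality: `pick_by_frequency` is a hypothetical helper, the `tolerance` for treating scores as tied is an arbitrary assumption, and the frequency values below are placeholders that should be replaced with frequencies from a population reference matching your cohort.

```python
from typing import Dict, List, Tuple


def pick_by_frequency(
    candidates: List[Tuple[str, float]],
    frequency: Dict[str, float],
    tolerance: float = 0.01,
) -> str:
    """Among alleles whose score is within `tolerance` of the best score,
    return the one with the highest population frequency."""
    best = max(score for _, score in candidates)
    tied = [allele for allele, score in candidates if best - score <= tolerance]
    # Alleles missing from the frequency table are treated as frequency 0.
    return max(tied, key=lambda allele: frequency.get(allele, 0.0))


# Scores taken from the two entries quoted in this thread.
candidates = [("DPB1*03:01", 99.868), ("DPB1*29:01", 99.868)]

# Placeholder values for illustration only, not real population frequencies.
placeholder_frequency = {"DPB1*03:01": 0.05, "DPB1*29:01": 0.001}

print(pick_by_frequency(candidates, placeholder_frequency))  # DPB1*03:01
```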

wshuai294, Jun 10 '25 06:06