deepvariant hap.py benchmarking

Hello,

Sorry for writing again about some questions in the analysis, I don't have any expertise on DeepVariant/variant benchmarking around me :(

I wanted to benchmark identified variants using hap.py as you suggest in the tutorial, and the output surprised me a bit. For the variants I called the recall/precision were from 0 to 0.4, which is very low... I did not change default pacbio parameters when running DeepVariant, except for asking to keep supplementary alignments. Maybe it is partially because of the complexity of HLA region, but I am not sure what is the reason. What do you think, is there something obviously wrong?

singularity exec --bind /usr/lib/locale/ \
docker://google/deepvariant:${BIN_VERSION} \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type PACBIO \
    --ref $REFERENCE \
    --reads $BAM_FILE \
    --make_examples_extra_args=keep_supplementary_alignments=true \
    --output_vcf  $VCF_FILE \
    --num_shards 12 \
    --regions chr6:32541543-32701886

I am using version 1.6.1.

Kind regards, Alisa

Oct 18 '24 12:10 alisamatisse

Great question, always happy to help!

Setting keep_supplementary_alignments=true shouldn't dramatically change the accuracy.

Is it possible for you to share the BAM and VCF/truth set used? One possibility is that the truth set doesn't match HG003. For example if you run happy for HG003 using HG002 truth set the accuracy would be around 0.4.

Oct 18 '24 17:10 lucasbrambrink

That particular region in HLA is likely to have multiple different alleles mapped to it due to the diversity of structural variation in the DRB loci. Hg38 (no_alts) only has DRB5 in chr6. If your sample has DRB1 alleles that are linked to DRB3/4, these reads often map here.

(see fig 4) https://pmc.ncbi.nlm.nih.gov/articles/PMC10849470/

Oct 25 '24 17:10 jrharting

Hi Lucas and John,

Thank you so much for your input. I’m currently trying to figure out what’s happening… with different references..!

Kind regards, Alisa

Oct 25 '24 22:10 alisamatisse

Hey, sorry to reopen the issue. I called variants against HLA class II genes region for HG003 using HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed

and have very low precision values :( probably I am doing wrong variant calling in the HLA... also for some reason my coverage for this region is also very low, although WG is okay. I am not sure if it was reported elsewhere.

sorry for dumb questions, I am trying to figure this out on my own (and with AI)

Mar 07 '25 17:03 alisamatisse

@alisamatisse ,

You need to provide the HLA bed with -T option in hap.py cause it seems like it's comparing against everything. Can you paste your hap.py command here?

Mar 11 '25 17:03 kishwarshafin

Hi @kishwarshafin,

I think the problem is that it is HLA... I did probably very stupid reference-guided alignement and actually received lower coverage for the whole HLA, so starting from this deepvariant cannot call variants properly. However, this strategy worked for another less complicated region.

<...>

REGION="chr6:29668433-33512544"

<...>


    singularity exec docker://jmcdani20/hap.py:v0.3.12 \
        /opt/hap.py/bin/hap.py \
        -f "${BENCHMARK_BED}" \
        -r "${REFERENCE}" \
        -o "${OUTPUT_FILE}" \
        --engine="${ENGINE}" \
        ${PASS_ONLY} \
        -l "${REGION}" \
        test_demo/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
        "${VCF}"

Mar 19 '25 15:03 alisamatisse