
SEQC2: Some high confidence SNVs and INDELs in VCF are outside of regions defined by High-Confidence_Regions_v1.2.bed

Open luederm opened this issue 3 years ago • 1 comments

I downloaded high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz and high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz from the FTP site, but noticed that some of the variants fall outside the regions defined by High-Confidence_Regions_v1.2.bed. This caused discrepancies when I compared my results (after filtering with the supplied BED file) to the HC reference call set. The commands below count the calls in each file that lie outside the BED regions.

bcftools view -T ^High-Confidence_Regions_v1.2.bed high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs:    113
Number of INDELs:  0
Number of MNPs:    0
Number of others:  0
Number of sites:   113
bcftools view --targets-overlap 0 -T ^High-Confidence_Regions_v1.2.bed high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs:    0
Number of INDELs:  320
Number of MNPs:    0
Number of others:  0
Number of sites:   320
bcftools view --targets-overlap 1 -T ^High-Confidence_Regions_v1.2.bed high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs:    0
Number of INDELs:  297
Number of MNPs:    0
Number of others:  0
Number of sites:   297
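
For anyone who wants to cross-check counts like these without bcftools, here is a minimal pure-Python sketch of the same kind of containment test. It is not how bcftools is implemented, and it is not guaranteed to reproduce the numbers above exactly. It assumes the usual conventions (BED intervals are 0-based and half-open, VCF POS is 1-based) and offers two modes that are meant to mirror, as I understand the bcftools documentation, `--targets-overlap 0` (only POS is tested) and `--targets-overlap 1` (the whole REF span is tested). The function names (`load_bed`, `overlaps`, `outside_calls`) and the toy coordinates are made up for illustration.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted, merged list of (start, end)}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    regions = {}
    for chrom, intervals in raw.items():
        merged = []
        for start, end in sorted(intervals):
            # Merge overlapping or touching intervals so lookups stay simple.
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        regions[chrom] = merged
    return regions


def overlaps(regions, chrom, start, end):
    """True if the 0-based half-open span [start, end) overlaps any region."""
    intervals = regions.get(chrom, [])
    # Index of the rightmost interval whose start is < end.
    i = bisect.bisect_left(intervals, (end, -1)) - 1
    return i >= 0 and intervals[i][1] > start


def outside_calls(vcf_lines, regions, mode="pos"):
    """Return (chrom, pos, ref) for records that fall outside the regions.

    mode="pos":    only the 1-based POS is tested.
    mode="record": the span POS .. POS + len(REF) - 1 is tested.
    """
    outside = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref = line.split("\t")[:4]
        pos = int(pos)
        start = pos - 1  # convert 1-based POS to 0-based
        end = start + (len(ref) if mode == "record" else 1)
        if not overlaps(regions, chrom, start, end):
            outside.append((chrom, pos, ref))
    return outside


if __name__ == "__main__":
    # Toy data, not taken from the SEQC2 files.
    bed = ["chr1\t100\t200"]
    vcf = [
        "chr1\t99\t.\tACG\tA",   # deletion starting before the region
        "chr1\t101\t.\tC\tT",    # first base inside the region
    ]
    regions = load_bed(bed)
    print(outside_calls(vcf, regions, mode="pos"))
    print(outside_calls(vcf, regions, mode="record"))
```

The two modes matter mostly for INDELs: a deletion whose POS sits just before a region can still reach into it, so it counts as outside in "pos" mode but not in "record" mode. That is consistent with the INDEL count dropping from 320 to 297 between the two commands above.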

luederm, Sep 27 '22 15:09

Thanks for pointing that out. Some calls outside the high-confidence regions were left in those files. I'll make a note of that in the README and release corrected versions of the high-confidence sSNV and sINDEL files.

litaifang, Sep 27 '22 20:09

SEQC2 has released an update. The README.md there explains the differences and the reasons for them: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest/

litaifang, Oct 21 '22 15:09