SEQC2: Some high confidence SNVs and INDELs in VCF are outside of regions defined by High-Confidence_Regions_v1.2.bed
I downloaded high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz and high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz from the FTP site but noticed that some of the variants fall outside the regions defined by High-Confidence_Regions_v1.2.bed. This caused problems when I compared my results (after filtering with the supplied BED file) against the HC reference call set.
bcftools view -T ^High-Confidence_Regions_v1.2.bed high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs: 113
Number of INDELs: 0
Number of MNPs: 0
Number of others: 0
Number of sites: 113
bcftools view --targets-overlap 0 -T ^High-Confidence_Regions_v1.2.bed high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs: 0
Number of INDELs: 320
Number of MNPs: 0
Number of others: 0
Number of sites: 320
bcftools view --targets-overlap 1 -T ^High-Confidence_Regions_v1.2.bed high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs: 0
Number of INDELs: 297
Number of MNPs: 0
Number of others: 0
Number of sites: 297
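The gap between the two INDEL counts (320 vs. 297) is presumably due to how overlap is decided: as far as I understand the bcftools documentation, --targets-overlap 0 tests only POS, while --targets-overlap 1 tests the whole record (POS through the end of REF). Below is a minimal Python sketch of that distinction, not bcftools' actual implementation. The function names and the toy interval are hypothetical, and it assumes BED intervals are 0-based half-open while VCF POS is 1-based.

```python
def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def overlaps(regions, chrom, pos, ref_len=1):
    """True if 1-based positions pos..pos+ref_len-1 touch any BED interval.

    ref_len=1 mimics a POS-only test; ref_len=len(REF) mimics a
    whole-record test.
    """
    first, last = pos - 1, pos - 1 + ref_len  # convert to 0-based half-open
    # Linear scan: fine for a sketch, slow for a genome-wide BED file.
    return any(s < last and first < e for s, e in regions.get(chrom, []))


# Toy example (hypothetical data): one region covering 1-based chr1:101-200.
regions = load_bed(["chr1\t100\t200"])

# A 3-bp deletion written as POS=98, REF=ACGT spans positions 98..101.
pos_only = overlaps(regions, "chr1", 98)                      # POS alone
whole_record = overlaps(regions, "chr1", 98, ref_len=len("ACGT"))
print(pos_only, whole_record)
```

In this toy case the deletion's POS lies outside the region but its REF span reaches into it, so a POS-only exclusion keeps the record while a record-level exclusion drops it. That is consistent with the second count being lower, though I have not checked the 23 differing sites individually.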
Thanks for pointing that out. Some calls outside the high-confidence regions were left in those files. I'll make a note of that in the README and release corrected versions of the high-confidence sSNV/sINDEL files.
SEQC2 has posted an update; the README.md there explains the differences (and why):
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest/