
Exomiser v14.0.0 is not recognizing unstructured key meta-information as a valid header line.

Open ShrutiMarwaha opened this issue 1 year ago • 3 comments

Hi there,

  1. Does Exomiser support only certain VCF format versions and above?
  2. Exomiser v14.0.0 is not recognizing an unstructured meta-information line with the key “##META” as a valid header line. I have some old VCFs with 16 header rows that start with “##META” and are unstructured meta-information lines. However, this seems to be allowed in VCFv4.4 (page 5, section 1.4). When I run this VCF through Exomiser, I get the following error:

     htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Invalid VCFSimpleHeaderLine: key=META name=null, for input source: file:///oak/stanford/groups/euan/UDN/gateway/data/UDN644400/WES/FromSequencingCore/WES_blood_hg19/Processed/UDN644400_family_merged.vcf.gz
         at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:97) ~[htsjdk-3.0.5.jar:3.0.5]
         at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:82) ~[htsjdk-3.0.5.jar:3.0.5]
         at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:117) ~[htsjdk-3.0.5.jar:3.0.5]
         at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:81) ~[htsjdk-3.0.5.jar:3.0.5]
         …….. ……………………….
  • ##fileformat=VCFv4.0

  • Lines with ##META in the header of my VCF:

     ##META='Cassandra_version=15.4.29'
     ##META='Pileup_File=/stornext/snfswgl/next-gen/Illumina/Instruments/D00143/170130_D00143_0967_BHF7KHBCXY/Results/Project_170130_D00143_0967_BHF7KHBCXY/Sample_HF7KHBCXY-2-ID10/SNP/7
     ##META='Annovar-refGene(hg19).Version=2013-08-23'
     ##META='Annovar-knownGene(hg19).Version=2013-08-23'
     ##META='Annovar-ensgene(hg19).Version=2013-08-23'
     ##META='Annovar-ensgene(GRCh37_MT).Version=2013-08-23'
     ##META='DbNSFP.Description=The dbNSFP is an integrated database of functional annotations from multiple sources for the comprehensive collection of human non-synonymous SNPs. v2.5.
     ##META='Hgmd.Database_version=null.Description=HGMD_PRO_2016.1.Downloaded=2016-07-8'
     ##META='1000 Genomes Phase 1.Description=SNPs Indels and SVs friom 1000 Genomes.Downloaded=2014-03-04'
     ##META='DbSNP.Description=NCBIs SNP database. v141 (GRCh37).Downloaded=2014-07-16'
     ##META='ARIC.Description=Allele freq from Aric cohort.Downloaded=2014-7-16'
     ##META='Mappability.Description=Encode 100bp alignability track. v1.Downloaded=2014-03-04'
     ##META='CgMaf.Description=Complete genomics variations from the reference genome identified across 54-genome subset of the 69 CG public genomes. Version 2.Downloaded=2014-03-04'
     ##META='ESP.Description=ESP5400 taken from 5400 samples drawn from multiple ESP cohorts and represents all of the ESP exome variant data. Version 1.Downloaded=2014-03-04'
     ##META='Encode.Description=Reglatory features from Encode. Taken from ensembl release 75.Downloaded=2014-03-04'
     ##META='Swissprot.Description=Uniprot gene annotation. Version 2014_02.Downloaded=2014-07-16'
     ##INFO=<ID=ReqIncl,Number=.,Type=String,Description="Site was required to be included in the VCF">

  • If I delete the rows with “##META” from the header of my VCF file, Exomiser runs successfully. However, I have several such VCFs and do not want to create new VCFs with modified headers. Is there a way to mitigate this?

Thanks, Shruti

Shruti Marwaha, PhD. Research Engineer, Stanford Center for Undiagnosed Diseases GREGoR (Genomics Research to Elucidate the Genetics of Rare disease) Stanford Site Stanford University

ShrutiMarwaha avatar May 09 '24 17:05 ShrutiMarwaha

That's weird. Did it run OK on earlier versions? Under the hood Exomiser uses the HTSJDK, so support for whatever version of VCF is entirely down to that. I think it only supports up to 4.2.

julesjacobsen avatar May 10 '24 10:05 julesjacobsen

Shruti, I'm pretty sure the issue is that META is now a defined header key and therefore requires the more structured META=<> format.

In VCF 4.3 (https://samtools.github.io/hts-specs/VCFv4.3.pdf) under the changes section 7.2, page 37

Introduced ##META header lines for defining phenotype metadata

This is shown on page 7 section 1.4.8

1.4.8 Sample field format

It is possible to define sample to genome mappings as shown below:

##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>
##META=<ID=Disease,Type=String,Number=.,Values=[None, Cancer]>
##META=<ID=Ethnicity,Type=String,Number=.,Values=[AFR, CEU, ASN, MEX]>
##META=<ID=Tissue,Type=String,Number=.,Values=[Blood, Breast, Colon, Lung, ?]>
##SAMPLE=<ID=Sample1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germline genome from unaffected",DOI=url>
##SAMPLE=<ID=Sample2,Assay=Exome,Ethnicity=CEU,Disease=Cancer,Tissue=Breast,Description="European patient exome from breast cancer">

So the line should have the form ##META=<ID....>, but that requirement applies to VCFv4.3. Your old files declare themselves as v4.0, so their unstructured ##META lines should be considered legal.
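
To make the distinction concrete, here is a minimal Python sketch that tells the structured ##META=<ID=...> form apart from the free-text form in your files. The function name `classify_meta_line` and the regular expression are illustrative assumptions of mine; this is not how HTSJDK itself parses headers.

```python
import re

# Assumption: a "structured" META line looks like ##META=<ID=...,...>,
# following the examples in section 1.4.8 of the VCFv4.3 spec.
STRUCTURED_META = re.compile(r"^##META=<ID=[^,>]+.*>\s*$")


def classify_meta_line(line: str) -> str:
    """Return 'structured', 'unstructured', or 'not-meta' for a header line."""
    if not line.startswith("##META="):
        return "not-meta"
    return "structured" if STRUCTURED_META.match(line) else "unstructured"


if __name__ == "__main__":
    for example in (
        "##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>",
        "##META='Cassandra_version=15.4.29'",
        "##fileformat=VCFv4.0",
    ):
        print(classify_meta_line(example), example, sep="\t")
```

Lines reported as "unstructured" are the kind that trigger the Invalid VCFSimpleHeaderLine error above.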

I think HTSJDK effectively supports VCFv4.3 read and VCFv4.2 writing, which would explain why the error is happening. It would be more useful if they could precisely support the version stated in the header or throw an error about the type not matching the version they do fully support. What they actually support isn't clearly defined outside of checking that the file starts with the header "##fileformat=VCFv4".
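
I'm not aware of an Exomiser option that skips these lines, so the workaround you already found (removing them) may be the practical route. If it helps, it can be scripted rather than done by hand. The following is a rough Python sketch, not an Exomiser or HTSJDK feature: the helper names `drop_unstructured_meta` and `clean_vcf` are made up for illustration, and it assumes the input is a gzip/bgzip-compressed VCF.

```python
import gzip
from typing import Iterable, Iterator


def drop_unstructured_meta(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, skipping header lines of the form ##META=... that
    do not use the structured ##META=<...> syntax."""
    in_header = True
    for line in lines:
        if in_header:
            if line.startswith("##META=") and not line.startswith("##META=<"):
                continue  # free-text META line: drop it
            if not line.startswith("#"):
                in_header = False  # first record reached; stop checking
        yield line


def clean_vcf(src: str, dst: str) -> None:
    """Write a copy of the compressed VCF `src` to `dst` without the
    unstructured ##META lines. Both paths are supplied by the caller."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(drop_unstructured_meta(fin))
```

Note that Python's gzip module writes ordinary gzip rather than BGZF, so if you need a tabix index the output would have to be recompressed with bgzip and re-indexed with tabix afterwards. It does still create new files, which I realise is what you were hoping to avoid.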

julesjacobsen avatar May 10 '24 12:05 julesjacobsen

Thanks Jules.

ShrutiMarwaha avatar May 10 '24 19:05 ShrutiMarwaha