GATKCommandLine.VariantFiltration header line gets truncated

Open • personalis opened this issue 9 years ago • 1 comment

When the VCF Reader parses our VCF's header, there is a particular line that gets truncated. The line is added by GATK (sorry, the line is very long):

##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-0-g7e26428,Date="Mon Apr 18 01:20:42 PDT 2016",Epoch=1460967642708,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=(path_redacted)/hs37d5.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=(path_redacted)/NA12878.gatk.VQSR.indel.vcf) mask=(RodBinding name= source=UNBOUND) out=(path_redacted)/NA12878.gatk.VQSR.indel.hardfilter.vcf filterExpression=[QD < 2.0 || ReadPosRankSum < -20.0 || 
InbreedingCoeff < -0.8 || FS > 200.0] filterName=[INDEL_SPECIFIC_FILTERS] genotypeFilterExpression=[] genotypeFilterName=[] clusterSize=3 clusterWindowSize=0 maskExtension=0 maskName=Mask filterNotInMask=false missingValuesInExpressionsShouldEvaluateAsFailing=false invalidatePreviousFilters=false invert_filter_expression=false invert_genotype_filter_expression=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">

Specifically, the truncated line ends with "filterExpression=[QD >". Note that the "<" in the original line has become ">", and that character is now the last one in the line. I suspect the parser is treating the less-than and greater-than symbols inside the quoted CommandLineOptions value as the angle brackets that delimit the header entry.
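To illustrate the suspicion, here is a minimal sketch of how splitting a header line on every angle bracket would produce exactly this symptom. This is not PyVCF's actual code: `naive_parse` and `bracket_aware_parse` are hypothetical helpers written for this example, and the header line is a shortened stand-in for the real one.

```python
import re

# Shortened stand-in for the real GATK header line (assumption: same shape,
# with "<" and ">" appearing inside the quoted CommandLineOptions value).
LINE = ('##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,'
        'CommandLineOptions="filterExpression=[QD < 2.0 || FS > 200.0]">')


def naive_parse(line):
    """Hypothetical parser that splits on ANY angle bracket."""
    parts = re.split(r"[<>]", line)
    key = parts[0][2:-1]   # drop the leading '##' and the trailing '='
    body = parts[1]        # everything up to the first '<' or '>' in the value
    return key, body


def bracket_aware_parse(line):
    """Hypothetical parser that only treats the outermost brackets as delimiters."""
    key, _, rest = line[2:].partition("=<")
    body = rest[:-1] if rest.endswith(">") else rest  # strip only the final '>'
    return key, body


def rewrite(key, body):
    """Re-emit a structured header line from its key and body."""
    return "##{}=<{}>".format(key, body)


if __name__ == "__main__":
    print(rewrite(*naive_parse(LINE)))          # value is cut off after "QD "
    print(rewrite(*bracket_aware_parse(LINE)))  # value survives intact
```

With the naive version, the body stops at the first "<" inside the filter expression, and re-emitting the line appends the closing ">" right after "QD ", which matches what I am seeing. The bracket-aware version is only one possible approach; a full fix would also need to respect quoting when splitting the comma-separated fields.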

personalis, Apr 19 '16 04:04

Thanks for the report. Which version of PyVCF are you using? I believe this has been fixed in the latest version (0.6.8).

martijnvermaat, Apr 19 '16 10:04