
two bugs when parsing vcf

Open johannesgeibel opened this issue 5 years ago • 3 comments

Hi, I actually found two bugs when parsing VCFs with samplot vcf (clean conda installation of samplot 1.0.19). As my Python knowledge is quite limited, I cannot tell exactly where the problems come from, but I found working fixes. The first bug appeared when parsing CIPOS and CIEND.

cipos = "--start_ci '%s,%s'" % (abs(v[0]), abs(v[1]))

somehow does not treat v[0] and v[1] as integers, so this fix helped:

cipos = "--start_ci '%s,%s'" % (abs(int(v[0])), abs(int(v[1])))
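For illustration, a minimal sketch of why the cast matters. It assumes the CIPOS values arrive as strings, which matches the behaviour described above but has not been checked against the samplot source. `format_start_ci` is a hypothetical helper, not a samplot function.

```python
def format_start_ci(v):
    """Build the --start_ci argument from a CIPOS pair.

    Hypothetical helper: `v` is assumed to be a 2-item sequence whose
    entries may be ints or numeric strings such as "-10".
    """
    # abs() raises TypeError on a str, so convert to int first.
    lower, upper = int(v[0]), int(v[1])
    return "--start_ci '%s,%s'" % (abs(lower), abs(upper))


print(format_start_ci(("-10", "25")))  # string input
print(format_start_ci((-10, 25)))      # integer input
```

Calling `int()` on a value that is already an integer is harmless, so the same line handles both cases.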

The second one only appeared for VCFs that contained BND calls. variant.info.get("CHR2") somehow returned a value for every record, and therefore the length of every variant was set to 1. Setting translocation_chrom to None and only overwriting it if svtype was "BND" or "TRA" solved the problem:

for var_count, variant in enumerate(vcf):
    translocation_chrom = None
    svtype = variant.info.get("SVTYPE", "SV")
    if svtype in ("BND", "TRA"):
        try:
            translocation_chrom = variant.info.get("CHR2")
        except:
            translocation_chrom = None
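The same guard can be tried out without a VCF file. The sketch below is only an illustration: a plain dict stands in for `variant.info`, and `get_translocation_chrom` is a hypothetical name that does not appear in samplot.

```python
def get_translocation_chrom(info):
    """Return CHR2 only for breakend/translocation records, else None.

    `info` is a plain dict standing in for `variant.info` (an assumption
    made so the sketch runs without a VCF file).
    """
    svtype = info.get("SVTYPE", "SV")
    if svtype in ("BND", "TRA"):
        return info.get("CHR2")  # None when the tag is absent
    return None


records = [
    {"SVTYPE": "DEL", "CHR2": "chr1"},  # CHR2 present, but not a translocation
    {"SVTYPE": "BND", "CHR2": "chr5"},
    {"SVTYPE": "BND"},                  # BND without a CHR2 tag
]
print([get_translocation_chrom(r) for r in records])
```

Only the second record yields a chromosome, so a DEL that happens to carry CHR2 is no longer treated as a translocation.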

Hope this helps to solve the problem for others, too. Johannes

johannesgeibel, Oct 16 '20 13:10

I've added changes similar to these in v1.0.20. Please lmk if that version solves your issues and thanks for pointing them out!

jbelyeu, Oct 21 '20 14:10

Thanks a lot. Yes, they initially solved my problem, but I ran into a few more over the last days.

First, I also wanted to see the location on CHR2 for BNDs. I therefore needed to extract the position on CHR2 from the ALT field and use it to set start2 and end2. However, this only works if the ALT field is correctly encoded, which is probably not the case for all callers (even though it should be, according to the VCF format specification). Note that my fix only works for contigs named "chr..".

The other problem was that I had some BNDs mapping directly at the beginning of CHR2, so the CIEND interval would have started before the start of the chromosome. (Of course, this is something that should be filtered out anyway. It is probably due to strand switches and to SURVIVOR, which produced my consensus VCF, encoding the CIEND for the "-" strand in these cases. I couldn't figure it out yet.) This caused samplot plot to crash somewhere in matplotlib. I therefore set v[0] to zero in those cases.

I'm not sure whether you want (and have an idea how) to include those things in a generalized way in the code. I have therefore attached my complete modified samplot_vcf.py so that you can check through it.
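As a rough sketch of the ALT parsing described above (this is not the code from the attached file): the VCF specification writes the mate of a breakend as `chrom:pos` enclosed in a pair of square brackets, for example `N]chr2:321682]`. The hypothetical helper `parse_bnd_mate` below relies only on that bracket syntax, so under this assumption it does not depend on a "chr" prefix.

```python
import re

# Mate locus inside a breakend ALT: bracket, contig, colon, position, bracket.
_BND_MATE = re.compile(r"[\[\]](.+):(\d+)[\[\]]")


def parse_bnd_mate(alt):
    """Return (chrom, pos) of the mate breakend, or None if ALT is not a
    spec-conformant breakend string. Hypothetical helper for illustration."""
    match = _BND_MATE.search(alt)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


for alt in ("N]chr2:321682]", "[13:123457[C", "<DEL>"):
    print(alt, parse_bnd_mate(alt))
```

ALT strings that do not follow the bracket syntax return `None`, so a caller can skip the record instead of guessing a position.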

samplot_vcf.zip

johannesgeibel, Oct 22 '20 13:10

I'll include a catch for cases where the confidence interval is outside the genomic range in another release soon. Generally we don't want to support incorrectly formatted VCFs as that opens a giant can of worms. I'd recommend raising an issue on the tool that created the VCF with issues (SURVIVOR).
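One way such a catch could look, as a sketch and not what samplot actually ships: `clamp_ci` is a hypothetical function that assumes the confidence interval is given as offsets relative to the breakpoint position (as CIPOS and CIEND are in the VCF specification) and that 0 is the smallest plottable coordinate. It clips the interval to the contig, which differs slightly from setting the lower offset to zero as described earlier in the thread.

```python
def clamp_ci(pos, ci, chrom_length=None):
    """Clip a (lower, upper) confidence-interval offset pair to the contig.

    pos          -- breakpoint coordinate
    ci           -- offsets relative to pos, e.g. (-50, 30); strings allowed
    chrom_length -- optional contig length; the upper bound is left
                    untouched when it is None
    """
    lower, upper = int(ci[0]), int(ci[1])
    lower = max(lower, -pos)  # pos + lower must not fall below 0
    if chrom_length is not None:
        upper = min(upper, chrom_length - pos)  # pos + upper stays on contig
    return lower, upper


print(clamp_ci(20, (-50, 30)))         # interval would start before the contig
print(clamp_ci(990, ("-10", "40"), 1000))  # interval would run past the end
```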

jbelyeu, Oct 22 '20 17:10