
Example tetraploid and pseudo mixed-ploidy VCFs

Open timothymillar opened this issue 4 years ago • 0 comments

I've finally had time to put together example tetraploid and pseudo mixed-ploidy VCFs from some previously published potato data: https://github.com/pfrnz/Example-Tetraploid-Potato-VCF-PRJNA414303

Note the issue described in the readme regarding freebayes' representation of null calls as haploid calls. This only affects sgkit if the VCF contains haploid null calls and the dataset is specified as mixed-ploidy. You can avoid it by using a filtered VCF with all null calls removed.
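
To make the null-call caveat concrete, here is a minimal sketch of how one might drop records that contain null genotype calls from uncompressed VCF text. It uses only the standard library. The function names (`is_null_call`, `drop_null_call_records`) are hypothetical, and it assumes GT is the first FORMAT key, as the VCF specification requires when GT is present. This is not necessarily how the filtered VCF in the repository was produced.

```python
import re
from typing import Iterable, Iterator


def is_null_call(gt: str) -> bool:
    """True if every allele in a GT string is missing, e.g. '.' or './././.'."""
    alleles = re.split(r"[/|]", gt)
    return all(allele == "." for allele in alleles)


def drop_null_call_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and only those records with no null calls."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns 0-7 are fixed, column 8 is FORMAT, columns 9+ are samples.
        samples = fields[9:]
        # Assumption: GT is the first colon-separated value for each sample.
        if any(is_null_call(sample.split(":")[0]) for sample in samples):
            continue
        yield line


# Made-up records for illustration; the second one has a haploid null call.
records = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "chr05\t100\t.\tA\tT\t50\tPASS\t.\tGT:DP\t0/0/1/1:20\t0/1:12\n",
    "chr05\t200\t.\tG\tC\t50\tPASS\t.\tGT:DP\t.:0\t0/1:9\n",
]
kept = list(drop_null_call_records(records))
print(len(kept))  # 3: two header lines plus the first record
```

Dropping whole records is the bluntest option; it trades some variants for a file in which every remaining call has the expected number of alleles.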

Also worth noting that these VCFs include some small structural variants.

import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr

VCF = "PRJNA414303.CHR5.mixed.filterNullGT.vcf.gz"
ZARR = "PRJNA414303.CHR5.mixed.filterNullGT.zarr"

vcf_to_zarr(
    input=VCF,
    output=ZARR,
    ploidy=4,
    mixed_ploidy=True,
    max_alt_alleles=10,
    fields=[
        "INFO/*",
        "FORMAT/GQ",
        "FORMAT/DP",
        "FORMAT/AD",
        "FORMAT/RO",
    ]
)

ds = sg.load_dataset(ZARR)
ds = sg.infer_sample_ploidy(ds)
print(ds.sample_ploidy.compute())

<xarray.DataArray 'sample_ploidy' (samples: 38)>
array([4, 4, 4, 4, 2, 4, 2, 4, 4, 4, 2, 2, 4, 4, 2, 4, 2, 4, 2, 4, 2, 4,
       2, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
Dimensions without coordinates: samples
Attributes:
    comment:  Ploidy of each sample calculated from call genotypes across all...
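
The mixed-ploidy result, and the reason a haploid null call is troublesome, can be illustrated with a small pure-Python sketch. It assumes the convention documented for `mixed_ploidy=True`, in which calls with fewer alleles than `ploidy` are padded with a non-allele fill value of -2 while a missing allele is -1. The function names are hypothetical and this is not sgkit's implementation. Returning -1 for a sample whose ploidy differs between variants is likewise an assumption of this sketch.

```python
from typing import List

NON_ALLELE = -2  # assumed padding value for absent allele slots
MISSING = -1     # assumed sentinel for a missing (null) allele


def call_ploidy(call: List[int]) -> int:
    """Number of allele slots in a call, counting missing alleles but not padding."""
    return sum(1 for allele in call if allele != NON_ALLELE)


def sample_ploidy(calls_by_variant: List[List[List[int]]]) -> List[int]:
    """Per-sample ploidy, or -1 when a sample's ploidy differs between variants."""
    n_samples = len(calls_by_variant[0])
    result = []
    for s in range(n_samples):
        ploidies = {call_ploidy(variant[s]) for variant in calls_by_variant}
        result.append(ploidies.pop() if len(ploidies) == 1 else -1)
    return result


# Two variants, two samples: a tetraploid and a diploid padded to four slots.
clean = [
    [[0, 0, 1, 1], [0, 1, NON_ALLELE, NON_ALLELE]],
    [[0, 1, 1, 2], [1, 1, NON_ALLELE, NON_ALLELE]],
]
# Same data, but the tetraploid's second call is a haploid null call ('.').
with_haploid_null = [
    clean[0],
    [[MISSING, NON_ALLELE, NON_ALLELE, NON_ALLELE], clean[1][1]],
]

print(sample_ploidy(clean))              # [4, 2]
print(sample_ploidy(with_haploid_null))  # [-1, 2]
```

Under these assumptions, a single haploid null call is enough to make a tetraploid sample look inconsistent, which is why filtering null calls before conversion sidesteps the problem.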

timothymillar, May 31 '21 05:05