Example tetraploid and pseudo mixed-ploidy VCFs
I've finally had time to put together example tetraploid and pseudo mixed-ploidy VCFs from some previously published potato data: https://github.com/pfrnz/Example-Tetraploid-Potato-VCF-PRJNA414303
Please note the issue described in the README regarding freebayes' representation of null calls as haploid calls. This is only a problem in sgkit if the VCF contains haploid null calls and the dataset is specified as mixed-ploidy. You can avoid it by using a filtered VCF with all null calls removed.
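To illustrate why a haploid null call matters in a mixed-ploidy dataset, here is a minimal plain-Python sketch. It assumes the convention that missing alleles are stored as -1 and that calls shorter than the maximum ploidy are padded with a fill value of -2, which is my understanding of how sgkit encodes mixed-ploidy genotypes. The helpers `encode_gt` and `call_ploidy` are hypothetical names for this illustration and are not part of sgkit.

```python
MISSING = -1  # allele position exists but was not called
FILL = -2     # padding: no allele at this position (assumed convention)


def encode_gt(gt, max_ploidy=4):
    """Encode a VCF GT string as a fixed-width list of integers."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) > max_ploidy:
        raise ValueError(f"GT {gt!r} has more than {max_ploidy} alleles")
    codes = [MISSING if a == "." else int(a) for a in alleles]
    # Pad shorter calls so every call has the same width.
    return codes + [FILL] * (max_ploidy - len(codes))


def call_ploidy(encoded):
    """Ploidy of a single call: the number of non-fill positions."""
    return sum(1 for code in encoded if code != FILL)


tetraploid_null = encode_gt("./././.")  # null call written with full ploidy
haploid_null = encode_gt(".")           # null call written as haploid

print(tetraploid_null, call_ploidy(tetraploid_null))
print(haploid_null, call_ploidy(haploid_null))
```

Under these assumptions the haploid null call looks like a genuine haploid genotype, so a tetraploid sample with such calls no longer has one consistent ploidy across variants.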
Also worth noting that these VCFs include some small structural variants.
import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr
VCF = "PRJNA414303.CHR5.mixed.filterNullGT.vcf.gz"
ZARR = "PRJNA414303.CHR5.mixed.filterNullGT.zarr"
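vcf_to_zarr(
    input=VCF,
    output=ZARR,
    ploidy=4,
    mixed_ploidy=True,
    max_alt_alleles=10,
    # `fields` lists the extra INFO/FORMAT fields to import;
    # `field_defs` expects a dict of per-field definition overrides.
    fields=[
        "INFO/*",
        "FORMAT/GQ",
        "FORMAT/DP",
        "FORMAT/AD",
        "FORMAT/RO",
    ],
)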
ds = sg.load_dataset(ZARR)
ds = sg.infer_sample_ploidy(ds)
print(ds.sample_ploidy.compute())
<xarray.DataArray 'sample_ploidy' (samples: 38)>
array([4, 4, 4, 4, 2, 4, 2, 4, 4, 4, 2, 2, 4, 4, 2, 4, 2, 4, 2, 4, 2, 4,
       2, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
Dimensions without coordinates: samples
Attributes:
    comment:  Ploidy of each sample calculated from call genotypes across all...
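As a quick sanity check on that result, the inferred ploidies can be tallied. The sketch below copies the values from the printed array into a plain list so that it runs without the dataset.

```python
from collections import Counter

# Values copied from the printed sample_ploidy array above.
sample_ploidy = [
    4, 4, 4, 4, 2, 4, 2, 4, 4, 4, 2, 2, 4, 4, 2, 4, 2, 4, 2, 4, 2, 4,
    2, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
]

# Count how many samples were inferred at each ploidy level.
counts = Counter(sample_ploidy)
for ploidy, n_samples in sorted(counts.items()):
    print(f"ploidy {ploidy}: {n_samples} samples")
# ploidy 2: 10 samples
# ploidy 4: 28 samples
```

With the real dataset, `np.unique(ds.sample_ploidy.values, return_counts=True)` should give the same tally.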