One region in Trio has very slow genotyping
Using a large library of breakpoints, I submitted one job per chromosome for the GIAB HG002 sample, each with 8 threads. Most chromosomes completed in under 8 hours. The exception is chromosome 4, which has stalled in one region and is still running after 2 more days. The same thing happened with the two parents.
Any suggestions on this?
-rw-r--r-- 1 farrell casa 726 Jul 7 01:20 workarea/HG002/chr4/048000001-049000000.vcf.gz.tbi
-rw-r--r-- 1 farrell casa 2.8M Jul 7 01:20 workarea/HG002/chr4/048000001-049000000.vcf.gz
-rw-r--r-- 1 farrell casa 664 Jul 5 14:56 workarea/HG002/chr4/047000001-048000000.vcf.gz.tbi
-rw-r--r-- 1 farrell casa 2.0M Jul 5 14:56 workarea/HG002/chr4/047000001-048000000.vcf.gz
Hi,
I ran into this problem before as well: all chromosomes finished genotyping except one. I split the variants and ran graphtyper on each subset separately; maybe you can try that. You can also check the read depth within this region using samtools tview.
Best, Zhuqing
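As a complement to browsing with samtools tview, the depth check suggested above can be sketched in Python. This is only an illustration: it assumes you have saved the text output of `samtools depth` (tab-separated contig, position, depth) for the slow region, and the window size, the 5x-median cutoff, and the function names are arbitrary choices of mine, not anything graphtyper uses.

```python
from collections import defaultdict
from statistics import median


def windowed_mean_depth(depth_lines, window=10_000):
    """Mean depth per fixed-size window from `samtools depth` text output.

    Each line is expected to be: contig <TAB> 1-based position <TAB> depth.
    Positions that samtools did not report are simply not counted, so the
    mean is over reported positions only.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        chrom, pos, depth = line.split("\t")[:3]
        # 1-based start coordinate of the window containing this position.
        start = (int(pos) - 1) // window * window + 1
        totals[(chrom, start)] += int(depth)
        counts[(chrom, start)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def flag_high_depth(means, factor=5.0):
    """Return windows whose mean depth exceeds `factor` times the median window mean."""
    if not means:
        return []
    cutoff = factor * median(means.values())
    return sorted(key for key, value in means.items() if value > cutoff)


if __name__ == "__main__":
    # Hypothetical input file, produced with something like:
    #   samtools depth -r chr4:48000001-49000000 HG002.bam > depth.txt
    # (the BAM name and the "chr4" contig naming are assumptions).
    import os
    if os.path.exists("depth.txt"):
        with open("depth.txt") as handle:
            for chrom, start in flag_high_depth(windowed_mean_depth(handle)):
                print(f"{chrom}:{start} has unusually high depth")
```

Windows flagged this way are candidates for closer inspection in tview, or for genotyping separately from the rest of the region.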
What type of site should I look for that may cause the high CPU usage? There are quite a few variants in the region.
Is it high DP, the number of multiallelic variants at a site, or large SV size?
I would think this is happening in regions that have very high alignment depth. In earlier versions of graphtyper SV genotyping, we had a high-coverage downsampling filter, but we later found out it was having a bad effect on quality, so we turned it off.
I realize this is problematic, so I will experiment to see whether we can re-enable the filter but make it less aggressive than before.
Best, Hannes
@hannespetur Has there been any progress on testing a downsampling filter that is less aggressive?
I see that issue #58 describes the downsampling filter being tested using --avg_cov_by_readlen.
The --avg_cov_by_readlen option to subsample reads has been added to graphtyper v2.6.1.
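For anyone preparing input for this option, here is a rough Python sketch of how a per-sample value might be derived. It assumes, from the option name only, that the expected number is average coverage divided by read length; please check the graphtyper documentation for the exact input format before relying on it. Under that assumption, average coverage is (mapped reads x read length) / reference length, so dividing by read length leaves mapped reads per reference base, which can be computed from `samtools idxstats` output. The function name is hypothetical.

```python
def reads_per_reference_base(idxstats_lines):
    """Mapped reads divided by total reference length.

    Input lines follow the `samtools idxstats` layout:
    contig <TAB> length <TAB> mapped reads <TAB> unmapped reads.
    The trailing "*" line (reads with no contig) is skipped.
    """
    total_length = 0
    total_mapped = 0
    for line in idxstats_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] == "*":
            continue
        total_length += int(fields[1])
        total_mapped += int(fields[2])
    if total_length == 0:
        raise ValueError("no reference sequences found in idxstats input")
    return total_mapped / total_length


if __name__ == "__main__":
    # Hypothetical usage: samtools idxstats HG002.bam | python this_script.py
    import sys
    if not sys.stdin.isatty():
        print(reads_per_reference_base(sys.stdin))
```

This counts only mapped reads over all contigs listed in the index; whether unmapped reads or particular contigs should be included is a judgment call that this sketch does not settle.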