Minimac4 icon indicating copy to clipboard operation
Minimac4 copied to clipboard

Non-adjacent blocks from --compress-reference

Open andrew-slater opened this issue 1 year ago • 3 comments

I'm attempting to replicate the Michigan Imputation Server locally. Since we will be focusing on a few loci in small sample sizes, seems best to just run Eagle and Minimac directly instead of through Cloudgene. I downloaded the latest Minimac4 release (4.1.6) but wanted to sanity check that it produced the same results as the version being reported by the Imputation Server (4-1.0.2).

We're using GRCh38 and I could not find any existing M3VCF/MSAV files so created from the 1000 Genomes release using Minimac3 to create the M3VCF and Minimac4.1.6 to create the MSAV (--compress-reference). I found some discordance in the genotypes imputed from the two Minimac4 versions and, out of curiosity, created a new MSAV from the M3VCF via --update-m3vcf in Minimac4.1.6. Using this M3VCF-sourced MSAV, the imputation results between 4.1.6 and 4-1.0.2 are much more similar. I used savvy to export the 2 MSAV files to VCF and found the difference seems to be in the block structure:

$ zgrep $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAV-from-M3VCF.vcf.gz | cut -f 1-10 | head
1       10416   1:10416 CCCTAA  <BLOCK> .       .       END=62157;VARIANTS=27;REPS=178  UHM     106
1       62157   1:62157 G       <BLOCK> .       .       END=80454;VARIANTS=34;REPS=200  UHM     12
1       80454   1:80454 G       <BLOCK> .       .       END=84139;VARIANTS=25;REPS=180  UHM     0
1       84139   1:84139 A       <BLOCK> .       .       END=89744;VARIANTS=44;REPS=165  UHM     0
1       89744   1:89744 A       <BLOCK> .       .       END=494542;VARIANTS=22;REPS=138 UHM     22
1       494542  1:494542        G       <BLOCK> .       .       END=600536;VARIANTS=26;REPS=125 UHM     0
1       600536  1:600536        A       <BLOCK> .       .       END=638772;VARIANTS=37;REPS=145 UHM     0
1       638772  1:638772        A       <BLOCK> .       .       END=771717;VARIANTS=17;REPS=92  UHM     0
1       771717  1:771717        C       <BLOCK> .       .       END=785674;VARIANTS=58;REPS=103 UHM     11
1       785674  1:785674        T       <BLOCK> .       .       END=790100;VARIANTS=51;REPS=122 UHM     6
$ zgrep $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAVvia--compress-reference.vcf.gz | cut -f 1-10 | head
1       10416   .       CCCTAA  <BLOCK> .       .       END=66264;VARIANTS=30;REPS=206  UHM     0,5
1       66269   .       A       <BLOCK> .       .       END=80454;VARIANTS=30;REPS=183  UHM     0,4
1       82163   .       G       <BLOCK> .       .       END=86028;VARIANTS=30;REPS=274  UHM     0,131
1       86065   .       G       <BLOCK> .       .       END=101550;VARIANTS=50;REPS=204 UHM     0,0
1       120983  .       C       <BLOCK> .       .       END=597799;VARIANTS=30;REPS=239 UHM     0,1
1       597818  .       C       <BLOCK> .       .       END=666203;VARIANTS=40;REPS=203 UHM     0,0
1       668135  .       C       <BLOCK> .       .       END=785203;VARIANTS=70;REPS=243 UHM     0,6
1       785417  .       G       <BLOCK> .       .       END=790082;VARIANTS=50;REPS=124 UHM     0,109
1       790099  .       A       <BLOCK> .       .       END=798782;VARIANTS=50;REPS=115 UHM     0,23
1       798941  .       G       <BLOCK> .       .       END=809630;VARIANTS=50;REPS=118 UHM     0,5

The M3VCF-sourced MSAV has adjacent blocks (the END INFO field is the POS value of the next block) but the MSAV created directly from the genotype VCF does not. And since the blocks are not adjacent, there is no common overlapping variant:

zgrep -C 1 $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAVvia--compress-reference.vcf.gz | cut -f 1-10 | tail -n +5 | head -n 11
1       66264   .       A       T       .       .       AC=29;AN=5096;UHA=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1
1       66269   .       A       <BLOCK> .       .       END=80454;VARIANTS=30;REPS=183  UHM     0,4
1       66269   .       A       T       .       .       AC=25;AN=5096;UHA=0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
--
1       80454   .       G       C       .       .       AC=6;AN=5096;UHA=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
1       82163   .       G       <BLOCK> .       .       END=86028;VARIANTS=30;REPS=274  UHM     0,131
1       82163   .       G       A       .       .       AC=170;AN=5096;UHA=0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
--
1       86028   .       T       C       .       .       AC=195;AN=5096;UHA=0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1
1       86065   .       G       <BLOCK> .       .       END=101550;VARIANTS=50;REPS=204 UHM     0,0
1       86065   .       G       C       .       .       AC=196;AN=5096;UHA=0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0

Perhaps the MSAV format does not require the same block adjacency as the M3VCF format does? But the differing imputation results seem to indicate this differing format has an effect. Is this expected?

andrew-slater avatar Apr 08 '24 02:04 andrew-slater

Minimac v4.0.x expected first variant in a block to be a duplicate of the last variant in the previous block. Minimac v4.1.x does not have this expectation and will filter out the duplicates if they are encountered (https://github.com/statgen/Minimac4/blob/master/src/unique_haplotype.cpp#L522-L524).

What command did you run to generate the b38 M3VCF? I suspect that, if you looked at the non-block records in the M3VCF file, you will see ERR and RECOM INFO fields. These fields will not exist in the "--compress-reference" version. These are parameter estimates that will improve the accuracy of imputation when using 1KG as a reference panel. zgrep -v $'\t<BLOCK' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAV-from-M3VCF.vcf.gz | cut -f 1-9 | head

Otherwise, can you elaborate on the discordance you are seeing? There are expected to be small differences between v4.0.x and v4.1.x.

jonathonl avatar Apr 08 '24 12:04 jonathonl

Thanks for the quick reply Jonathan. Glad to hear this format is not unexpected for MSAV. I created the M3VCF using this command:

Minimac3 --processReference --refHaps ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.vcf.gz --prefix ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons

And I do see the parameter estimates in the M3VCF-sourced MSAV:

$ zgrep -v '^#' ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingltons.MSAV-from-M3VCF.vcf.gz  | cut -f 1-9 | head -n 3
1       10416   1:10416 CCCTAA  <BLOCK> .       .       END=62157;VARIANTS=27;REPS=178  UHM
1       10416   1:10416 CCCTAA  C       .       .       AC=240;AN=5096;ERR=0.0054688;RECOM=0.00050835;UHA=0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1       16103   1:16103 T       G       .       .       AC=118;AN=5096;ERR=0.0071291;RECOM=0.00050835;UHA=0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Is there a reason these are not calculated/included in the MSAV created by --compress-reference?

The first instance of discordance I found was with a TYPED variant where one of the assayed genotypes was overwritten by 4.1.6 using the --compress-reference MSAV but maintained by 4-1.0.2 using the M3VCF and 4.1.6 using the M3VCF-derived MSAV. Since the assayed data is from deep NGS and thus likely accurate (I haven't dug into the underlying read data to inspect further), seems this supports your hypothesis of the improved accuracy from the M3VCF parameter estimates. It is a rarish variant (AF 0.005) which perhaps benefits more from this accuracy improvement?

I'll plan to use the M3VCF with parameter estimates for our imputation jobs.

andrew-slater avatar Apr 08 '24 17:04 andrew-slater

Is there a reason these are not calculated/included in the MSAV created by --compress-reference?

Minimac4 was designed for large reference panels (>100,000 samples) and the parameter estimation is less beneficial at this scale and not tractable.

By the way, you can look at ER2 INFO field for the TYPED sites to see how correlated the imputation dosages are with the assayed genotypes.

jonathonl avatar Apr 08 '24 22:04 jonathonl