deepvariant Varying the channels used to call variants

Describe the issue: I previously used the following PopVCF model.ckpt with run_deepvariant v.1.1 while including a PopVCF channel during make_examples. However, that model does not include a channel for insert_size as their work predates v1.4.

With the default extra channel for 'insert_size' in v1.4, and make_examples having numerous options to include additional channels:

--[no]use_allele_frequency: If True, add another channel for pileup images to represent allele frequency information gathered from population call sets.
    (default: 'false')
--[no]add_hp_channel: If true, add another channel to represent HP tags per read.
    (default: 'false')
--channels: Comma-delimited list of optional channels to add. Available Channels: read_mapping_percent,avg_base_quality,identity,gap_compressed_identity,gc_content,is_homopolymer,homopolymer_weighted,blank,insert_size

Are there model-ckpt files for these channel options available somewhere to provide call_variants via:

--checkpoint: Required. Path to the TensorFlow model checkpoint to use to evaluate candidate variant calls.

If so, do they include one additional channel or permutations of multiple channels?

If not, is there an alternative way to have run_deepvariant use different channels than what the default checkpoint contains during call_variants? For example, I am currently unable to include both insert_size and allele_frequency with v1.4

Setup

Operating system:
DeepVariant version: v1.4
Installation method (Docker, built from source, etc.): Singularity
Type of data: WGS

Steps to reproduce:

Command:

time singularity run -B '/usr/lib/locale/:/usr/lib/locale/,/path/to/region_files/:/region_dir/,/path/to/container/deep-variant/:/run_dir/,/path/to/output/:/path/to/reference_genome/:/ref_dir/,/path/to/bam_files/:/bam_dir/,/path/to/population_vcf/:/popVCF_dir/' 

  deepvariant_1.4.0.sif   
      /opt/deepvariant/bin/run_deepvariant 
      --model_type=WGS
      --ref='/ref_dir/reference.fa' 
      --reads='/bam_dir/id.bam' 
      --output_vcf='/out_dir/test1.vcf.gz' 
      --intermediate_results_dir='/out_dir/tmp/test1/' 
      --num_shards='39' 
      --make_examples_extra_args="use_allele_frequency=true,population_vcfs=/popVCF_dir/UMAG1.POP.FREQ.vcf.gz" 
      --regions=/region_dir/regions_to_test.bed

Error trace: (if applicable)

***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "/out_dir/tmp/test1/call_variants_output.tfrecord.gz" --examples "/out_dir/tmp/test1/[email protected]" --checkpoint "/opt/models/wgs/model.ckpt" --openvino_model_dir "/out_dir/tmp/test1/"

I0919 17:19:47.185331 46912500266816 call_variants.py:317] From /out_dir/tmp/test1/make_examples.tfrecord-00000-of-00039.gz.example_info.json: Shape of input examples: [100, 221, 8], Channels of input examples: [1, 2, 3, 4, 5, 6, 8, 19].
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_l3__pco1/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 513, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/tmp/Bazel.runfiles_l3__pco1/runfiles/absl_py/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_l3__pco1/runfiles/absl_py/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_l3__pco1/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 494, in main
    call_variants(
  File "/tmp/Bazel.runfiles_l3__pco1/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 363, in call_variants
    raise ValueError('The number of channels in examples and checkpoint '
ValueError: The number of channels in examples and checkpoint should match, but the checkpoint has 7 channels while the examples have 8.

real    0m3.217s
user    0m4.066s
sys     0m4.174s

real    77m45.059s
user    2960m49.979s
sys     39m40.911s```

Sep 19 '22 18:09 jkalleberg

Hi @jkalleberg , Thank you for reaching out.

The model ckpt you're using was older than v1.4.

And you're right: in v1.4 we added an extra channel, and we haven't trained a new allele frequency model with v1.4. So we actually don't yet have a model that has both insert_size as well allele_frequency!

Two things:

If you want to run v1.4.0 code with the older model (which didn't have the insert_size channel), you can add ,channels='' to the end of your make_examples_extra_args. I added a section to my public gist here: https://gist.github.com/pichuan/64d73bc965300645470eb29a66116593#update-if-youre-using-our-v140-docker-codebase
I'm currently training a new WGS AF model that will have both the insert_size channel, as well as the allele_frequency channel. So, stay tuned! I can give you an update when I have it.

Sep 19 '22 22:09 pichuan

@pichuan Thank you for the quick reply. That's a good tip to bypass the insert_size channel by default. Looking forward to your update, much appreciated.

If I understand correctly, each additional channel requires building a model using examples containing those channels. As a hypothetical example, to use insert_size + allele_frequency + avg_base_quality during variant calling would require re-training, correct?

I'm trying to understand if additional channels are mutually exclusive choices for the 7th channel when using the run_deepvariant command. My first attempt at running with both insert_size + allele_frequency seemed to work. However, it produced examples with channels [1, 2, 3, 4, 5, 6, 19] instead of [1, 2, 3, 4, 5, 6, 8, 19]. I would have expected an error, yet call_variants produced a vcf output despite not having an insert_size channel. Did allele_frequency replace the 7th channel correctly? Or did it somehow encode allele_frequency data as insert_size, if that makes sense?

Sep 19 '22 22:09 jkalleberg

@pichuan Thank you for the quick reply. That's a good tip to bypass the insert_size channel by default. Looking forward to your update, much appreciated.

If I understand correctly, each additional channel requires building a model using examples containing those channels. As a hypothetical example, to use insert_size + allele_frequency + avg_base_quality during variant calling would require re-training, correct?

Yes. The make_examples stage will need to create examples that are consistent with the model.ckpt used in call_variants. Because the model.ckpt was already trained with a specific list of channels and shape.

So, if you create different examples - even if it's just to remove a channel, you're suppose to retrain on examples that are made with that channel removed as well.

I'm trying to understand if additional channels are mutually exclusive choices for the 7th channel when using the run_deepvariant command. My first attempt at running with both insert_size + allele_frequency seemed to work. However, it produced examples with channels [1, 2, 3, 4, 5, 6, 19] instead of [1, 2, 3, 4, 5, 6, 8, 19]. I would have expected an error, yet call_variants produced a vcf output despite not having an insert_size channel. Did allele_frequency replace the 7th channel correctly? Or did it somehow encode allele_frequency data as insert_size, if that makes sense?

If you made examples with 8 channels, but the model has 7 channels, the call_variants step should have errors like you've shared in your original post:

ValueError: The number of channels in examples and checkpoint should match, but the checkpoint has 7 channels while the examples have 8.

If you make examples with 7 channels, but your model.ckpt has different 7 channels - then it depends. Starting from v1.4.0, we keep a model.ckpt.example_info.json file together with the model. For example:

$ gsutil cat gs://deepvariant/models/DeepVariant/1.4.0/DeepVariant-inception_v3-1.4.0+data-wgs_standard/model.ckpt.example_info.json
{"version": "1.4.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}

If the customized_model you're using has this file, and if the examples you're making ends up having 7 channels but different ones, then it should still give you an error. But, if you're using an older model (such as the AF model you're using), and if it didn't have a model.ckpt.example_info.json file with it, then I believe the current behavior is that it will try to use the 7 channels directly. Which will cause the issue that it might use the weights for allele_frequency channel for the insert_size channel.

Does that make sense?

By the way, in case you haven't seen it, we have a nice blog post that talks about channels: https://google.github.io/deepvariant/posts/2022-06-09-adding-custom-channels/

Sep 20 '22 05:09 pichuan

@pichuan thank you for the discussion. And yes, that does make sense. That was my expectation, but I hadn't seen that blog post yet, so again, thanks!

Final question: besides insert_size, do any of these channels listed have model.ckpt files on GCP, like the old PopVCF one I used?

--channels: Comma-delimited list of optional channels to add. Available Channels: read_mapping_percent,avg_base_quality,identity,gap_compressed_identity,gc_content,is_homopolymer,homopolymer_weighted,blank,insert_size

I assume they do not, or they'd be listed here, correct? I want to confirm I'm keeping an eye out in the right place if/when any new checkpoints become available.

Sep 20 '22 14:09 jkalleberg

New model checkpoints associated with new releases will be under gs://deepvariant/models/DeepVariant as you noticed.

I mentioned that starting from v1.4.0, you can see this file:

$ gsutil cat gs://deepvariant/models/DeepVariant/1.4.0/DeepVariant-inception_v3-1.4.0+data-wgs_standard/model.ckpt.example_info.json
{"version": "1.4.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}

The "channels" values are enums. You can look them up in this proto: https://github.com/google/deepvariant/blob/r1.4/deepvariant/protos/deepvariant.proto#L1048

From the example above, it's saying that DeepVariant v1.4.0 WGS model has 7 channels, and they are:

  CH_READ_BASE = 1;
  CH_BASE_QUALITY = 2;
  CH_MAPPING_QUALITY = 3;
  CH_STRAND = 4;
  CH_READ_SUPPORTS_VARIANT = 5;
  CH_BASE_DIFFERS_FROM_REF = 6;
  CH_INSERT_SIZE = 19;

Note that the allele frequency model isn't part of our regular release process yet. It's made public as part of our preprint https://doi.org/10.1101/2021.01.06.425550. Right now, we're retraining it when users request it. We're certainly hoping to see more uses cases (thank you for letting us know!). If it's become more mature, we can consider building it into part of our regular release process. (Adding more regular supports also means more overhead for each release, so we need to balance this carefully.)

Sep 21 '22 05:09 pichuan

Hi @jkalleberg , please see See: https://gist.github.com/pichuan/7ad09bf1fa8f519facf6806eca835ea6

I'll close this issue for now. Feel free to open more issues if you have any questions or feedback for us.

Sep 24 '22 02:09 pichuan