Output files are missing after running deepvariant.

Open githubtefo opened this issue 1 year ago • 9 comments

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6.1/docs/FAQ.md:

Describe the issue: After running DeepVariant in a Docker container twice, the output dir, in which I expect the output.g.vcf.gz and output.vcf.gz files, is empty. The /tmp/ folder doesn't contain any intermediate files either.

Setup

  • Operating system: Ubuntu 22.04 LTS
  • DeepVariant version:
  • Installation method (Docker, built from source, etc.): Docker
  • Type of data: (sequencing instrument, reference genome, anything special that is unlike the case studies?) WGS HiFi PacBio

Steps to reproduce:

  • Command: sudo docker run -v /media/USER/Expansion/DATA/hifi_reads:/input -v /home/st/Applications/deepvariant:/reference -v $(pwd)/output:/output google/deepvariant /opt/deepvariant/bin/run_deepvariant --model_type=PACBIO --ref=/reference/Homo_sapiens.GRCh37.dna.primary_assembly.fa --reads=/input/DATA_s1.hifi_reads_sorted.bam --output_vcf=/output/output.vcf.gz --output_gvcf=/output/output.g.vcf.gz --num_shards=$(nproc)
  • Error trace: No errors.

Any additional context: Previously, I used pbmm2 to align and sort my raw BAM file.

githubtefo avatar Apr 19 '24 18:04 githubtefo

@githubtefo, are you providing the raw read inputs to DeepVariant? It requires reads aligned to the reference. Also, please check the logs in the output directory after you run DeepVariant.

kishwarshafin avatar Apr 19 '24 21:04 kishwarshafin

Hi @kishwarshafin, thanks for your reply. I am providing sorted and aligned inputs to DeepVariant, generated by pbmm2. I am not seeing any logs in the output dir, it is completely empty :(

githubtefo avatar Apr 21 '24 15:04 githubtefo

Hi @githubtefo , When you run the command, you should have logs in the terminal. Can you provide those logs?

And, can you try something simple like https://github.com/google/deepvariant/blob/r1.6.1/docs/deepvariant-pacbio-model-case-study.md first? That should certainly have logs when you run it. If not, there's something else wrong in your environment that we need to understand first.
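One way to keep a copy of the terminal logs is to pipe the command through tee. This is a minimal sketch, not part of DeepVariant: the helper name run_logged and the log path are made up for illustration, and the echo line is a stand-in for the real docker run command.

```shell
#!/usr/bin/env bash
# Sketch: save everything a command prints (stdout and stderr) to a log file
# while still showing it in the terminal.
set -o pipefail  # report the wrapped command's failure instead of tee's success

# Hypothetical helper. Usage: run_logged <logfile> <command> [args...]
run_logged() {
  local logfile="$1"
  shift
  "$@" 2>&1 | tee "$logfile"
}

# Harmless stand-in command; replace it with the full docker run command.
run_logged "${TMPDIR:-/tmp}/deepvariant_run.log" echo "stand-in for the docker run command"
```

The resulting file can then be attached to the issue as-is.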

pichuan avatar Apr 21 '24 16:04 pichuan

Sure! Please find the terminal logs attached. In the meantime, I will run the simple case study. Thank you! output.log

githubtefo avatar Apr 21 '24 16:04 githubtefo

Thank you @githubtefo for the log.

The last step certainly looks strange to me:

***** Running the command:*****
time /opt/deepvariant/bin/postprocess_variants --ref "/reference/Homo_sapiens.GRCh37.dna.primary_assembly.fa" --infile "/tmp/tmpba2iuryg/call_variants_output.tfrecord.gz" --outfile "/output/output.vcf.gz" --cpus "16" --gvcf_outfile "/output/output.g.vcf.gz" --nonvariant_site_tfrecord_path "/tmp/tmpba2iuryg/[email protected]"

I0418 11:23:10.756783 140607260936000 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: 15337_Control
2024-04-18 11:23:10.761841: I deepvariant/postprocess_variants.cc:94] Read from: /tmp/tmpba2iuryg/call_variants_output-00000-of-00001.tfrecord.gz
2024-04-18 11:23:39.285004: I deepvariant/postprocess_variants.cc:109] Total #entries in single_site_calls = 8753635
I0418 11:24:21.877126 140607260936000 postprocess_variants.py:1313] CVO sorting took 1.1852648933728536 minutes
I0418 11:24:21.877437 140607260936000 postprocess_variants.py:1316] Transforming call_variants_output to variants.
I0418 11:24:21.877472 140607260936000 postprocess_variants.py:1318] Using 16 CPUs for parallelization of variant transformation.
I0418 11:24:35.259627 140607260936000 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: 15337_Control

real	2m10.376s
user	1m41.800s
sys	0m24.640s

I would expect postprocess_variants to be doing more. I believe @lucasbrambrink is looking into another issue in postprocess_variants related to multiprocessing (https://github.com/google/deepvariant/issues/804), but I'm not sure if this is relevant.

For now, can you try disabling multiprocessing in the postprocess_variants step by setting --cpus 0? If you're using the run_deepvariant one-step script, you can add:

--postprocess_variants_extra_args="cpus=0"

and see if that works for you. Please let us know.

We'll try to make this more robust in the future!
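For reference, the full command from the original report with that flag appended might look like the sketch below. The paths and flags are copied from the report. The DRY_RUN switch and the build_cmd helper are illustrative additions, so by default the script only prints the command.

```shell
#!/usr/bin/env bash
# Sketch: the command from this issue with multiprocessing disabled in
# postprocess_variants only. All other steps still use --num_shards.
DRY_RUN="${DRY_RUN:-1}"  # assumption for illustration: 1 = print only, 0 = run

# Hypothetical helper that fills the CMD array.
build_cmd() {
  CMD=(
    sudo docker run
    -v /media/USER/Expansion/DATA/hifi_reads:/input
    -v /home/st/Applications/deepvariant:/reference
    -v "$(pwd)/output:/output"
    google/deepvariant
    /opt/deepvariant/bin/run_deepvariant
    --model_type=PACBIO
    --ref=/reference/Homo_sapiens.GRCh37.dna.primary_assembly.fa
    --reads=/input/DATA_s1.hifi_reads_sorted.bam
    --output_vcf=/output/output.vcf.gz
    --output_gvcf=/output/output.g.vcf.gz
    --num_shards="$(nproc)"
    --postprocess_variants_extra_args="cpus=0"
  )
}

build_cmd
if [ "$DRY_RUN" = "1" ]; then
  printf '%s\n' "${CMD[*]}"   # show the command without running it
else
  "${CMD[@]}"
fi
```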

pichuan avatar Apr 21 '24 17:04 pichuan

Thanks, I will try that. BTW, how long will it take to run without multiprocessing? It is already 16h.

githubtefo avatar Apr 22 '24 13:04 githubtefo

@githubtefo It might be worth trying a smaller chromosome (like chr20, or try Quick Start) to make sure that things work end-to-end on your machine first. If you use --postprocess_variants_extra_args="cpus=0", only the last step (postprocess_variants) will run without multiprocessing. That step takes < 1hr for our PacBio BAM, as you can see in https://github.com/google/deepvariant/blob/r1.5/docs/metrics.md#runtime-2 (this is from before multiprocessing was used in postprocess_variants).

@githubtefo We understand that speeding up DeepVariant is very important. We're actively making improvements! Thank you for reporting the issue. Our future releases will be better because of your feedback!

pichuan avatar Apr 22 '24 15:04 pichuan

Thanks, @pichuan. Always happy to contribute to the open source community. I ran the command twice with the new argument and in both cases it failed :( The external hard drive where I keep the BAM file got ejected; see output2.log. Also, I noticed that the syslog and kern.log became insanely big (~200GB), leaving me with no extra disk space. Any ideas what might be going on?

githubtefo avatar Apr 30 '24 15:04 githubtefo

Hi @githubtefo, if system resources are limited, I'd suggest splitting the run by chromosome, which is easier.

For example, start with --regions chr1 in your first run, then chr2, and so on.

This way, in each run you'll get a VCF for each chromosome, and at the end you can still have all the calls for the whole genome.

This is a bit manual, but if you do that, you'll also be able to parallelize if you have multiple machines to run on.

You can also start with a smaller chromosome, for example you can start with --regions chr22 and work your way back. That's probably even better because you can make sure the smaller chromosomes work before you try the larger ones.

Let me know if this works!
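The per-chromosome approach could be scripted roughly as below. This is a sketch under several assumptions: the run_contig helper, the DRY_RUN switch, and the per-contig output file names are made up for illustration, and the contig names must match the headers of your reference FASTA (Ensembl references typically use 22 rather than chr22, so check before running). Merging with bcftools concat is one option and is not something the thread prescribes.

```shell
#!/usr/bin/env bash
# Sketch: one DeepVariant run per contig, smallest first, stopping on failure.
DRY_RUN="${DRY_RUN:-1}"  # assumption for illustration: 1 = print only, 0 = run

# Contigs come from the command line; the default list is only an example.
CONTIGS=("$@")
if [ "${#CONTIGS[@]}" -eq 0 ]; then
  CONTIGS=(chr22 chr21 chr20)
fi

# Hypothetical helper: run (or print) DeepVariant restricted to one contig.
run_contig() {
  local contig="$1"
  local cmd=(
    sudo docker run
    -v /media/USER/Expansion/DATA/hifi_reads:/input
    -v /home/st/Applications/deepvariant:/reference
    -v "$(pwd)/output:/output"
    google/deepvariant
    /opt/deepvariant/bin/run_deepvariant
    --model_type=PACBIO
    --ref=/reference/Homo_sapiens.GRCh37.dna.primary_assembly.fa
    --reads=/input/DATA_s1.hifi_reads_sorted.bam
    --regions="${contig}"
    --output_vcf="/output/output.${contig}.vcf.gz"
    --output_gvcf="/output/output.${contig}.g.vcf.gz"
    --num_shards="$(nproc)"
  )
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

for contig in "${CONTIGS[@]}"; do
  run_contig "$contig" || { echo "run for ${contig} failed" >&2; break; }
done

# Afterwards, the per-contig VCFs could be merged in reference order, e.g.:
#   bcftools concat -O z -o output/output.vcf.gz output/output.chr20.vcf.gz ...
```

Because each contig writes its own files, a failed run only needs that one contig repeated, and separate contigs can go to separate machines.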

pichuan avatar Apr 30 '24 15:04 pichuan

Closing for inactivity. Please feel free to reopen if needed.

akolesnikov avatar May 15 '24 19:05 akolesnikov