DataLoss Error with Tensorflow

Open yangyxt opened this issue 3 years ago • 7 comments

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.4/docs/FAQ.md: Yes

Describe the issue: At the call_variants.py step, the run fails with the error tensorflow.python.framework.errors_impl.DataLossError: truncated record at 19179998357' failed with EOF reached

Setup

  • Operating system: CentOS7
  • DeepVariant version: 1.4.0
  • Installation method (Docker, built from source, etc.): singularity run with SIF image pulled from docker://google/deepvariant:"${BIN_VERSION}"
  • Type of data: (sequencing instrument: BGI, reference genome: hg19, anything special that is unlike the case studies?)

Steps to reproduce:

  • Command:
  • singularity run \
      -B "/paedyl01/disk1/yangyxt,/usr/lib/locale:/usr/lib/locale,/tmp:/paedyl01/disk1/yangyxt/test_tmp" \
      --workdir /paedyl01/disk1/yangyxt \
      ${image} \
      /opt/deepvariant/bin/run_deepvariant \
      --model_type=${model_type} \
      --ref="${ref_genome}" \
      --reads="${bam_file}" \
      ${region_arg} \
      --output_vcf="${output_vcf}" \
      --output_gvcf="${output_gvcf}" \
      --intermediate_results_dir "/paedyl01/disk1/yangyxt/test_tmp" \
      --num_shards=${threads} && \
    ls -lh ${output_vcf} && \
    ls -lh ${output_gvcf}
  • Error trace: (if applicable)
  • `***** Running the command:***** time /opt/deepvariant/bin/call_variants --outfile "/paedyl01/disk1/yangyxt/test_tmp/call_variants_output.tfrecord.gz" --examples "/paedyl01/disk1/yangyxt/test_tmp/[email protected]" --checkpoint "/opt/models/wgs/model.ckpt" --openvino_model_dir "/paedyl01/disk1/yangyxt/test_tmp"

I0826 20:44:28.894064 47737984214848 call_variants.py:317] From /paedyl01/disk1/yangyxt/test_tmp/make_examples.tfrecord-00000-of-00014.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19]. I0826 20:44:28.898550 47737984214848 call_variants.py:317] From /opt/models/wgs/model.ckpt.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19]. 2022-08-26 20:44:28.903729: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2022-08-26 20:44:28.905866: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 3. Tune using inter_op_parallelism_threads for best performance. WARNING:tensorflow:Using temporary folder as model directory: /tmp/pbs.1173981.omics/tmpag6nq5vt W0826 20:44:28.952679 47737984214848 estimator.py:1864] Using temporary folder as model directory: /tmp/pbs.1173981.omics/tmpag6nq5vt INFO:tensorflow:Using config: {'_model_dir': '/tmp/pbs.1173981.omics/tmpag6nq5vt', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': , '_keep_checkpoint_max': 100000, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, 
'_num_worker_replicas': 1} I0826 20:44:28.953302 47737984214848 estimator.py:202] Using config: {'_model_dir': '/tmp/pbs.1173981.omics/tmpag6nq5vt', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': , '_keep_checkpoint_max': 100000, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1} I0826 20:44:28.953605 47737984214848 call_variants.py:446] Writing calls to /paedyl01/disk1/yangyxt/test_tmp/call_variants_output.tfrecord.gz INFO:tensorflow:Calling model_fn. I0826 20:44:29.309295 47737984214848 estimator.py:1173] Calling model_fn. /usr/local/lib/python3.8/dist-packages/tf_slim/layers/layers.py:1083: UserWarning: layer.apply is deprecated and will be removed in a future version. Please use layer.__call__ method instead. outputs = layer.apply(inputs) /usr/local/lib/python3.8/dist-packages/tf_slim/layers/layers.py:678: UserWarning: layer.apply is deprecated and will be removed in a future version. Please use layer.__call__ method instead. outputs = layer.apply(inputs, training=is_training) /usr/local/lib/python3.8/dist-packages/tf_slim/layers/layers.py:2441: UserWarning: layer.apply is deprecated and will be removed in a future version. Please use layer.__call__ method instead. outputs = layer.apply(inputs) /usr/local/lib/python3.8/dist-packages/tf_slim/layers/layers.py:118: UserWarning: layer.apply is deprecated and will be removed in a future version. Please use layer.__call__ method instead. 
outputs = layer.apply(inputs) /usr/local/lib/python3.8/dist-packages/tf_slim/layers/layers.py:1638: UserWarning: layer.apply is deprecated and will be removed in a future version. Please use layer.__call__ method instead. outputs = layer.apply(inputs, training=is_training) INFO:tensorflow:Done calling model_fn. I0826 20:44:33.173107 47737984214848 estimator.py:1175] Done calling model_fn. INFO:tensorflow:Graph was finalized. I0826 20:44:34.048544 47737984214848 monitored_session.py:247] Graph was finalized. INFO:tensorflow:Restoring parameters from /opt/models/wgs/model.ckpt I0826 20:44:34.048974 47737984214848 saver.py:1399] Restoring parameters from /opt/models/wgs/model.ckpt INFO:tensorflow:Running local_init_op. I0826 20:44:34.790676 47737984214848 session_manager.py:531] Running local_init_op. INFO:tensorflow:Done running local_init_op. I0826 20:44:34.816158 47737984214848 session_manager.py:534] Done running local_init_op. INFO:tensorflow:Reloading EMA... I0826 20:44:35.138201 47737984214848 modeling.py:418] Reloading EMA... 
INFO:tensorflow:Restoring parameters from /opt/models/wgs/model.ckpt I0826 20:44:35.138464 47737984214848 saver.py:1399] Restoring parameters from /opt/models/wgs/model.ckpt I0826 20:44:37.459590 47737984214848 call_variants.py:462] Processed 1 examples in 1 batches [849.953 sec per 100] I0826 20:46:44.255733 47737984214848 call_variants.py:462] Processed 50001 examples in 98 batches [0.271 sec per 100] I0826 20:48:48.666643 47737984214848 call_variants.py:462] Processed 100001 examples in 196 batches [0.260 sec per 100] I0826 20:50:53.094218 47737984214848 call_variants.py:462] Processed 150001 examples in 293 batches [0.256 sec per 100] I0826 20:52:58.984037 47737984214848 call_variants.py:462] Processed 200001 examples in 391 batches [0.255 sec per 100] I0826 20:55:03.618282 47737984214848 call_variants.py:462] Processed 250001 examples in 489 batches [0.254 sec per 100] I0826 20:57:06.583475 47737984214848 call_variants.py:462] Processed 300001 examples in 586 batches [0.253 sec per 100] I0826 20:59:10.820679 47737984214848 call_variants.py:462] Processed 350001 examples in 684 batches [0.252 sec per 100] I0826 21:01:15.474886 47737984214848 call_variants.py:462] Processed 400001 examples in 782 batches [0.252 sec per 100] I0826 21:03:18.836436 47737984214848 call_variants.py:462] Processed 450001 examples in 879 batches [0.251 sec per 100] I0826 21:05:24.652524 47737984214848 call_variants.py:462] Processed 500001 examples in 977 batches [0.251 sec per 100] I0826 21:07:30.681700 47737984214848 call_variants.py:462] Processed 550001 examples in 1075 batches [0.251 sec per 100] I0826 21:09:35.367410 47737984214848 call_variants.py:462] Processed 600001 examples in 1172 batches [0.251 sec per 100] I0826 21:11:41.218489 47737984214848 call_variants.py:462] Processed 650001 examples in 1270 batches [0.251 sec per 100] I0826 21:13:47.358545 47737984214848 call_variants.py:462] Processed 700001 examples in 1368 batches [0.251 sec per 100] I0826 21:15:52.436908 
47737984214848 call_variants.py:462] Processed 750001 examples in 1465 batches [0.251 sec per 100] I0826 21:17:58.339728 47737984214848 call_variants.py:462] Processed 800001 examples in 1563 batches [0.251 sec per 100] I0826 21:20:07.519950 47737984214848 call_variants.py:462] Processed 850001 examples in 1661 batches [0.252 sec per 100] I0826 21:22:14.806241 47737984214848 call_variants.py:462] Processed 900001 examples in 1758 batches [0.252 sec per 100] I0826 21:24:23.524628 47737984214848 call_variants.py:462] Processed 950001 examples in 1856 batches [0.252 sec per 100] Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1380, in _do_call return fn(*args) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1363, in _run_fn return self._call_tf_sessionrun(options, feed_dict, fetch_list, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1456, in _call_tf_sessionrun return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, tensorflow.python.framework.errors_impl.DataLossError: truncated record at 19179998357' failed with EOF reached [[{{node IteratorGetNext}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 513, in tf.compat.v1.app.run() File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/absl_py/absl/app.py", line 300, in run _run_main(main, args) File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/absl_py/absl/app.py", line 251, in _run_main sys.exit(main(argv)) File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 494, in main call_variants( File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 453, in call_variants prediction = next(predictions) File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 642, in predict preds_evaluated = mon_sess.run(predictions) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 786, in run return self._sess.run( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1315, in run return self._sess.run( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1420, in run raise six.reraise(*original_exc_info) File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/six_archive/six.py", line 703, in reraise raise value File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1405, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1473, in run outputs = _WrappedSession.run( File 
"/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/monitored_session.py", line 1236, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 970, in run result = self._run(None, fetches, feed_dict, options_ptr, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1193, in _run results = self._do_run(handle, final_targets, final_fetches, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1373, in _do_run return self._do_call(_run_fn, feeds, fetches, targets, options, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1399, in _do_call raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter tensorflow.python.framework.errors_impl.DataLossError: truncated record at 19179998357' failed with EOF reached [[node IteratorGetNext (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/util.py:60) ]]

Errors may have originated from an input operation. Input Source operations connected to node IteratorGetNext: In[0] IteratorV2 (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/util.py:58)

Operation defined at: (most recent call last)

File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 513, in tf.compat.v1.app.run()

File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/absl_py/absl/app.py", line 300, in run _run_main(main, args)

File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/absl_py/absl/app.py", line 251, in _run_main sys.exit(main(argv))

File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 494, in main call_variants(

File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 453, in call_variants prediction = next(predictions)

File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 621, in predict features, input_hooks = self._get_features_from_input_fn(

File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1019, in _get_features_from_input_fn result, _, hooks = estimator_util.parse_input_fn_result(result)

File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/util.py", line 60, in parse_input_fn_result result = iterator.get_next()

Original stack trace for 'IteratorGetNext': File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 513, in tf.compat.v1.app.run() File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/absl_py/absl/app.py", line 300, in run _run_main(main, args) File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/absl_py/absl/app.py", line 251, in _run_main sys.exit(main(argv)) File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 494, in main call_variants( File "/tmp/pbs.1173981.omics/Bazel.runfiles_pfgek2w5/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 453, in call_variants prediction = next(predictions) File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 621, in predict features, input_hooks = self._get_features_from_input_fn( File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1019, in _get_features_from_input_fn result, _, hooks = estimator_util.parse_input_fn_result(result) File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/util.py", line 60, in parse_input_fn_result result = iterator.get_next() File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 444, in get_next flat_ret = gen_dataset_ops.iterator_get_next( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2865, in iterator_get_next _, _, _op, _outputs = _op_def_library._apply_op_helper( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper op = g._create_op_internal(op_type_name, inputs, dtypes=None, File 
"/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3697, in _create_op_internal ret = Operation( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 2101, in init self._traceback = tf_stack.extract_stack_for_node(self._c_op)

real 41m45.880s
user 1063m44.358s
sys 25m21.900s
INFO: Cleaning up image...
ERROR: failed to delete container image tempDir /tmp/pbs.1173981.omics/rootfs-2853380811: unlinkat /tmp/pbs.1173981.omics/rootfs-2853380811/tmp-rootfs-1307439201/opt/traps/lib/libmodule64.so: permission denied
singularity/3.10.0 is unloaded

  • `

Does the quick start test work on your system? Please test with https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-quick-start.md.

Is there any way to reproduce the issue by using the quick start? No

Any additional context: The input BAM file seems valid. I checked it with the samtools quickcheck -v command.

yangyxt avatar Sep 06 '22 07:09 yangyxt

@yangyxt how large is the bam file you are working with? Is it possible to connect to the singularity image while it is running and run samtools quickcheck -v?

danielecook avatar Sep 06 '22 18:09 danielecook

> @yangyxt how large is the bam file you are working with? Is it possible to connect to the singularity image while it is running and run samtools quickcheck -v?

Thank you for the response, and sorry for the late reply. I tried it, and samtools quickcheck returned success: (image)

yangyxt avatar Sep 19 '22 04:09 yangyxt

> @yangyxt how large is the bam file you are working with? Is it possible to connect to the singularity image while it is running and run samtools quickcheck -v?

I found a thread on the TensorFlow GitHub page that started in 2017 and ran until 2021. I haven't used TensorFlow, so I can't follow the thread. For your information only: https://github.com/tensorflow/tensorflow/issues/13463

yangyxt avatar Sep 19 '22 04:09 yangyxt

@yangyxt was this resolved? From the original error message, it seems to me that the input to call_variants was truncated, which means that your make_examples run might not have fully succeeded. Another possibility: if you happen to have multiple make_examples runs writing to and overwriting the same files, the output of make_examples may be corrupted, which would cause the call_variants step to error out.

pichuan avatar Sep 19 '22 17:09 pichuan
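One way to test the truncation hypothesis above is to run a gzip integrity check over the make_examples shards before call_variants starts. The following is a minimal sketch that is not part of the original thread; the function name check_shards is hypothetical, and the glob assumes the shard naming seen in the log (make_examples.tfrecord-00000-of-00014.gz). It only detects damage at the gzip level, so a shard that passes can still contain a malformed TFRecord.

```shell
#!/usr/bin/env bash
# check_shards DIR
# Prints one line per make_examples shard that fails `gzip -t`
# and returns non-zero if any shard failed.
# Assumption: shards are named make_examples.tfrecord-*.gz inside DIR.
check_shards() {
  local dir="$1" bad=0 f
  for f in "$dir"/make_examples.tfrecord-*.gz; do
    [ -e "$f" ] || continue            # glob matched nothing
    if ! gzip -t "$f" 2>/dev/null; then
      echo "corrupt or truncated: $f"
      bad=1
    fi
  done
  return "$bad"
}

# Usage sketch (path taken from the command in this issue):
#   check_shards "/paedyl01/disk1/yangyxt/test_tmp" || echo "re-run make_examples"
```

If any shard is reported, the intermediate directory was probably written incompletely or by more than one run, and regenerating the examples is likely safer than retrying call_variants on the same files.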

@pichuan Thanks for the response. I haven't resolved this. I do use GNU parallel to run 3 deep-variant docker images in parallel. But the input/output files for each process are different from each other. The only common directory that has parallel writing operations under it is the $TMPDIR or $SINGULARITY_CACHEDIR. Should I make the TMPDIR/SINGULARITY_CACHEDIR unique for each singularity run?

Here is the original singularity command:

export SINGULARITY_CACHEDIR="/paedyl01/disk1/yangyxt/test_tmp"

singularity run \
-B "/paedyl01/disk1/yangyxt,/usr/lib/locale" \
--env LANG="en_US.UTF-8" \
--env LC_ALL="C" \
--env LANGUAGE="en_US.UTF-8" \
--env LC_CTYPE="UTF-8" \
--env TMPDIR="/paedyl01/disk1/yangyxt/test_tmp" \
--env SINGULARITY_CACHEDIR="/paedyl01/disk1/yangyxt/test_tmp" \
--home "/paedyl01/disk1/yangyxt/home:/home" \
--workdir /paedyl01/disk1/yangyxt \
--contain \
${container} \
/opt/deepvariant/bin/run_deepvariant \
--model_type=${model_type} \
--ref="${ref_genome}" \
--reads="${bam_file}" \
${region_arg} \
--output_vcf="${output_vcf}" \
--output_gvcf="${output_gvcf}" \
--intermediate_results_dir "/paedyl01/disk1/yangyxt/test_tmp" \
--num_shards=${threads}

yangyxt avatar Sep 21 '22 05:09 yangyxt

I'm not very familiar with SINGULARITY_CACHEDIR. But, in your command, if you're running it 3 times, you should use a different --intermediate_results_dir. Output of make_examples will be written to that directory. So, if you use the same intermediate_results_dir, that might explain why your data is corrupted.

pichuan avatar Sep 21 '22 05:09 pichuan

> I'm not very familiar with SINGULARITY_CACHEDIR. But, in your command, if you're running it 3 times, you should use a different --intermediate_results_dir. Output of make_examples will be written to that directory. So, if you use the same intermediate_results_dir, that might explain why your data is corrupted.

Thank you. I will try it and report back.

yangyxt avatar Sep 21 '22 06:09 yangyxt

I set up a temporary folder with a randomly generated ID in its name for each run, and the issue no longer occurs. Thank you!

yangyxt avatar Oct 06 '22 15:10 yangyxt
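The fix reported in this thread, a separate intermediate directory for every concurrent run, could look roughly like the sketch below. It is an illustration rather than the poster's actual script: the helper name make_run_dir and the dv_run prefix are hypothetical, and mktemp -d is just one way to get a unique name.

```shell
#!/usr/bin/env bash
# make_run_dir BASE
# Creates a fresh, uniquely named directory under BASE and prints its path.
# mktemp replaces the trailing X characters with random characters, so
# concurrent callers never receive the same directory.
make_run_dir() {
  local base="$1"
  mkdir -p "$base"
  mktemp -d "$base/dv_run.XXXXXXXX"
}

# Usage sketch, one call per DeepVariant run (base path from this issue):
#   run_dir=$(make_run_dir "/paedyl01/disk1/yangyxt/test_tmp")
#   ... run_deepvariant ... --intermediate_results_dir "${run_dir}" ...
```

With this in place, each job started by GNU parallel writes its make_examples shards into its own directory, so no two runs can overwrite each other's files.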