Apache Beam job gets killed while running `download_and_prepare` for the big_patent dataset
Description
I was trying to download the big_patent dataset using the following command:
```
python3 -m tensorflow_datasets.scripts.download_and_prepare --datasets='big_patent'
```
It finished downloading and extracting the raw data; however, while preparing the TFDS files it always stopped at some point with no error, only a "Killed" message. The output of the command is as follows:
```
...... Extraction completed...: 0 file [00:00, ? file/s]
I0702 17:20:45.057484 139989019965248 dataset_builder.py:947] Generating split train
I0702 17:20:45.214179 139989019965248 dataset_builder.py:947] Generating split validation
I0702 17:20:45.305060 139989019965248 dataset_builder.py:947] Generating split test
I0702 17:20:45.603289 139989019965248 translations.py:574] ==================== <function annotate_downstream_side_inputs at 0x7f5105097e60> ====================
I0702 17:20:45.604220 139989019965248 translations.py:574] ==================== <function fix_side_input_pcoll_coders at 0x7f5105097f80> ====================
I0702 17:20:45.604702 139989019965248 translations.py:574] ==================== <function lift_combiners at 0x7f5105095050> ====================
I0702 17:20:45.605427 139989019965248 translations.py:574] ==================== <function expand_sdf at 0x7f51050950e0> ====================
I0702 17:20:45.607081 139989019965248 translations.py:574] ==================== <function expand_gbk at 0x7f5105095170> ====================
I0702 17:20:45.608013 139989019965248 translations.py:574] ==================== <function sink_flattens at 0x7f5105095290> ====================
I0702 17:20:45.608685 139989019965248 translations.py:574] ==================== <function greedily_fuse at 0x7f5105095320> ====================
I0702 17:20:45.610882 139989019965248 translations.py:574] ==================== <function read_to_impulse at 0x7f51050953b0> ====================
I0702 17:20:45.611124 139989019965248 translations.py:574] ==================== <function impulse_to_input at 0x7f5105095440> ====================
I0702 17:20:45.611404 139989019965248 translations.py:574] ==================== <function sort_stages at 0x7f5105095680> ====================
I0702 17:20:45.611575 139989019965248 translations.py:574] ==================== <function setup_timer_mapping at 0x7f51050955f0> ====================
I0702 17:20:45.612018 139989019965248 translations.py:574] ==================== <function populate_data_channel_coders at 0x7f5105095710> ====================
I0702 17:20:45.616966 139989019965248 statecache.py:154] Creating state cache with size 100
I0702 17:20:45.617365 139989019965248 worker_handlers.py:853] Created Worker handler <apache_beam.runners.portability.fn_api_runner.worker_handlers.EmbeddedWorkerHandler object at 0x7f5105073e10> for environment urn: "beam:env:embedded_python:v1"
I0702 17:20:45.617575 139989019965248 fn_runner.py:474] Running (((ref_AppliedPTransform_test/ReadTextIO/Read/_SDFBoundedSourceWrapper/Impulse_102)+(test/ReadTextIO/Read/_SDFBoundedSourceWrapper/ParDo(SDFBoundedSourceDoFn)/PairWithRestriction))+(test/ReadTextIO/Read/_SDFBoundedSourceWrapper/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction))+(ref_PCollection_PCollection_67_split/Write)
I0702 17:20:45.694091 139989019965248 fn_runner.py:474] Running (((((ref_PCollection_PCollection_67_split/Read)+(test/ReadTextIO/Read/_SDFBoundedSourceWrapper/ParDo(SDFBoundedSourceDoFn)/Process))+(ref_AppliedPTransform_test/FlatMap(_process_example)_104))+(ref_AppliedPTransform_test/Encode_105))+(ref_AppliedPTransform_test/SerializeBucketize_106))+(test/GroupByBucket/Write)
I0702 17:22:48.813517 139989019965248 fn_runner.py:474] Running (((ref_AppliedPTransform_validation/ReadTextIO/Read/_SDFBoundedSourceWrapper/Impulse_54)+(validation/ReadTextIO/Read/_SDFBoundedSourceWrapper/ParDo(SDFBoundedSourceDoFn)/PairWithRestriction))+(validation/ReadTextIO/Read/_SDFBoundedSourceWrapper/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction))+(ref_PCollection_PCollection_34_split/Write)
I0702 17:22:48.911520 139989019965248 fn_runner.py:474] Running (((ref_AppliedPTransform_train/ReadTextIO/Read/_SDFBoundedSourceWrapper/Impulse_6)+(train/ReadTextIO/Read/_SDFBoundedSourceWrapper/ParDo(SDFBoundedSourceDoFn)/PairWithRestriction))+(train/ReadTextIO/Read/_SDFBoundedSourceWrapper/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction))+(ref_PCollection_PCollection_1_split/Write)
I0702 17:22:49.008244 139989019965248 fn_runner.py:474] Running (((((ref_PCollection_PCollection_1_split/Read)+(train/ReadTextIO/Read/_SDFBoundedSourceWrapper/ParDo(SDFBoundedSourceDoFn)/Process))+(ref_AppliedPTransform_train/FlatMap(_process_example)_8))+(ref_AppliedPTransform_train/Encode_9))+(ref_AppliedPTransform_train/SerializeBucketize_10))+(train/GroupByBucket/Write)
Killed
```
What I've tried so far
I suspect it was OOM, so I tried reducing the number of workers (8, 4, then 1) by adding `--beam_pipeline_options=num_workers=1`, but it didn't help.
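For reference, the full invocation with the option added was:

```
# Tried with num_workers=8, 4, and 1; none of them helped.
python3 -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets='big_patent' \
  --beam_pipeline_options='num_workers=1'
```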
Need help
Is there any way to limit memory usage, or another way to generate the big_patent TFDS? Thanks!
Environment information (if applicable)
- Operating System: Ubuntu 18.04.3 LTS
- Memory size: 64GB
- Python version: 3.7
- tensorflow==1.15.3
- tensorflow-datasets==3.1.0
- tfds-nightly==3.1.0.dev202007010106
- tensorflow-gpu==1.15.2
To generate a dataset with Beam, please follow our instructions: https://www.tensorflow.org/datasets/beam_datasets#on_google_cloud_dataflow
Most likely you'll have to use Dataflow or a similar distributed runner.
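Roughly, the Dataflow route from that page looks like this (the bucket, project, and job names below are placeholders; see the linked page for the exact, up-to-date flags):

```
# Make the dataset's extra dependencies available to the Dataflow workers.
echo "tensorflow_datasets[big_patent]" > /tmp/beam_requirements.txt

python3 -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=big_patent \
  --data_dir=gs://my-tfds-bucket/tensorflow_datasets \
  --beam_pipeline_options="runner=DataflowRunner,project=my-gcp-project,job_name=big-patent-gen,staging_location=gs://my-tfds-bucket/binaries,temp_location=gs://my-tfds-bucket/temp,requirements_file=/tmp/beam_requirements.txt"
```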
I'm also having this problem when generating the librispeech dataset locally. It used up ~20 GB of memory before I had to kill it. I don't understand why it needs so much memory: the TFRecord files it generates are around 100 MB each, so why does it need to keep ~200x that amount in memory before writing to disk?
This is a separate issue, but the job also has really low CPU and disk utilisation regardless of what I set the `direct_num_workers` Beam option to.
According to the docs, there are also these options (example of passing them below):
- `max_num_workers`
- `number_of_worker_harness_threads`
- `num_workers`
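They are passed the same way as other pipeline options, e.g. (values are just examples, and I'm not sure how much these affect the local direct runner as opposed to Dataflow):

```
# Multiple Beam options go comma-separated inside --beam_pipeline_options.
python3 -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=librispeech \
  --beam_pipeline_options="num_workers=4,max_num_workers=8,number_of_worker_harness_threads=2"
```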
@jieralice13 hi! We do not recommend building large datasets with the default Beam runner. You could try building it locally with the Flink runner (https://www.tensorflow.org/datasets/beam_datasets#with_apache_flink), but otherwise I would recommend using Dataflow or a similar service to run the pipeline in a distributed way.
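For the Flink route, the invocation ends up looking roughly like this (it assumes a local Flink cluster is already running; the master address and option values are placeholders, so please follow the linked page for the full setup):

```
# Assumes a Flink cluster started locally per the linked instructions.
python3 -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=big_patent \
  --beam_pipeline_options="runner=FlinkRunner,flink_master=localhost:8081,environment_type=LOOPBACK"
```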
@cameron-martin hi! I hope this commit fixes the issue for you: https://github.com/tensorflow/datasets/commit/abde5dd710b63b9eb8fa896e2f100b68caf4391e