How to install tfds on the workers when generating a custom beam dataset on Google Cloud Dataflow?

zhiyun opened this issue 3 years ago · 5 comments

What I need help with / What I was wondering

I'm trying to prepare a custom beam dataset called librilight using TFDS on Google Cloud Dataflow. I followed the instructions for tfds new and beam datasets, and was able to run the beam pipeline successfully with DirectRunner locally.

But it failed on the Dataflow workers with the error ModuleNotFoundError: No module named 'librilight', where librilight is my custom dataset module name.

What I've tried so far

  • To tell the Dataflow workers to install TFDS with my custom dataset, I pointed them at my git repo: echo "https://github.com/zhiyun/datasets/archive/librilight.tar.gz" > /tmp/beam_requirements.txt. However, the workers failed with the module-not-found error.

  • I have the save_main_session option enabled in beam_pipeline_options, but it didn't help.

  • I also tried building a Docker image (see below), but the job failed with the same module-not-found error.

Here is the full error log.

Traceback (most recent call last):
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/main.py", line 104, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/main.py", line 99, in main
    args.subparser_fn(args)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/build.py", line 233, in _build_datasets
    _download_and_prepare(args, builder)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/build.py", line 435, in _download_and_prepare
    builder.download_and_prepare(
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 600, in download_and_prepare
    self._download_and_prepare(
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1405, in _download_and_prepare
    split_info_futures.append(future)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/split_builder.py", line 198, in maybe_beam_pipeline
    self._beam_pipeline.__exit__(None, None, None)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/apache_beam/pipeline.py", line 598, in __exit__
    self.result.wait_until_finish()
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1641, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'librilight'

Here is my script:

echo "https://github.com/zhiyun/datasets/archive/v4.8.0-branch.tar.gz" > /tmp/beam_requirements.txt
echo "wrapt" >> /tmp/beam_requirements.txt
echo "pydub" >> /tmp/beam_requirements.txt

tfds build tensorflow_datasets/datasets/librilight/ \
--manual_dir=${SOURCE_DIR} \
--data_dir=${DATA_DIR} \
--beam_pipeline_options=\
"runner=DataflowRunner,"\
"region=${REGION},"\
"project=${GCP_PROJECT},"\
"job_name=librilight-gen-${DATE},"\
"staging_location=gs://${TEMP_BUCKET}/binaries/,"\
"temp_location=gs://${TEMP_BUCKET}/tmp/,"\
"[service_account_email=ml-training@cybertron-gcp-island-test-0rxn.iam.gserviceaccount.com](mailto:service_account_email=ml-training@cybertron-gcp-island-test-0rxn.iam.gserviceaccount.com),"\
"network=cybertron-gcp-island-test-0rxn-usc1-island-vpc,"\
"subnetwork=https://www.googleapis.com/compute/v1/projects/gns-network-prod-0d38/regions/us-central1/subnetworks/cybertron-gcp-island-test-0rxn-usc1-priv-island,"\
"dataflow_service_options=enable_secure_boot,"\
"experiments=use_network_tags=allow-internet-egress,"\
"no_use_public_ips,"\
"requirements_file=/tmp/beam_requirements.txt,"\
"save_main_session"

My Dockerfile:

# syntax=docker/dockerfile:1

FROM apache/beam_python3.8_sdk:2.43.0

# Pre-built python dependencies
RUN pip install https://github.com/zhiyun/datasets/archive/librilight.tar.gz
RUN pip install wrapt
RUN pip install pydub

# Pre-built other dependencies
# RUN apt-get update \
#  && apt-get dist-upgrade \
#  && apt-get install -y --no-install-recommends ffmpeg

# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
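Note that a custom worker image only takes effect if the Dataflow job is pointed at it. A minimal sketch of that wiring, assuming the image has been pushed to the project's registry (the image path here is illustrative, not from this issue):

# Push the custom SDK container, then reference it from the pipeline.
docker push gcr.io/${GCP_PROJECT}/tfds-librilight:latest
# Add to --beam_pipeline_options:
#   "sdk_container_image=gcr.io/${GCP_PROJECT}/tfds-librilight:latest"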

It would be nice if... the tfds documentation covered this case, where we need to install TFDS with a custom dataset on Google Cloud Dataflow workers.

Environment information (if applicable)

  • Operating System:
  • Python version: python 3.8
  • tensorflow-datasets/tfds-nightly version: I have tried both tensorflow-datasets 4.8.1 and tfds-nightly. Both failed with the same errors.
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version:

zhiyun avatar Jan 09 '23 06:01 zhiyun

Hi @zhiyun, you could try to install the version of TFDS containing the definition of your dataset.

You will also need to follow the same naming pattern as other datasets (for instance, see this dataset).

So:

  1. Follow the naming pattern defined in https://www.tensorflow.org/datasets/add_dataset#default_template_tfds_new. For instance, librilight.py should be librilight_dataset_builder.py. (We are in the process of migrating to this new standard, which is why it may not be reflected when you launched tfds new.)

  2. Add this line to /tmp/beam_requirements.txt to install TFDS from your version (both steps are sketched together below):

tensorflow_datasets@git+https://github.com/zhiyun/datasets@librilight

Then tfds.load('librilight') should work.
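Putting the two steps together, a sketch (the paths follow the tfds new layout and are illustrative):

# 1. Rename the module to the new naming convention.
mv tensorflow_datasets/datasets/librilight/librilight.py \
   tensorflow_datasets/datasets/librilight/librilight_dataset_builder.py

# 2. Install TFDS from the fork on the Dataflow workers.
echo "tensorflow_datasets@git+https://github.com/zhiyun/datasets@librilight" > /tmp/beam_requirements.txt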

Please tell me how it goes! Thank you.

marcenacp avatar Jan 10 '23 13:01 marcenacp

Hi @marcenacp, thank you for the suggestions. I tried them but ran into the following errors...

  1. re: librilight_dataset_builder. I find that tfds build recognizes ${mydataset}.py rather than ${mydataset}_dataset_builder.py. I'm not sure if there are other places in the code that rely on the ${mydataset}_dataset_builder pattern that I need to be careful with. https://github.com/tensorflow/datasets/blob/e58a16fd11c496f251f4c0fea38914c6cb6d6050/tensorflow_datasets/scripts/cli/build.py#L351-L358
  2. Using echo tensorflow_datasets@git+https://github.com/zhiyun/datasets@librilight > /tmp/beam_requirements.txt, I met the error "ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?". So currently I use echo "https://github.com/zhiyun/datasets/archive/librilight.tar.gz" > /tmp/beam_requirements.txt (so that the worker should call pip install https://github.com/zhiyun/datasets/archive/librilight.tar.gz), but it still ended with the ModuleNotFoundError (a possible workaround is sketched below).
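
One possibility for point 2 (an assumption, not something confirmed in this thread): pip requirements files accept PEP 508 direct references, which name the package explicitly while installing from a plain archive URL, so no git binary is needed on the workers:

# Hypothetical requirements line: names the distribution while installing
# from the GitHub archive, avoiding the git-not-found error.
echo "tensorflow_datasets @ https://github.com/zhiyun/datasets/archive/librilight.tar.gz" > /tmp/beam_requirements.txt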

Any suggestions and comments are appreciated! Thanks again.

zhiyun avatar Jan 11 '23 18:01 zhiyun

  1. Seems like a bug; I'll send a fix today.

pierrot0 avatar Jan 13 '23 09:01 pierrot0

tfds-nightly and tensorflow-datasets 4.8.2 have been released with the fix for the issue mentioned in 1.

pierrot0 avatar Jan 18 '23 06:01 pierrot0

To launch TFDS on Dataflow with a local version of your code, you can use the following Dockerfile:

FROM apache/beam_python3.8_sdk:2.44.0

COPY . /extra_packages

ARG EXTRA_PACKAGE
RUN if [ -n "${EXTRA_PACKAGE}" ]; \
  then pip install /extra_packages/${EXTRA_PACKAGE}[tests-all]; \
  else pip install tfds-nightly; \
  fi

# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

  • Package your local version of TFDS with python setup.py bdist.
  • Move the resulting dist/*.tar.gz into the current folder so it is inside the Docker build context.
  • Build with: docker build --build-arg EXTRA_PACKAGE=tfds.tar.gz -t ${DOCKER_NAME} .
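
The steps above as commands, plus the push a Dataflow run would need afterwards (the tarball rename and the push are illustrative assumptions, not part of the original comment):

# Package the local TFDS checkout and stage the tarball in the build context.
python setup.py bdist
cp dist/*.tar.gz ./tfds.tar.gz

# Build and push the worker image; ${DOCKER_NAME} is the registry path
# you would then pass to the pipeline as sdk_container_image.
docker build --build-arg EXTRA_PACKAGE=tfds.tar.gz -t ${DOCKER_NAME} .
docker push ${DOCKER_NAME}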

marcenacp avatar Feb 16 '23 16:02 marcenacp