How to install tfds on the workers when generating a custom beam dataset on Google Cloud Dataflow?
What I need help with / What I was wondering
I’m trying to prepare a custom Beam dataset called librilight using TFDS on Google Cloud Dataflow. I followed the instructions on `tfds new` and Beam datasets, and was able to run the Beam pipeline successfully with DirectRunner locally.
But it failed on the Dataflow workers with `ModuleNotFoundError: No module named 'librilight'`, where librilight is my custom dataset module name.
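For reference, the local run that works looks roughly like this (Beam falls back to DirectRunner when no runner is specified; the directory variables are placeholders):

```bash
# Local run that succeeds: no runner specified, so Beam uses DirectRunner.
# SOURCE_DIR and DATA_DIR are placeholders.
tfds build tensorflow_datasets/datasets/librilight/ \
  --manual_dir=${SOURCE_DIR} \
  --data_dir=${DATA_DIR}
```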
What I've tried so far
- To tell Dataflow workers to install tfds with my custom dataset, I used my git repo. I have
  ```
  echo "https://github.com/zhiyun/datasets/archive/librilight.tar.gz" > /tmp/beam_requirements.txt
  ```
  However, the worker failed with the module-not-found error.
- I have the `save_main_session` option enabled in `beam_pipeline_options`, but it didn't help.
- I also tried to build a docker image, but it failed with the same module-not-found error.
Here is the full error log.
```
Traceback (most recent call last):
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/main.py", line 104, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/main.py", line 99, in main
    args.subparser_fn(args)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/build.py", line 233, in _build_datasets
    _download_and_prepare(args, builder)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/build.py", line 435, in _download_and_prepare
    builder.download_and_prepare(
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 600, in download_and_prepare
    self._download_and_prepare(
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1405, in _download_and_prepare
    split_info_futures.append(future)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/split_builder.py", line 198, in maybe_beam_pipeline
    self._beam_pipeline.__exit__(None, None, None)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/apache_beam/pipeline.py", line 598, in __exit__
    self.result.wait_until_finish()
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1641, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'librilight'
```
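The last frame shows dill calling `__import__('librilight')` on the worker, so the module has to be importable in whatever environment the workers run. A minimal sanity check against a candidate worker image (using the base image from my Dockerfile below) would be something like:

```bash
# Does the worker image even have the module? Expected to raise
# ModuleNotFoundError unless the package is baked in or installed.
docker run --rm --entrypoint python apache/beam_python3.8_sdk:2.43.0 \
  -c "import librilight"
```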
Here is my script:
```bash
echo "https://github.com/zhiyun/datasets/archive/v4.8.0-branch.tar.gz" > /tmp/beam_requirements.txt
echo "wrapt" >> /tmp/beam_requirements.txt
echo "pydub" >> /tmp/beam_requirements.txt

tfds build tensorflow_datasets/datasets/librilight/ \
  --manual_dir=${SOURCE_DIR} \
  --data_dir=${DATA_DIR} \
  --beam_pipeline_options=\
"runner=DataflowRunner,"\
"region=${REGION},"\
"project=${GCP_PROJECT},"\
"job_name=librilight-gen-${DATE},"\
"staging_location=gs://${TEMP_BUCKET}/binaries/,"\
"temp_location=gs://${TEMP_BUCKET}/tmp/,"\
"service_account_email=ml-training@cybertron-gcp-island-test-0rxn.iam.gserviceaccount.com,"\
"network=cybertron-gcp-island-test-0rxn-usc1-island-vpc,"\
"subnetwork=https://www.googleapis.com/compute/v1/projects/gns-network-prod-0d38/regions/us-central1/subnetworks/cybertron-gcp-island-test-0rxn-usc1-priv-island,"\
"dataflow_service_options=enable_secure_boot,"\
"experiments=use_network_tags=allow-internet-egress,"\
"no_use_public_ips,"\
"requirements_file=/tmp/beam_requirements.txt,"\
"save_main_session"
```
My Dockerfile:
```dockerfile
# syntax=docker/dockerfile:1
FROM apache/beam_python3.8_sdk:2.43.0

# Pre-built python dependencies
RUN pip install https://github.com/zhiyun/datasets/archive/librilight.tar.gz
RUN pip install wrapt
RUN pip install pydub

# Pre-built other dependencies
# RUN apt-get update \
#     && apt-get dist-upgrade \
#     && apt-get install -y --no-install-recommends ffmpeg

# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
```
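For completeness, this is roughly how I wire such an image into the job (a sketch; the registry path is a placeholder, and custom containers require Dataflow Runner v2):

```bash
# Build and push the image somewhere the workers can pull from
# (registry path and tag are placeholders).
docker build -t gcr.io/${GCP_PROJECT}/tfds-librilight:latest .
docker push gcr.io/${GCP_PROJECT}/tfds-librilight:latest

# Then add to --beam_pipeline_options:
#   "sdk_container_image=gcr.io/${GCP_PROJECT}/tfds-librilight:latest,"\
#   "experiments=use_runner_v2,"\
```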
It would be nice if... the TFDS documentation could cover this case, where we need to install TFDS with a custom dataset on Google Cloud Dataflow workers.
Environment information (if applicable)
- Operating System:
- Python version: python 3.8
- `tensorflow-datasets`/`tfds-nightly` version: I have tried both tensorflow-datasets 4.8.1 and tfds-nightly. Both failed with the same errors.
- `tensorflow`/`tensorflow-gpu`/`tf-nightly`/`tf-nightly-gpu` version:
Hi @zhiyun, you could try to install the version of TFDS containing the definition of your dataset.
You will also need to follow the same naming pattern as other datasets (for instance, see this dataset).
So:
1. Follow the naming pattern defined in https://www.tensorflow.org/datasets/add_dataset#default_template_tfds_new. For instance, `librilight.py` should be `librilight_dataset_builder.py`. (We are in the process of migrating to this new standard, which is why it may not be reflected when you launched `tfds new`.)
2. Add this line in `/tmp/beam_requirements.txt` to install TFDS from your version:
   ```
   tensorflow_datasets@git+https://github.com/zhiyun/datasets@librilight
   ```

Then `tfds.load('librilight')` should work.
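Putting it together, your requirements file would then look roughly like this (keeping the extra wrapt/pydub dependencies from your script):

```bash
# Assemble the requirements file the Dataflow workers will pip-install from.
cat > /tmp/beam_requirements.txt <<'EOF'
tensorflow_datasets@git+https://github.com/zhiyun/datasets@librilight
wrapt
pydub
EOF
```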
Please, tell me how it goes! Thank you.
Hi @marcenacp, thank you for the suggestions. I tried them but ran into the following errors...
- re: `librilight_dataset_builder`. I find that `tfds build` would recognize `${mydataset}.py` instead of `${mydataset}_dataset_builder.py`. I'm not sure if there are other places in the code that rely on the `${mydataset}_dataset_builder` pattern that I need to be careful with: https://github.com/tensorflow/datasets/blob/e58a16fd11c496f251f4c0fea38914c6cb6d6050/tensorflow_datasets/scripts/cli/build.py#L351-L358
- using `echo tensorflow_datasets@git+https://github.com/zhiyun/datasets@librilight > /tmp/beam_requirements.txt`, I met the error `ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?`. So currently I use `echo "https://github.com/zhiyun/datasets/archive/librilight.tar.gz" > /tmp/beam_requirements.txt` (so that the worker should call `pip install https://github.com/zhiyun/datasets/archive/librilight.tar.gz`), but it ended up with the ModuleNotFound error (a git-free variant is sketched below).
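For reference, the git-free variant I could try instead (assuming the workers' pip accepts a PEP 508 direct-URL requirement, which needs no git binary):

```bash
# Same pin, but as a direct-URL requirement: pip downloads the archive
# itself, so no git is needed on the worker.
echo "tensorflow_datasets @ https://github.com/zhiyun/datasets/archive/librilight.tar.gz" \
  > /tmp/beam_requirements.txt
```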
Any suggestions and comments are appreciated! Thanks again.
- Seems like a bug, I'll send a fix today.
tfds-nightly and tensorflow-datasets 4.8.2 have been released with the fix for the issue mentioned in 1.
To launch TFDS on Dataflow with a local version of your code, you can use the following Dockerfile:
```dockerfile
FROM apache/beam_python3.8_sdk:2.44.0

COPY . /extra_packages
ARG EXTRA_PACKAGE
RUN if [ -n "${EXTRA_PACKAGE}" ]; \
    then pip install /extra_packages/${EXTRA_PACKAGE}[tests-all]; \
    else pip install tfds-nightly; \
    fi

# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
```
1. Package your local version of TFDS with `python setup.py bdist`.
2. Move `dist/*.tar.gz` to the current folder to have it in the build context.
3. Build with:
   ```
   docker build --build-arg EXTRA_PACKAGE=tfds.tar.gz -t ${DOCKER_NAME} .
   ```
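Roughly end-to-end (names are placeholders; pointing the job at the image via `sdk_container_image` is one way to use it, assuming Runner v2):

```bash
# Package local TFDS, bake it into the image, and push it.
python setup.py bdist                  # produces dist/*.tar.gz
mv dist/*.tar.gz ./tfds.tar.gz         # assumes a single tarball; brings it into the build context
docker build --build-arg EXTRA_PACKAGE=tfds.tar.gz -t ${DOCKER_NAME} .
docker push ${DOCKER_NAME}

# Then reference the image in --beam_pipeline_options, e.g.:
#   "sdk_container_image=${DOCKER_NAME},"\
#   "experiments=use_runner_v2,"\
```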