Slim versions of TFX Docker images
Could we have Docker images that are slimmer? Some examples of TFX Docker image sizes (compressed, even):
- TFX 1.0: 5.67GB
- TFX 1.5: 6.65GB
- TFX 1.10: 8.53GB
- TFX 1.15: 11.4GB
At least an explanation why the image sizes keep on growing would be great.
Or is the recommended way to build a Docker image yourself off a slim Python or Ubuntu image?
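For reference, the compressed sizes above can be checked without pulling anything, by summing the layer sizes the registry reports. A rough sketch, assuming jq is installed and the tag resolves to a single-platform manifest rather than a manifest list:

# Approximate compressed size (in GB) of a published TFX image, straight from the registry manifest.
docker manifest inspect tensorflow/tfx:1.15.1 | jq '([.layers[].size] | add) / 1e9'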
Here's what I'm seeing when I build them by hand:
beamplustensorflow latest 3651a24a1b2e 4 minutes ago 4.47GB
gcr.io/<myproject>/us-central1/beamtensorflowtfx v2.56 c2c1a8363e4f 3 hours ago 8.63GB
justbeambieber latest 925d69528ea4 7 hours ago 182MB
If that TFX image is based on the latest (1.16dev) image, then that is quite a saving, almost half. Interesting. Did you find it hard to build, @pritamdodeja?
And: justbeambieber is for sure the funniest name I've seen for a Docker image, ever!
The build wasn't so hard; I've included it below. The other images are just subsets of what's below. To your point about the tfx image: in my mind, tfx is the control plane and beam/tensorflow is the data plane, so I'd imagine the control plane doesn't add that much heft. The reason I'm going down this rabbit hole is that I have beam/tfx code that runs on DirectRunner and in Beam's embedded Docker mode, but doesn't run on DataflowRunner. I need to understand more about how the tfx image itself plays into the overall ecosystem of Vertex, TFX, Kubeflow, and Beam.
Would appreciate any advice.
# Stage 1: the official Beam SDK image, used only as the source of /opt/apache/beam (the boot entrypoint)
FROM apache/beam_python3.10_sdk:2.56.0 AS build_base
RUN groupadd -g 1000 pdodeja && useradd -m -u 1000 -g pdodeja pdodeja
RUN chown -R pdodeja:pdodeja /opt

# Stage 2: use the official Python 3.10.14 slim image as the base
FROM python:3.10.14-slim

# Create a non-root user and give it ownership of the paths it needs to write to
RUN groupadd -g 1000 pdodeja && useradd -m -u 1000 -g pdodeja pdodeja
RUN chown -R pdodeja:pdodeja /opt
RUN chown -R pdodeja:pdodeja /usr/local/lib
RUN chown -R pdodeja:pdodeja /usr/local/bin/python
USER pdodeja

# Copy the Beam SDK harness from the first stage
COPY --from=build_base /opt /opt
# RUN chown -R pdodeja:pdodeja /opt

# Switch back to root to install compilers and the Python packages
USER root
RUN echo 'deb http://deb.debian.org/debian testing main' >> /etc/apt/sources.list \
&& apt-get update && apt-get install --no-install-recommends -y gcc g++
RUN apt-get install -y build-essential
RUN pip install numpy==1.26.4
RUN pip install tensorflow==2.15.0
RUN pip install tfx==1.15.1
RUN pip install apache_beam[gcp]==2.56.0
RUN chown -R pdodeja:pdodeja /usr/local/lib

# Beam workers start the SDK harness via this entrypoint
ENTRYPOINT ["/opt/apache/beam/boot"]
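A quick sanity check of the resulting image might look like this (a sketch; the beamtensorflowtfx:dev tag is just a placeholder for whatever you name the build):

# Build the image, then confirm the Beam boot entrypoint exists and the key packages import.
docker build -t beamtensorflowtfx:dev .
docker run --rm --entrypoint ls beamtensorflowtfx:dev -l /opt/apache/beam/boot
docker run --rm --entrypoint python beamtensorflowtfx:dev -c "import tfx, tensorflow, apache_beam; print(tfx.__version__, tensorflow.__version__, apache_beam.__version__)"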
@axeltidemann can you please clarify your particular use-case? It would help me understand this better. For example, there's the container that executes each step in the Kubeflow pipeline, there's the beam container, there's the container that needs both CUDA and beam (e.g. the Transform component), and there are the Trainer and Tuner components, which I imagine need CUDA. I'm trying to understand the meaning/purpose of the tfx container itself. Appreciate your feedback!
One reason (of many) that these large images are problematic is that GCP Dataflow jobs take forever to spin up new workers: anywhere from 15 to 30 minutes, in my experience!
I initially thought this might be due to lengthy dependency installation on worker startup (as described here), but I've confirmed that my dependencies are pre-installed in my custom Docker image (based on tensorflow/tfx:1.15.1), and that I am setting the --sdk_container_image={PIPELINE_IMAGE_URI} Beam argument correctly.
Dataflow system logs confirm the long duration of the image pull:
Pulled image "<redacted>" with image id "sha256:<redacted>", repo tag "<redacted>", repo digest "<redacted>@sha256:<redacted>", size "12284987784" in 21m22.272733249s"
^ This happens for every worker that Dataflow spins up, which makes these jobs very slow to scale.
Anything that can be done to reduce the size of this 12.3GB image would be super helpful!
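One thing that may help while waiting for an official slim image is to see which layers actually dominate the pull; a sketch using the stock image as an example:

# Inspect per-layer sizes to see which build steps contribute most to the 12GB+ image.
docker pull tensorflow/tfx:1.15.1
docker history tensorflow/tfx:1.15.1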
@stefandominicus-takealot @axeltidemann Can you see if the below works to reduce the size of the container and meet your objectives? I've been using it and it has been working for me; the size appears to be ~6GB. My Transform job in Dataflow, kicked off by Vertex, ran successfully. Dataflow does show an "insight", which I believe has to do with the transform code whl being installed at runtime. TensorFlow now includes CUDA, so I believe nothing extra needs to be done for that anymore; I still need to verify this part by comparing against local runs.
# Stage 1: the official Beam SDK image, used as the source of /opt/apache/beam (the boot entrypoint)
FROM apache/beam_python3.10_sdk:2.60.0 AS build_base

# Stage 2: slim Python base with the Beam harness copied in
FROM python:3.10.14-slim
COPY --from=build_base /opt /opt
USER root
RUN echo 'deb http://deb.debian.org/debian testing main' >> /etc/apt/sources.list \
&& apt-get update && apt-get install --no-install-recommends -y gcc g++
RUN apt-get install -y build-essential
RUN pip install numpy==1.26.4
RUN pip install tensorflow==2.15.0
RUN pip install tfx==1.15.1
RUN pip install apache_beam[gcp]==2.60.0
RUN pip install kfp==2.8.0
RUN pip install keras-cv==0.9.0
# Run the Beam SDK harness in the container's existing Python environment instead of creating a new virtualenv at worker startup
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
ENTRYPOINT ["/opt/apache/beam/boot"]
You can build it like this:
export IMAGE_NAME="beam260tf215tfx151"
docker build . -t "${IMAGE_NAME}" # Have content above in Dockerfile in current directory
docker tag beam260tf215tfx151:latest us-east1-docker.pkg.dev/<project>/<repo>/beam260tf215tfx151:v2.60
and you can push it like this:
docker push us-east1-docker.pkg.dev/<project>/<repo>/beam260tf215tfx151:v2.60
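As a rough smoke test that Dataflow workers can actually pull and use the pushed image, something like the following should work (a sketch with placeholder project/bucket values; the bundled wordcount example is just a stand-in for a real pipeline):

# Run a small Beam job on Dataflow with the custom SDK container image.
python -m apache_beam.examples.wordcount \
  --runner=DataflowRunner \
  --project=<project> \
  --region=us-east1 \
  --temp_location=gs://<bucket>/tmp \
  --output=gs://<bucket>/wordcount/out \
  --sdk_container_image=us-east1-docker.pkg.dev/<project>/<repo>/beam260tf215tfx151:v2.60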
Dear Users, thank you for showing interest. We want to inform you that we are currently not picking up this activity. We truly appreciate your contribution to help us improve. Your input is valuable! Thank you!
@stefandominicus-takealot was the docker repo in the same region as where the pipeline was executing? Let me know if the above suggestion solves your issue.
@janasangeetha I will try to make progress on this and track it here. It might result in some documentation updates or some commits to the repo. I believe this should be a solvable issue. My thinking is that the container should be able to execute all steps of a tfx pipeline both locally and in GCP. Beam 2.61 just came out. My 2.60 containers are working without issue in GCP; I'll look at how the tfx container is built and test against pipelines that I know are working.
@stefandominicus-takealot I tested this with a beam 2.61 container using a Dockerfile like the above, and my container is retrieved in 2 minutes 51 seconds. The tfx container you're using is 4.27x the size of the custom container I'm using. The only job with an insight about container image pre-building is the transform one.
I spent some more time digging into this, and I'm going to summarize my understanding, what I've tried, and where things stand as I see it.
Requirements:
- TFX should work (e.g. various runners can execute components via the python -m approach)
- Beam/Dataflow should work (e.g. /opt/apache/beam/boot exists and is the entrypoint, and /usr/local/bin/pip points to the pip Beam should use)
- GPU should work (e.g. tf.config.list_physical_devices() should show the GPU)
Testing (local):
In my mind, if a pipeline works with Beam's embedded Docker mode, that pipeline should also work in Dataflow, since in both cases the Beam container is what gets executed.
I can bring up/build these containers locally (Fedora 41 + Docker). For example, I have a 13GB container that I can get into, which uses pyenv to build the Python runtime, uses an NVIDIA base image, and has apache_beam[gcp] at 2.61.0 and tfx at 1.15.0. I can likely integrate the requirements.txt from the official container and see what it balloons up to. Speaking of the tfx container: with a few modifications of requirements.txt, I can build a tfx 1.16dev image in line with what's in tfx/tools/docker/build_docker_image.sh (it's about 25GB). In local testing via gcloud ai custom-jobs local-run --executor-image, I can verify the GPU is seen, and I can go in and check out /opt/apache/beam/boot, tf, tfx, etc. However, in GCP, these containers don't work. I'm seeing errors like
failed to create prepare snapshot dir: failed to create temp dir: mkdir /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/new-1207085269: no space left on device"
I'm going to try to address this by giving the workers more space. What is surprising is that bigger containers have successfully executed Dataflow steps.
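For the disk-space part specifically, Dataflow worker disk size can be bumped through a pipeline option; a sketch of the kind of flags I'd pass via beam_pipeline_args (the 100GB value and image reference are arbitrary examples):

# Extra Dataflow options to give each worker more local disk for pulling and unpacking a large image.
BEAM_ARGS=(
  "--runner=DataflowRunner"
  "--disk_size_gb=100"
  "--sdk_container_image=us-east1-docker.pkg.dev/<project>/<repo>/<image>:<tag>"
)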
I believe we need better requirements/documentation/tests for the Docker container. Secondly, I believe the container, as set up, is doing too much: it needs to be tfx + tf + CUDA + apache_beam all at once. I've read there are issues with conda and Beam, yet some of the components are conda-based, and I see differences between v1.14 and v1.15. The container build is also very complex (e.g. wheel-builder building the local tfx and ml-pipelines-sdk wheels, then installing them). I understand that makes sense from a CI/CD perspective, but in my opinion the container that ships should be simpler, and maybe it can be separate from the one that builds the wheels, installs them, and tests against pipelines.
In any case, the upstream dependency on, and opaqueness of, the deep learning containers, coupled with some of the other complexity, makes this tricky to solve. It would be great if we could establish a contract of sorts that these containers can be verified against. For example, using gcloud ai custom-jobs local-run to verify the container meets the contract locally or via Cloud Shell, so that container troubleshooting is not as complex.
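A minimal version of that contract check could be a handful of docker run probes against the built image (a sketch; the image name is a placeholder, and the GPU probe assumes the host has NVIDIA drivers and the NVIDIA container toolkit):

IMAGE=beamtensorflowtfx:dev   # placeholder: the locally built candidate image

# TFX: components can be executed via the python -m approach.
docker run --rm --entrypoint python "${IMAGE}" -c "import tfx; print(tfx.__version__)"

# Beam/Dataflow: the boot entrypoint exists and /usr/local/bin/pip is the pip Beam will use.
docker run --rm --entrypoint /bin/sh "${IMAGE}" -c "ls -l /opt/apache/beam/boot /usr/local/bin/pip"

# GPU: TensorFlow can see the device when it is passed through.
docker run --rm --gpus all --entrypoint python "${IMAGE}" -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"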
Lastly, I believe having these containers work in GCP for a particular release is critical, since large-scale training using tfx depends on it. For example, the 1.15 container does not work, and I've seen issues where people are staying on 1.14 because of this.
I'll post my updates here, I appreciate any guidance and direction that others can provide.
Here's a container that I built that appears to be working in terms of tfx, tf, beam, and CUDA. The last point I'm inferring indirectly, because the TensorBoard profiling I'm doing appears to be producing data.
FROM apache/beam_python3.10_sdk:2.61.0 AS build_base
FROM python:3.10.14-slim
COPY --from=build_base /opt /opt
USER root
RUN echo 'deb http://deb.debian.org/debian testing main' >> /etc/apt/sources.list \
&& apt-get update && apt-get install --no-install-recommends -y gcc g++
RUN apt-get install -y build-essential
RUN pip install numpy==1.26.4
RUN pip install 'tensorflow[and-cuda]==2.15.1'
RUN pip install tfx==1.15.1
RUN pip install apache_beam[gcp]==2.61.0
RUN pip install kfp==2.8.0
RUN pip install keras-cv==0.9.0
# Link the CUDA libraries installed by the nvidia-* pip wheels into the TensorFlow package directory so they are found at runtime
RUN cd $(dirname $(python -c 'print(__import__("tensorflow").__file__)')) && ln -svf ../nvidia/*/lib/*.so* . && cd -
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
ENTRYPOINT ["/opt/apache/beam/boot"]
The container appears to be a little over 9GB, so it loads fairly quickly. The biggest contributor to this is the 7GB layer created when TensorFlow is installed with CUDA. It includes the ml-pipelines-sdk as well. I'm no longer seeing any of the issues with layers not loading, etc. that I was seeing before.
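To check that the symlink step actually leaves the CUDA libraries where TensorFlow will look for them, a sketch like this can be run before pushing (the image tag is a placeholder; the path assumes the python:3.10 slim base used above):

# Confirm the NVIDIA CUDA wheels are installed and the symlinks next to TensorFlow resolve.
docker run --rm --entrypoint /bin/sh beam261-tf215-tfx151:dev -c \
  'pip list | grep -i nvidia; ls -lL /usr/local/lib/python3.10/site-packages/tensorflow/libcudart*'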
Hello! :wave:
We've created a Docker image that significantly reduces the size compared to the standard TFX Docker images. Image size:
- Compressed: 1.2 GB
- Decompressed: 3.7 GB
It has been tested successfully on a Vertex AI pipeline.
Here is the Dockerfile:
FROM python:3.10
# see https://pypi.org/project/tfx/ for package compatibility
RUN pip install --upgrade --no-cache-dir pip \
&& pip install --upgrade --no-cache-dir apache-beam[gcp]==2.56.0 \
&& pip install --upgrade --no-cache-dir tfx[kfp]==1.15.1
# Copy files from the official Apache Beam SDK image
# see https://cloud.google.com/dataflow/docs/guides/build-container-image for more information
COPY --from=apache/beam_python3.10_sdk:2.56.0 /opt/apache/beam /opt/apache/beam
# Set the entrypoint to Apache Beam SDK launcher
ENTRYPOINT ["/opt/apache/beam/boot"]
Excellent work, @KholofeloPhahlamohlaka-TAL ! (Full disclosure: we work together, but I think he deserves praise on the world wide web as well!)
Hi @KholofeloPhahlamohlaka-TAL We appreciate your efforts in building the slimmer version of TFX docker image. I will check with the team internally on next steps and provide an update here. Thank you!
We would be very interested in anyone's experience building TFX images with Nvidia GPU support. In our experience, this can easily double the size, and it's not easy to get TF/TFX/CUDA etc versions to align and be 'found' by TF
@adriangay can you see if the solution I provided above on Dec 6th solves your issue? I still need to figure out the TPU side, but I believe it should address the NVIDIA GPU + tensorflow (2.15.1) + tfx (1.15.1) situation.
I ran across this in something we were doing for a customer, and it was very strange; not sure what to make of it. I built a Docker container locally and verified that the GPU was being seen through tf.config.list_physical_devices(). I pushed the image to the container registry in GCP, and then in my training job I specified that I want to attach a web console. While training is happening, I get to the console, run ipython, import tensorflow, and look for GPUs, and I don't find them. Does anybody have any idea how to debug this? My Dockerfile is as follows (a sketch of the checks I'd run from the console follows it):
FROM apache/beam_python3.10_sdk:2.59.0 AS build_base
FROM python:3.10.14-slim
COPY --from=build_base /opt /opt
USER root
RUN echo 'deb http://deb.debian.org/debian testing main' >> /etc/apt/sources.list \
&& apt-get update && apt-get install --no-install-recommends -y gcc g++
RUN apt-get install -y build-essential
RUN pip install numpy==1.26.4
RUN pip install 'tensorflow[and-cuda]==2.15.1'
RUN pip install tfx==1.15.1
RUN pip install apache_beam[gcp]==2.59.0
RUN pip install kfp==2.8.0
RUN pip install keras-cv==0.9.0
RUN pip install pycocotools==2.0.8
RUN cd $(dirname $(python -c 'print(__import__("tensorflow").__file__)')) && ln -svf ../nvidia/*/lib/*.so* . && cd -
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
ENTRYPOINT ["/opt/apache/beam/boot"]
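For debugging the missing GPU from inside the attached web console, these are the generic checks I'd start with (a sketch; nothing here is specific to Vertex AI beyond the console running inside the training container):

# 1. Is a GPU exposed to this container at all? If nvidia-smi is absent or fails, the device
#    isn't being passed through, regardless of what is installed in the image.
nvidia-smi

# 2. Can TensorFlow see it from this exact Python environment?
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# 3. Are the symlinked CUDA libraries present and resolvable where TensorFlow expects them?
echo "${LD_LIBRARY_PATH}"
ls -lL /usr/local/lib/python3.10/site-packages/tensorflow/libcud*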