[BUG] GCP deployments require a boto.cfg file in order to execute workflows with GCS storage
Describe the bug
Deployments to GCP that use GCS for storage and GKE with workload identity (as described here) fail because a gsutil configuration file (boto.cfg) is missing from the task image.
On pod init, the task pod fails with the following error:
$ k -n development logs f75075fdeda774b358b4-n0-0
{"asctime": "2022-07-28 14:43:17,168", "name": "flytekit", "levelname": "ERROR", "message": "Error from command '['gsutil', 'cp', 'gs://flyte-ts-temp-service-flyte/t2/flytesnacks/development/2GTQFPWGXQNYTUULNF2SODVCXM======/scriptmode.tar.gz', '/root']':\nb'ServiceException: 401 Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.\\n'\n"}
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/flytekit/core/data_persistence.py", line 423, in get_data
data_persistence_plugin(data_config=self.data_config).get(
File "/usr/local/lib/python3.8/site-packages/flytekit/extras/persistence/gcs_gsutil.py", line 89, in get
return _update_cmd_config_and_execute(cmd)
File "/usr/local/lib/python3.8/site-packages/flytekit/extras/persistence/gcs_gsutil.py", line 14, in _update_cmd_config_and_execute
return subprocess.check_call(cmd, env=env)
File "/usr/local/lib/python3.8/site-packages/flytekit/tools/subprocess.py", line 26, in check_call
raise Exception(
Exception: Called process exited with error code: 1. Stderr dump:
b'ServiceException: 401 Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.\n'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/pyflyte-fast-execute", line 8, in <module>
sys.exit(fast_execute_task_cmd())
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/flytekit/bin/entrypoint.py", line 495, in fast_execute_task_cmd
_download_distribution(additional_distribution, dest_dir)
File "/usr/local/lib/python3.8/site-packages/flytekit/tools/fast_registration.py", line 94, in download_distribution
file_access.get_data(additional_distribution, destination)
File "/usr/local/lib/python3.8/site-packages/flytekit/core/data_persistence.py", line 427, in get_data
raise FlyteAssertion(
flytekit.exceptions.user.FlyteAssertion: Failed to get data from gs://flyte-ts-temp-service-flyte/t2/flytesnacks/development/2GTQFPWGXQNYTUULNF2SODVCXM======/scriptmode.tar.gz to /root (recursive=False).
Original exception: Called process exited with error code: 1. Stderr dump:
b'ServiceException: 401 Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.\n'
According to this Slack thread, the cause appears to be a gsutil configuration that is missing from the Dockerfile: standalone gsutil cannot authenticate without additional config.
See this Dockerfile. It runs:
RUN pip install gsutil
but it is missing the boto.cfg configuration required for GKE workload identity to work.
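To confirm that a failing pod is hitting this particular problem, a small diagnostic along the following lines may help. It is only a sketch, not part of flytekit: the helper names `is_anonymous_caller_error` and `find_boto_config` are hypothetical, the marker string is copied from the error above, and the search locations (`BOTO_CONFIG`, `/etc/boto.cfg`, `~/.boto`) are assumed to be the places gsutil looks for its boto configuration.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional

# Substring copied from the gsutil error quoted in this issue.
ANONYMOUS_MARKER = "Anonymous caller does not have"


def is_anonymous_caller_error(stderr: bytes) -> bool:
    """Return True if stderr looks like the unauthenticated 401 shown above."""
    text = stderr.decode("utf-8", errors="replace")
    return "401" in text and ANONYMOUS_MARKER in text


def find_boto_config(
    candidates: Optional[Iterable[Path]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the first existing boto config file, or None if there is none.

    Assumption: BOTO_CONFIG (if set) is checked first, followed by the
    default locations /etc/boto.cfg and ~/.boto.
    """
    env = os.environ if env is None else env
    paths = []
    if env.get("BOTO_CONFIG"):
        paths.append(Path(env["BOTO_CONFIG"]))
    if candidates is None:
        candidates = [Path("/etc/boto.cfg"), Path.home() / ".boto"]
    paths.extend(candidates)
    for path in paths:
        if path.is_file():
            return path
    return None
```

If `is_anonymous_caller_error` returns True for the captured stderr and `find_boto_config()` returns None inside the container, the missing boto.cfg is a likely cause.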
In order to solve this, Soren Brunk on Slack suggested adding the following to the Dockerfile and running pyflyte with a custom derived image:
FROM ghcr.io/flyteorg/flytekit:py3.8-1.0.3
# Required for gsutil to work with workload-identity
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg
The last line writes the boto.cfg configuration that gsutil needs in order to work with workload identity.
This is only needed for GCP deployments, so a fix should apply only to that environment.
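As a rough illustration of a GCP-only fix, the same two-line configuration could be written at startup instead of being baked into the image. This is a sketch under stated assumptions, not how flytekit behaves today: `render_boto_cfg` and `ensure_boto_cfg` are hypothetical helpers, and the `on_gcp` flag stands in for whatever mechanism the deployment uses to signal that it runs on GCP.

```python
import configparser
import io
from pathlib import Path


def render_boto_cfg() -> str:
    """Render the [GoogleCompute] section used in the Dockerfile workaround."""
    parser = configparser.ConfigParser()
    parser["GoogleCompute"] = {"service_account": "default"}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


def ensure_boto_cfg(path: Path, on_gcp: bool) -> bool:
    """Write boto.cfg only for GCP deployments and only if it is absent.

    Returns True if the file was written, False otherwise.
    `on_gcp` is a hypothetical flag supplied by the caller.
    """
    if not on_gcp or path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_boto_cfg())
    return True
```

Calling `ensure_boto_cfg(Path("/etc/boto.cfg"), on_gcp=True)` before the first gsutil invocation would mirror the Dockerfile line above, while leaving an existing configuration and non-GCP environments untouched.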
Expected behavior
The pods should start up without a 401 error.
Additional context to reproduce
- Follow GCP manual deployment directions (or Opta deployment)
- Run a workflow using pyflyte.
Screenshots
Slack reference: This thread post and below.
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
@samhita-alla @SmritiSatyanV - do you think you can find a place to add this to the docs somewhere? We should also figure out how to support this in the default gcs case.
Also, I think @kumare3 found that using the fsspec plugin gets around this issue.
Sure @wild-endeavor, will take a look at this.