[BUG] GCP deployments require a boto.cfg file in order to execute workflows with GCS storage

Open tszumowski opened this issue 3 years ago • 2 comments

Describe the bug

Deployments to GCP that use GCS for storage and GKE with workload identity (as described here) fail because a configuration file is missing.

On pod initialization, the task pod logs the following error:

$ k -n development logs f75075fdeda774b358b4-n0-0
{"asctime": "2022-07-28 14:43:17,168", "name": "flytekit", "levelname": "ERROR", "message": "Error from command '['gsutil', 'cp', 'gs://flyte-ts-temp-service-flyte/t2/flytesnacks/development/2GTQFPWGXQNYTUULNF2SODVCXM======/scriptmode.tar.gz', '/root']':\nb'ServiceException: 401 Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.\\n'\n"}
Traceback (most recent call last):
 File "/usr/local/lib/python3.8/site-packages/flytekit/core/data_persistence.py", line 423, in get_data
  data_persistence_plugin(data_config=self.data_config).get(
 File "/usr/local/lib/python3.8/site-packages/flytekit/extras/persistence/gcs_gsutil.py", line 89, in get
  return _update_cmd_config_and_execute(cmd)
 File "/usr/local/lib/python3.8/site-packages/flytekit/extras/persistence/gcs_gsutil.py", line 14, in _update_cmd_config_and_execute
  return subprocess.check_call(cmd, env=env)
 File "/usr/local/lib/python3.8/site-packages/flytekit/tools/subprocess.py", line 26, in check_call
  raise Exception(
Exception: Called process exited with error code: 1. Stderr dump:

b'ServiceException: 401 Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/usr/local/bin/pyflyte-fast-execute", line 8, in <module>
  sys.exit(fast_execute_task_cmd())
 File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
  return self.main(*args, **kwargs)
 File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
  rv = self.invoke(ctx)
 File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
  return ctx.invoke(self.callback, **ctx.params)
 File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
  return __callback(*args, **kwargs)
 File "/usr/local/lib/python3.8/site-packages/flytekit/bin/entrypoint.py", line 495, in fast_execute_task_cmd
  _download_distribution(additional_distribution, dest_dir)
 File "/usr/local/lib/python3.8/site-packages/flytekit/tools/fast_registration.py", line 94, in download_distribution
  file_access.get_data(additional_distribution, destination)
 File "/usr/local/lib/python3.8/site-packages/flytekit/core/data_persistence.py", line 427, in get_data
  raise FlyteAssertion(
flytekit.exceptions.user.FlyteAssertion: Failed to get data from gs://flyte-ts-temp-service-flyte/t2/flytesnacks/development/2GTQFPWGXQNYTUULNF2SODVCXM======/scriptmode.tar.gz to /root (recursive=False).

Original exception: Called process exited with error code: 1. Stderr dump:

b'ServiceException: 401 Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.\n'

According to this Slack thread, the cause appears to be a gsutil configuration file that is missing from the Docker image: standalone gsutil cannot authenticate without additional configuration.

See this Dockerfile. It runs:

RUN pip install gsutil

but it does not create the boto.cfg file that gsutil needs for GKE workload identity to work.

To work around this, Soren Brunk suggested on Slack building a custom derived image from the following Dockerfile and running pyflyte with that image:

FROM ghcr.io/flyteorg/flytekit:py3.8-1.0.3

# Required for gsutil to work with workload-identity
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

That last line adds the configuration required for the workload identity to work.
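One caveat about that `echo` line: the `\n` is only turned into a newline by shells whose `echo` interprets backslash escapes; `bash`, by default, writes it literally. A minimal sketch of a more portable way to produce the same two-line file uses `printf`. Note that `BOTO_CFG_PATH` is a hypothetical variable introduced here so the snippet can be tried outside the image; the Dockerfile above writes to `/etc/boto.cfg`.

```shell
# Hypothetical target path (an assumption for this sketch);
# the Dockerfile in this issue writes to /etc/boto.cfg instead.
BOTO_CFG_PATH="${BOTO_CFG_PATH:-$(mktemp)}"

# printf expands \n in its format string in every POSIX shell, unlike echo.
printf '[GoogleCompute]\nservice_account = default\n' > "$BOTO_CFG_PATH"

# Show the result: the section header and the key should be on separate lines.
cat "$BOTO_CFG_PATH"
```

In a Dockerfile, the same `printf` command could follow `RUN` with `/etc/boto.cfg` as the target. If the file has to live elsewhere, gsutil also reads the path given in the `BOTO_CONFIG` environment variable.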

This is only needed for GCP deployments, so a fix should apply only to that environment.

Expected behavior

The pods should start up without a 401 error.

Additional context to reproduce

  1. Follow the GCP manual deployment directions (or the Opta deployment).
  2. Run a workflow using pyflyte.

Screenshots

Slack reference: This thread post and below.

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

Jul 29 '22 18:07 tszumowski

@samhita-alla @SmritiSatyanV - do you think you can find a place to add this to the docs somewhere? We should also figure out how to support this in the default gcs case.

Also, I think @kumare3 found that using the fsspec plugin gets around this issue.

Aug 05 '22 18:08 wild-endeavor

Sure @wild-endeavor, will take a look at this.

Aug 06 '22 09:08 SmritiSatyanV