[Docs] Provide documentation on where blob storage is accessed and which keys are being used
Description
There are several different places where various Flyte services or tools require blob storage access, and for folks new to the project, it would be nice to have a few notes on the internals of how API calls happen. The initial question I raised in Slack at https://flyte-org.slack.com/archives/CP2HDHKE1/p1707508706335559 was
Is there a good graph of the sequence of API calls between `pyflyte run --remote` / flyteadmin and blob storage endpoints? I'm mostly trying to have a good picture around when `CreateUploadLocation` gets called / used, etc.
Some of this is documented in https://docs.flyte.org/en/latest/concepts/data_management.html, but it would be nice to extend that to include:
- `pyflyte run --remote` calls FlyteAdmin, which calls blob storage to get a signed URL and returns it to the client so that it can upload
- when inputs / outputs are downloaded / uploaded and using what creds for native Flyte tasks
- when inputs / outputs are downloaded / uploaded for raw container tasks (i.e. by copilot)
- default config vs. other config mechanisms
While Flyte has a good default approach of using IRSA in AWS, that might not always be the best fit -- so it's helpful to know some of the specifics around blob storage access patterns so that the right solution can be employed.
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
Hey @ddl-ebrown! Great to see you diving deeper into the workings of Flyte. Your curiosity really adds value to the community!
Thank you for your detailed request. I'll try to provide some insights based on the information available in the Flyte codebase and documentation.
- `pyflyte run --remote` calls FlyteAdmin, calls blob storage to get a signed URL and returns it to the client so that it can upload: The `pyflyte run --remote` command interacts with FlyteAdmin and blob storage to get a signed URL and return it to the client for upload. This is done through the `Get` function in the `AWSRemoteURL` struct, which is responsible for generating the signed URL. You can find more details in the flyteadmin/pkg/data/implementations/aws_remote_url.go file.
- When inputs / outputs are downloaded / uploaded and using what creds for native Flyte tasks: In Flyte, the process of downloading and uploading inputs and outputs for native tasks is managed by the `CoreIOStrategy` class. This class has two main attributes: `download_mode` and `upload_mode`. These attributes are instances of the `IOStrategyDownloadMode` and `IOStrategyUploadMode` classes respectively. The credentials used in this process would typically be managed by the underlying data storage or transfer service being used, and would need to be configured separately. You can find more details in the flyteidl/gen/pb_python/flyteidl/service/flyteadmin/flyteadmin/models/core_io_strategy.py file.
- When inputs / outputs are downloaded / uploaded for raw container tasks (i.e. by copilot): The context does not provide information on this. It would be helpful if you could provide more details on what you mean by "raw container tasks" and "copilot".
- Default config vs. other config mechanisms: Flyte services or tools use various blob storage access points, which are defined in the `Config` struct in the flytestdlib/storage/config.go file. The `Type` field in the `Config` struct sets the type of storage to configure. The `Connection` field defines the connection configurations, and the `Stow` field defines the configurations for the Stow backend.
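To make the signed-URL upload flow from the first point above more concrete, here is a minimal, self-contained sketch of the sequence. It is only a model under stated assumptions, not Flyte's implementation: `FakeBlobStore`, `FakeAdmin`, `upload_via_signed_url`, the bucket name, and the key layout are all hypothetical stand-ins, and the method `create_upload_location` merely borrows its name from the `CreateUploadLocation` call mentioned in the question. The point it illustrates is that only the admin side needs storage credentials, while the client uploads with the signed URL it was handed.

```python
import base64
import hashlib
from dataclasses import dataclass, field


@dataclass
class FakeBlobStore:
    """In-memory stand-in for a bucket (hypothetical, not a real S3/GCS client)."""

    objects: dict = field(default_factory=dict)
    signed: dict = field(default_factory=dict)

    def sign_put(self, key: str, digest: bytes) -> str:
        # A real store would return a time-limited URL bound to the key
        # and the expected Content-MD5; here we just remember the pair.
        url = f"https://blob.example/{key}?sig={len(self.signed)}"
        self.signed[url] = (key, base64.b64encode(digest).decode())
        return url

    def put(self, url: str, data: bytes) -> None:
        key, expected_md5 = self.signed[url]  # KeyError if the URL was never signed
        actual_md5 = base64.b64encode(hashlib.md5(data).digest()).decode()
        if actual_md5 != expected_md5:
            raise ValueError("Content-MD5 mismatch")
        self.objects[key] = data


@dataclass
class FakeAdmin:
    """Stand-in for the control plane; the only party holding storage creds."""

    store: FakeBlobStore
    bucket: str = "my-flyte-bucket"  # assumed name, for illustration only

    def create_upload_location(self, project, domain, filename, digest):
        # Assumed key layout; the real layout is decided by the server.
        key = f"{project}/{domain}/{digest.hex()}/{filename}"
        signed_url = self.store.sign_put(key, digest)
        native_url = f"s3://{self.bucket}/{key}"
        return signed_url, native_url


def upload_via_signed_url(admin, store, project, domain, filename, data):
    """Client side: ask admin for a signed URL, then upload without creds."""
    digest = hashlib.md5(data).digest()
    signed_url, native_url = admin.create_upload_location(
        project, domain, filename, digest
    )
    store.put(signed_url, data)  # an HTTP PUT against the signed URL in practice
    return native_url  # the URI later handed to the execution
```

In this model the client never touches storage credentials: calling `upload_via_signed_url(FakeAdmin(store), store, "proj", "dev", "script.tar.gz", payload)` returns the native URI, and the bytes land in `store.objects` only if they match the digest the URL was signed for.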
I hope this information is helpful. If you have any further questions or need more details, please let me know.
Sources
- flytestdlib/storage/config.go
- flyteadmin/pkg/data/implementations/aws_remote_url.go
- flyteadmin/dataproxy/service.go
- flyteidl/gen/pb_python/flyteidl/service/flyteadmin/flyteadmin/models/core_io_strategy.py