tensorstore icon indicating copy to clipboard operation
tensorstore copied to clipboard

Use AWS C++ SDK for communicating with S3

Open sjperkins opened this issue 1 year ago • 25 comments

Based on this discussion:

  • https://github.com/google/tensorstore/issues/114#issuecomment-1936418508

TODO:

  • ~~[x] Integrate AWS CPP SDK #defines into a config.h file variant. https://github.com/google/tensorstore/pull/149#discussion_r1548284979~~

sjperkins avatar Apr 02 '24 11:04 sjperkins

So a first order strategy here might be:

and in S3RequestBuilder::BuildRequest:

  • Create a dummy Aws::Http::StandardHttpRequest.
    • Aws::String objects are simply typedef std::basic_strings so that shouldn't be an issue.
    • An Adapter will be needed to represent the absl::Cord payload as an Aws::Iostream.
  • Sign the Aws::Http::StandardHttpRequest with an AwsAuthV4Signer using the Aws::Auth::DefaultCredentialProviderChain.
  • Get the Authorization Header off the Aws::Http::StandardHttpRequest and use in tensorstore::http::HttpRequest
  • Resolve S3 Endpoints with Aws::S3EndPoint::S3EndPointProvider (apart from the core sdk dependency this would require a dependency on the s3 sdk).

I guess at some point the S3 Encryption API will become relevant.

Edit: Updated with EndPointProvider dependency.

sjperkins avatar Apr 03 '24 15:04 sjperkins

Still WIP, none of the existing codepaths have been touched.

Its still in prototype form, but its at the point where I'm confident that some variant of S3RequesBuilder that uses the AWS SDK Auth mechanisms could work.

I'm still unsure how much of the AWS S3 Endpoint calculation code could be separated from S3Client -- parts of it seem tightly coupled.

sjperkins avatar Apr 26 '24 16:04 sjperkins

Hello, just checking in here. I've been reading more of the AWS C++ SDK and Tensorstore code base.

It's probably possible to provide Adapters for the AWS C++ SDK classes in order to adapt them for use by the tensorstore-equivalent implementation:

  • Aws::Http::HttpRequest
  • Aws::Http::HttpResponse
  • Aws::Http::HttpClient
  • Aws::Utils::Threading::Executor

This would probably allow the use of the various models S3Client, for example

  • https://sdk.amazonaws.com/cpp/api/LATEST/aws-cpp-sdk-s3/html/class_aws_1_1_s3_1_1_model_1_1_head_bucket_request.html
  • https://sdk.amazonaws.com/cpp/api/LATEST/aws-cpp-sdk-s3/html/class_aws_1_1_s3_1_1_model_1_1_put_object_request.html

See below for the idea with a HEAD request

  class CustomExecutor : public Aws::Utils::Threading::Executor {
  protected:
    // Compiles, but only demonstrates the idea
    // Defer to tensorstore executor
    bool SubmitToThread(std::function<void()> && fn) override {
      auto task = ::tensorstore::ExecutorTask(fn);
      auto bound_fn = ::tensorstore::WithExecutor(..., task);
      ...
      return true;
    }
  };

  // In s3_context.cc
  // Override Aws::Http::HttpClient during AWS setup, with the intention of using
  // tensorflow's HttpClient (and curl implementation)
  // Aws::Http::SetHttpClientFactory(Aws::MakeShared<CustomHttpFactory>(kAWSTag));
  auto ctx = GetAwsContext();

  auto cfg = Aws::Client::ClientConfiguration();
  // Override the default AWS executor with tensorflow's Executor
  cfg.executor = Aws::MakeShared<CustomExecutor>(kAWSTag);

  auto client = Aws::S3::S3Client(cfg);
  // Inherits from Aws::Http::HttpRequest
  auto head_bucket = Aws::S3::Model::HeadBucketRequest().WithBucket("bucket");
  // Under the hood, submits Aws::Http::HttpRequest to CustomHttpClient
  // which converts to a tensorstore::internal_http::HttpRequest and submits it on
  // a tensorstore::Executor
  auto outcome = client.HeadBucket(head_bucket);
  if(!outcome.IsSuccess()) {
    auto & err = outcome.GetError();
    std::cerr << "Error: " << err.GetExceptionName() << ": " << err.GetMessage() << std::endl;
  } else {
    std::cout << "Success" << std::endl;
  }

The benefit of this approach is that most of the S3 Request logic such as

  • request construction
  • endpoint computation
  • xml response handling (maybe, haven't dug into the responses yet)

will be handled by the AWS C++ SDK, while the I/O would be performed by tensorstore's implementations. But this approach might be deemed too heavyweight in terms of code inclusion. I would like to know if there are concerns here before pursuing it further.

sjperkins avatar May 03 '24 13:05 sjperkins

Some incremental progress on the S3Client approach:

  • Overrides default SDK
    • HttpFactory to create a custom HttpClient
    • CustomHttpClient converts AWS HttpRequest into tensorstore HttpRequests for submission on tensorstore's curl implementation
    • Aws::IOStream objects are supplied with CordStreamBuf streambuf's, thereby constructing and receiving the underlying Cord's produced by tensorstore's Http stack.
    • Conversely, converts tensorstore HttpResponses to AWS HttpResponses
    • S3Client can be configured to use an Executor wrapping a tensorstore executor
  • Some basic test cases interacting with localstack that demonstrate use of S3Client with Sync and Async Request styles

Further investigation required:

  • ~~SDK uses Aws::IOStream for Http payloads and there's expensive conversion between Cord and Aws::IOStream. Aws::IOStream should probably be modified to encapsulate an abseil::Cord~~.
  • CustomHttpClient waits for a tensorstore Future to complete. Ideally, I would like to find some way to yield to the calling tensorstore Executor, but I'm not sure if this is possible. I'll mark the relevant location.

sjperkins avatar May 10 '24 16:05 sjperkins

I started looking at this again yesterday. Based on the HEAD repositories in github it appears that we need the following libraries imported:

"https://github.com/awslabs/aws-c-http",
"https://github.com/awslabs/aws-c-io",
"https://github.com/awslabs/aws-c-auth",
"https://github.com/awslabs/aws-c-mqtt",
"https://github.com/awslabs/aws-c-cal",
"https://github.com/awslabs/aws-c-compression",
"https://github.com/awslabs/aws-checksums",
"https://github.com/awslabs/aws-c-sdkutils",
"https://github.com/awslabs/aws-c-common",
"https://github.com/awslabs/aws-c-event-stream",
"https://github.com/awslabs/aws-c-s3",
"https://github.com/awslabs/aws-crt-cpp",
"https://github.com/aws/s2n-tls",

And then the C++ sdk has a bunch of sub-libraries.

"https://github.com/aws/aws-sdk-cpp",

"aws-cpp-sdk-core",
"aws-cpp-sdk-s3",

laramiel avatar Jun 14 '24 20:06 laramiel

Thanks, I'll take a look.

sjperkins avatar Jun 17 '24 20:06 sjperkins

"https://github.com/awslabs/aws-c-http",
"https://github.com/awslabs/aws-c-io",
"https://github.com/awslabs/aws-c-auth",
"https://github.com/awslabs/aws-c-mqtt",
"https://github.com/awslabs/aws-c-cal",
"https://github.com/awslabs/aws-c-compression",
"https://github.com/awslabs/aws-checksums",
"https://github.com/awslabs/aws-c-sdkutils",
"https://github.com/awslabs/aws-c-common",
"https://github.com/awslabs/aws-c-event-stream",
"https://github.com/awslabs/aws-c-s3",
"https://github.com/awslabs/aws-crt-cpp",
"https://github.com/aws/s2n-tls",
"https://github.com/aws/aws-sdk-cpp",

"aws-cpp-sdk-core",
"aws-cpp-sdk-s3",

Given that the above are all built with cmake, it may be worth using https://github.com/bazelbuild/rules_foreign_cc to integrate them into bazel, even though rules_foreign_cc isn't officially supported. Let me know if this isn't feasible for some reason.

sjperkins avatar Jun 18 '24 14:06 sjperkins

I don't think that we want rules_foreign_cc; internally we will need to build with bazel anyway.

laramiel avatar Jun 24 '24 16:06 laramiel

I don't think that we want rules_foreign_cc; internally we will need to build with bazel anyway.

I see. Based on a brief look through aws-c-common HEAD (only), one concern was generating the config.h.in file

  • https://github.com/awslabs/aws-c-common/blob/main/include/aws/common/config.h.in

In a strict sense, it seems that a pure bazel approach would need to customise the above for combinations of different platforms and architectures, for each dependency, perhaps through the use of write_file and @//platforms as done here: https://github.com/google/tensorstore/pull/149/files#diff-cee7f961ceb075b160fbdd37a847f8f68be0872f932924267589a78d4dbe8c57R175-R203

This seems doable for one repo. One concern is that a trial and error approach to this might introduce subtle bugs into the SDK build (e.g. builds with newer instructions sets, although these could be disabled for the sake of safety).

However, I'm probably overthinking this. Did the configs of the above dependencies require extensive customisation in the bazel build mentioned here?

  • https://github.com/aws/aws-sdk-cpp/issues/3002#issuecomment-2181754016

sjperkins avatar Jun 24 '24 17:06 sjperkins

Yes. There are only, I think, about 2 config.h.in style files in the aws repositories that we need, this one in aws_c_common and another in aws_cpp_sdk/aws_cpp_sdk_core or something.

I think that the starting point for c-common is something like

# Configured for Linux x86_64.
_SUBS = {
    "#cmakedefine AWS_HAVE_GCC_OVERFLOW_MATH_EXTENSIONS": "#define AWS_HAVE_GCC_OVERFLOW_MATH_EXTENSIONS",
    "#cmakedefine AWS_HAVE_GCC_INLINE_ASM": "#define AWS_HAVE_GCC_INLINE_ASM",
    "#cmakedefine AWS_HAVE_MSVC_INTRINSICS_X64": "",
    "#cmakedefine AWS_HAVE_POSIX_LARGE_FILE_SUPPORT": "#define AWS_HAVE_POSIX_LARGE_FILE_SUPPORT",
    "#cmakedefine AWS_HAVE_EXECINFO": "#define AWS_HAVE_EXECINFO",
    "#cmakedefine AWS_HAVE_WINAPI_DESKTOP": "",
    "#cmakedefine AWS_HAVE_LINUX_IF_LINK_H": "#define AWS_HAVE_LINUX_IF_LINK_H",
    "#cmakedefine AWS_HAVE_AVX2_INTRINSICS": "#define AWS_HAVE_AVX2_INTRINSICS",
    "#cmakedefine AWS_HAVE_AVX512_INTRINSICS": "#define AWS_HAVE_AVX512_INTRINSICS",
    "#cmakedefine AWS_HAVE_MM256_EXTRACT_EPI64": "#define AWS_HAVE_MM256_EXTRACT_EPI64",
    "#cmakedefine AWS_HAVE_CLMUL": "#define AWS_HAVE_CLMUL",
    "#cmakedefine AWS_HAVE_ARM32_CRC": "",
    "#cmakedefine AWS_HAVE_ARMv8_1": "",
    "#cmakedefine AWS_ARCH_ARM64": "",
    "#cmakedefine AWS_ARCH_INTEL": "#define AWS_ARCH_INTEL",
    "#cmakedefine AWS_ARCH_INTEL_X64": "#define AWS_ARCH_INTEL_X64",
    "#cmakedefine AWS_USE_CPU_EXTENSIONS": "#define AWS_USE_CPU_EXTENSIONS",
}

But we could probably just write the config file directly, as we do in se_curl

Or we could specialize the individual cmake lines like we do in jpeg

Or maybe we pre-generate files, like we do in cares, although the per-system #cmakedefine exists there as well.

... for most of the builds where I had to generate this, I mostly ran the cmake in windows, linux, and on my arm-mac, and did a 3-way merge to see what was common and what was different. Thinking about it now, I think that there is enough commonality here that we could have a library for it.

laramiel avatar Jun 25 '24 02:06 laramiel

... it might be nice to contribute aws rules to something like the bazel-central repository: https://github.com/bazelbuild/bazel-central-registry/tree/main/modules

But that happened after we started, so we haven't done that with our other modules, and we use the old-style workspace approach.

laramiel avatar Jun 25 '24 02:06 laramiel

The build now includes more recent versions of the AWS C, CRT and C++ SDKs.

FWIW the s3_sdk localstack test cases pass locally on linux, but this PR hasn't been tested on Windows or Mac. Unfortunately, I may not have extra bandwidth (or easy access to hardware) for these environments. In these cases, best-guesses have been applied in the BUILD scripts, but require further scrutiny. Might it be possible to try this out on your end and iron out the differences in review?

There's also probably scope to use the aws-cpp-sdk-s3-crt for improved performance, instead of the original aws-cpp-sdk-s3.

  • https://aws.amazon.com/blogs/storage/improving-amazon-s3-throughput-for-the-aws-cli-and-boto3-with-the-aws-common-runtime/

sjperkins avatar Jul 03 '24 11:07 sjperkins

I've been getting internal imports of these aws c libraries; I wonder if we should not include the aws_cpp_sdk libraries at all, and just limit the use to the aws_crt_cpp libraries; it feels like the aws-cpp-sdk-s3-crt library belongs in aws_crt_cpp...

Mostly, I think, we want the aws_c_auth capabilities to sign requests.

laramiel avatar Jul 09 '24 22:07 laramiel

I wonder if we should not include the aws_cpp_sdk libraries at all, and just limit the use to the aws_crt_cpp libraries; it feels like the aws-cpp-sdk-s3-crt library belongs in aws_crt_cpp...

I think aws-cpp-sdk-s3-crt has a dependency on aws-cpp-sdk-core so it's still seems to be part of the AWS C++ SDK library,

https://github.com/aws/aws-sdk-cpp/blob/77b525c975bb2e9840661d4f5beaf5171c8c3673/generated/src/aws-cpp-sdk-s3-crt/include/aws/s3-crt/S3CrtClient.h#L7-L22

and aws-cpp-sdk-core has a dependency on the aws-cpp-crt

https://github.com/aws/aws-sdk-cpp/blob/77b525c975bb2e9840661d4f5beaf5171c8c3673/src/aws-cpp-sdk-core/include/aws/core/Aws.h#L16-L17

Mostly, I think, we want the aws_c_auth capabilities to sign requests.

Without having looked at the internals of aws_c_auth, this seems plausible.

The benefit of using aws-cpp-sdk-s3-crt would be depending on reference implementations of:

  • retry logic
  • xml response handling
  • endpoint setup
  • maybe other encryption-related functionality like AWS KMS

which may be preferable to maintaining tensorstore implementations of the above.

Which approach do you favour?

sjperkins avatar Jul 10 '24 07:07 sjperkins

I'm travelling in July but I'll try find space to swap the S3CrtClient into the test cases.

sjperkins avatar Jul 10 '24 07:07 sjperkins

I'm currently getting these aws libraries up to aws-crt-cpp imported into our internal build system. I think that I'd like to start with the aws-crt-cpp / aws_c_s3 libraries.

The aws-cpp-sdk-s3-crt library seems to be built on the aws-c-s3 + aws-crt-cpp along with autogenerated mechanisms to parse the xml responses (in the model directory). However it's likely that the raw aws-c-s3 libraries will have most of the official retry, etc. done without the xml response parsing, and I'd like to target that as a first attempt.

To that end I may import the third_party bits that you have here. If you had time to separate that into a distinct pull request it would be great.

laramiel avatar Jul 10 '24 17:07 laramiel

I'm currently getting these aws libraries up to aws-crt-cpp imported into our internal build system. I think that I'd like to start with the aws-crt-cpp / aws_c_s3 libraries.

The aws-cpp-sdk-s3-crt library seems to be built on the aws-c-s3 + aws-crt-cpp along with autogenerated mechanisms to parse the xml responses (in the model directory). However it's likely that the raw aws-c-s3 libraries will have most of the official retry, etc. done without the xml response parsing, and I'd like to target that as a first attempt.

To that end I may import the third_party bits that you have here. If you had time to separate that into a distinct pull request it would be great.

Sure. I'll aim to make that available by the end of this week,

sjperkins avatar Jul 10 '24 17:07 sjperkins

... I actually plan on getting the third_party working since it needs to work for windows / mac. So just stay tuned.

Basic changes that I've made:

All the aws libraries now start with aws_... rather than github_com_aws...

The BUILD files are all slightly incorrect. Most of them use something like srcs = glob([ "includes/**/*.h", "sources/**/*.c" ]), however the "public" .h files should be globbed into hdrs = ...

Then there are some other changes to make aws_c_common / aws_c_io work based on the CMakeLists.txt files for those repositories.

laramiel avatar Jul 10 '24 21:07 laramiel

... I actually plan on getting the third_party working since it needs to work for windows / mac. So just stay tuned.

OK, just to overcommunicate: I take this to mean to hold off on the aws third_party PR. Let me know if I'm misinterpreting, otherwise I'll wait on your changes :-)

However it's likely that the raw aws-c-s3 libraries will have most of the official retry, etc. done without the xml response parsing, and I'd like to target that as a first attempt.

I had a closer look at aws-c-s3. It does advertise retry logic and there's also endpoint resolution code. Given the reduced dependency footprint I could see why this would be an appealing direction to take.

Edit: It's also got some xml response parsing: https://github.com/awslabs/aws-c-s3/blob/51e24febe7542466a998a1e8fee7ac3785084f55/source/s3_list_parts.c#L57-L160

sjperkins avatar Jul 11 '24 07:07 sjperkins

Yes; I published my WIP here for now: https://github.com/google/tensorstore/pull/177

I think that it's pretty close.

laramiel avatar Jul 11 '24 22:07 laramiel

Thanks, that build looks more comprehensive. Regarding the aws-c-s3 approach, I could try a PR on this when I return at the start of August. If that timeline is too far out for any deliverables, please don't wait on me :-)

I'll aim to integrate those libraries into this branch at about the same time.

sjperkins avatar Jul 14 '24 15:07 sjperkins

I don't have a timeline for updating s3; I work on it in between other things, somewhat like you do.

Anyway, some pointers / ideas.

The aws-c-s3 github repository has an s3 sample which covers most of what we need to do with s3: reading and listing. https://github.com/awslabs/aws-c-s3/blob/main/samples/s3/

laramiel avatar Jul 15 '24 17:07 laramiel

I could try a PR on this when I return at the start of August.

Apologies, I haven't had time to get to this due to time off sick and conferences. Hoping to pick it up next week.

sjperkins avatar Aug 12 '24 07:08 sjperkins

S3 now supports conditional writes, although other implementations may not:

https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3-conditional-writes/

sjperkins avatar Aug 21 '24 05:08 sjperkins

Apologies, I haven't had time to get to this due to time off sick and conferences. Hoping to pick it up next week.

I need to get some releases out. I'm still interested, but I'm not sure when I'll have space in the near future.

sjperkins avatar Aug 29 '24 19:08 sjperkins