iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

feat: support S3 Table Buckets with S3TablesCatalog

Open felixscherz opened this issue 1 year ago • 16 comments

Hi, this is in regards to #1404.

I created a first draft of an S3TablesCatalog that uses the S3 Table Bucket API for catalog operations.

How to run tests

Since moto does not support mocking the S3 Tables API yet (WIP: https://github.com/getmoto/moto/pull/8470) we have to run tests against a live AWS account. To do that, create an S3 Tables Bucket in one of the supported regions and then set the table bucket ARN and AWS Region as environment variables

AWS_REGION=us-east-2 AWS_TEST_S3_BUCKET_ARN=... poetry run pytest tests/catalog/integration_test_s3tables.py

felixscherz avatar Dec 14 '24 14:12 felixscherz

I was able to work around the issue above by using FsspecFileIO instead of the default PyarrowFileIO. Using FsspecFileIO the catalog is now able to create new tables.

felixscherz avatar Dec 14 '24 16:12 felixscherz

Thanks for working on this @felixscherz Feel free to tag me when its ready for review :)

kevinjqliu avatar Dec 20 '24 17:12 kevinjqliu

I think you can now review this PR if you have time @kevinjqliu :) The biggest issue for now will be that testing is only possible against AWS itself since moto does not support the s3tables API yet. I created an issue on the moto side but have not had the time to implement it myself https://github.com/getmoto/moto/issues/8422.

I currently run tests by setting the ARN env variable to that of an s3 table bucket I created within my personal AWS account:

https://github.com/felixscherz/iceberg-python/blob/feat/s3tables-catalog/tests/catalog/test_s3tables.py#L24-L31

felixscherz avatar Dec 29 '24 15:12 felixscherz

I ran the tests locally ARN=arn:aws:s3tables:us-east-2:... poetry run pytest tests/catalog/test_s3tables.py had to manually add s3tables.region to the catalog config

    properties = {"s3tables.table-bucket-arn": table_bucket_arn, "s3tables.region": "us-east-2", "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

And these 3 testa failed, everything else is ✅

FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_conflicting_version_tokens - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_preexisting_table - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_creating_catalog_validates_s3_table_bucket_exists - botocore.exceptions.NoRegionError: You must specify a region.

kevinjqliu avatar Jan 04 '25 00:01 kevinjqliu

Thank you for the review!

I removed tests related to boto3 and set the AWS region explicitly for the test run. I agree with you that we should not merge this as long as the only option for running tests is to run them against a live AWS account. I'm currently working on supporting s3tables with moto: https://github.com/getmoto/moto/pull/8470. We can hold off on this PR until moto can support s3tables so we can run tests against a mock endpoint?

felixscherz avatar Jan 05 '25 16:01 felixscherz

can you run poetry lock --no-update for CI?

kevinjqliu avatar Jan 08 '25 04:01 kevinjqliu

@felixscherz could you rebase this against main?

i see that getmoto/moto/8470 is now merged, thanks for driving that! waiting for https://github.com/getmoto/moto/commit/4f565fb93a82a9f34fcb892de3d4098663cd12e9 to make it to the next release (5.0.28)

kevinjqliu avatar Feb 01 '25 00:02 kevinjqliu

I rebased onto the main. I prepared the unit tests using the new moto features and I will commit them once the new moto release is available:)

felixscherz avatar Feb 01 '25 15:02 felixscherz

@kevinjqliu moto==5.0.28 just got released, I added unit tests and documentation, could you have a look when you have time?

felixscherz avatar Feb 03 '25 08:02 felixscherz

its green! i'll review this so we can include it in the upcoming 0.9.0 release

kevinjqliu avatar Feb 12 '25 20:02 kevinjqliu

@geruh would you have another look at this PR when you have time?

felixscherz avatar Mar 13 '25 21:03 felixscherz

hey @felixscherz thanks for following up on this PR. I haven't forgotten about this! :) I'll do another review soon

kevinjqliu avatar Mar 14 '25 18:03 kevinjqliu

I believe this PR is no longer relevant since AWS started support the REST Catalog: https://github.com/apache/iceberg-python/issues/1404#issuecomment-2771891069 :)

felixscherz avatar Apr 09 '25 09:04 felixscherz

@felixscherz does this mean that nothing special is needed to use PyIceberg against an S3 Table bucket now? Thank you for your work on this -- our team has been waiting to use table buckets specifically until we could migrate some pyiceberg-based services.

mmadsen avatar Apr 09 '25 15:04 mmadsen

Yes! You use the Rest catalog and should be able to follow this guide: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-open-source.html. I haven't tested it myself however:)

felixscherz avatar Apr 09 '25 16:04 felixscherz

Hi- I think supporting the native s3table interface has merit; in particular because the rest path involves an extra hop (and involvement with the glue team in AWS)... and more tangibly, might suffer from issues like this one: https://github.com/apache/iceberg-python/issues/2511, which has had a major impact on our data flows for over a month. AWS confirmed that this was only a problem with rest / glue, whereas native s3table interactions would not have been impacted (and we're still waiting for the fix to be deployed). It would be great if we could just swap out a different catalog that interacts directly with s3tables. I'm a bit disappointed to find that https://github.com/apache/iceberg-python/issues/1404 was closed (and likely this PR abandoned) in favor of only supporting rest interactions. Can we revisit that decision? Thanks all 🙏

srstrickland avatar Oct 20 '25 18:10 srstrickland