feat: support S3 Table Buckets with S3TablesCatalog
Hi, this is in regards to #1404.
I created a first draft of an S3TablesCatalog that uses the S3 Table Bucket API for catalog operations.
How to run tests
Since moto does not support mocking the S3 Tables API yet (WIP: https://github.com/getmoto/moto/pull/8470) we have to run tests against a live AWS account. To do that, create an S3 Tables Bucket in one of the supported regions and then set the table bucket ARN and AWS Region as environment variables
AWS_REGION=us-east-2 AWS_TEST_S3_BUCKET_ARN=... poetry run pytest tests/catalog/integration_test_s3tables.py
I was able to work around the issue above by using FsspecFileIO instead of the default PyarrowFileIO. Using FsspecFileIO the catalog is now able to create new tables.
Thanks for working on this @felixscherz Feel free to tag me when its ready for review :)
I think you can now review this PR if you have time @kevinjqliu :) The biggest issue for now will be that testing is only possible against AWS itself since moto does not support the s3tables API yet. I created an issue on the moto side but have not had the time to implement it myself https://github.com/getmoto/moto/issues/8422.
I currently run tests by setting the ARN env variable to that of an s3 table bucket I created within my personal AWS account:
https://github.com/felixscherz/iceberg-python/blob/feat/s3tables-catalog/tests/catalog/test_s3tables.py#L24-L31
I ran the tests locally ARN=arn:aws:s3tables:us-east-2:... poetry run pytest tests/catalog/test_s3tables.py
had to manually add s3tables.region to the catalog config
properties = {"s3tables.table-bucket-arn": table_bucket_arn, "s3tables.region": "us-east-2", "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}
And these 3 testa failed, everything else is ✅
FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_conflicting_version_tokens - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_preexisting_table - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_creating_catalog_validates_s3_table_bucket_exists - botocore.exceptions.NoRegionError: You must specify a region.
Thank you for the review!
I removed tests related to boto3 and set the AWS region explicitly for the test run. I agree with you that we should not merge this as long as the only option for running tests is to run them against a live AWS account. I'm currently working on supporting s3tables with moto: https://github.com/getmoto/moto/pull/8470. We can hold off on this PR until moto can support s3tables so we can run tests against a mock endpoint?
can you run poetry lock --no-update for CI?
@felixscherz could you rebase this against main?
i see that getmoto/moto/8470 is now merged, thanks for driving that! waiting for https://github.com/getmoto/moto/commit/4f565fb93a82a9f34fcb892de3d4098663cd12e9 to make it to the next release (5.0.28)
I rebased onto the main. I prepared the unit tests using the new moto features and I will commit them once the new moto release is available:)
@kevinjqliu moto==5.0.28 just got released, I added unit tests and documentation, could you have a look when you have time?
its green! i'll review this so we can include it in the upcoming 0.9.0 release
@geruh would you have another look at this PR when you have time?
hey @felixscherz thanks for following up on this PR. I haven't forgotten about this! :) I'll do another review soon
I believe this PR is no longer relevant since AWS started support the REST Catalog: https://github.com/apache/iceberg-python/issues/1404#issuecomment-2771891069 :)
@felixscherz does this mean that nothing special is needed to use PyIceberg against an S3 Table bucket now? Thank you for your work on this -- our team has been waiting to use table buckets specifically until we could migrate some pyiceberg-based services.
Yes! You use the Rest catalog and should be able to follow this guide: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-open-source.html. I haven't tested it myself however:)
Hi- I think supporting the native s3table interface has merit; in particular because the rest path involves an extra hop (and involvement with the glue team in AWS)... and more tangibly, might suffer from issues like this one: https://github.com/apache/iceberg-python/issues/2511, which has had a major impact on our data flows for over a month. AWS confirmed that this was only a problem with rest / glue, whereas native s3table interactions would not have been impacted (and we're still waiting for the fix to be deployed). It would be great if we could just swap out a different catalog that interacts directly with s3tables. I'm a bit disappointed to find that https://github.com/apache/iceberg-python/issues/1404 was closed (and likely this PR abandoned) in favor of only supporting rest interactions. Can we revisit that decision? Thanks all 🙏