datacontract-cli datacontract import --format glue

Hi,

I am currently developing a function to initialize a contract from a Glue table using this boto3 API: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/get_table.html. I would like your opinion on adding this feature to the CLI. It would look something like this:

datacontract import --format glue --database my_db --table my_table

Mar 28 '24 21:03 SimonAuger

Interesting!

Supporting Data Catalogs (Dataplex, Glue Catalog, Purview, Unity Catalog, Hive, DataHub, Collibra) is definitely something on the roadmap.

Data contracts might become the source of metadata, and it will be natural to publish this information to technical data catalogs.

I am not quite sure how important import from catalogs is compared with export/publish/push to catalogs, or what the typical developer flows will be (contract-first vs. data-first). What do you think?

Mar 28 '24 21:03 jochenchrist

Yes, I agree about the export phase. However, in my view, numerous tables currently exist without a data contract, and an import function would make it easier to introduce data contracts for those 'legacy' tables.

Mar 29 '24 08:03 SimonAuger

I second this @saugerDecathlon and I'd love to see a PR to add this import. :-)

Apr 02 '24 08:04 simonharrer

I can help with this. I am working on a project where the source is provided by Lake Formation cross-account sharing to the ETL accounts. I already have some scripts to extract the table config from Glue for Parquet and Iceberg tables.

Combined with the export to dbt, this would speed up pipeline development.

Apr 18 '24 14:04 jverhoeks

Great, thank you @jverhoeks. This feature would be really helpful for my use cases.

Apr 18 '24 14:04 SimonAuger

Did you have time to work on the feature by any chance, @jverhoeks? Otherwise, I think I can push a PR on this topic by the end of the week.

Apr 24 '24 13:04 SimonAuger

Hi, I wrote the code to extract the info but was still checking how to integrate it into datacontract-cli. Below is a simple script that extracts the columns and Hive partitions. It is tested on Hive tables with Parquet/JSON/CSV and on Iceberg tables.

I'm not sure how you want to implement this. I was thinking of creating a "glue" engine and adding some common functions there, but if you have time, feel free to implement it.

import boto3
import sys
import re

glue = boto3.client("glue")

# Get the parameters
database_name = sys.argv[1] if len(sys.argv) > 1 else None
table_regex = sys.argv[2] if len(sys.argv) > 2 else ".+"

# If no database name is provided, list all databases
if database_name is None:
    response = glue.get_databases()
    for database in response["DatabaseList"]:
        print(database["Name"])
    sys.exit()

# Get the tables; get_tables is paginated, so walk every page
# todo catch exception if database does not exist
paginator = glue.get_paginator("get_tables")
table_names = [
    table["Name"]
    for page in paginator.paginate(DatabaseName=database_name)
    for table in page["TableList"]
]

# Create a list of tables that match the regex
matching_tables = [name for name in table_names if re.match(table_regex, name)]

# Iterate through the tables
for table in matching_tables:
    # Get the table schema
    # todo catch exception if error
    response = glue.get_table(DatabaseName=database_name, Name=table)
    table_schema = response["Table"]["StorageDescriptor"]["Columns"]

    # Append the partition keys so they are listed with the regular columns
    for pk in response["Table"].get("PartitionKeys", []):
        table_schema.append(
            {"Name": pk["Name"], "Type": pk["Type"], "Comment": "Partition Key"}
        )

    for column in table_schema:
        print(
            "{},{},{},{},{}".format(
                database_name,
                table,
                column["Name"],
                column["Type"],
                column.get("Comment"),
            )
        )
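
To illustrate how the extracted columns could feed a contract, here is a minimal sketch that turns the "Table" part of a get_table response into a model-like dictionary. The function name glue_table_to_model, the output layout, and the partition_key marker are assumptions made for illustration; they are not taken from datacontract-cli or from the linked branch.

```python
from typing import Any


def glue_table_to_model(table: dict[str, Any]) -> dict[str, Any]:
    """Build a model-like dict from the "Table" part of a Glue get_table response.

    The output layout is a hypothetical example, not the structure that
    datacontract-cli actually writes.
    """
    partition_keys = table.get("PartitionKeys", [])
    partition_names = {pk["Name"] for pk in partition_keys}

    # Copy the column list so the caller's response is not mutated.
    columns = list(table.get("StorageDescriptor", {}).get("Columns", []))
    columns += partition_keys

    fields: dict[str, Any] = {}
    for column in columns:
        field: dict[str, Any] = {"type": column["Type"]}
        if column.get("Comment"):
            field["description"] = column["Comment"]
        if column["Name"] in partition_names:
            field["partition_key"] = True  # hypothetical marker
        fields[column["Name"]] = field

    return {"description": table.get("Description", ""), "fields": fields}


# Hand-written sample shaped like a get_table response; no AWS call is made.
sample_table = {
    "Name": "orders",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "bigint", "Comment": "Primary identifier"},
            {"Name": "amount", "Type": "double"},
        ]
    },
    "PartitionKeys": [{"Name": "dt", "Type": "string"}],
}
model = glue_table_to_model(sample_table)
```

Keeping the conversion free of boto3 calls means it can be exercised with plain dictionaries, while the lookup itself stays in a thin wrapper around get_table.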

Apr 24 '24 14:04 jverhoeks

First version: https://github.com/jverhoeks/datacontract-cli/blob/122-import-glue/datacontract/imports/glue_importer.py

I still need to align the formatting; I use black by default.

Apr 29 '24 16:04 jverhoeks

CLI uses ruff

ruff format

Apr 29 '24 16:04 jochenchrist

I have created the PR, with some test code that uses Moto to mock the AWS Glue API.
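
For readers unfamiliar with the idea of testing against a mocked Glue API, the following is a minimal sketch of the same principle. It does not reproduce the PR's tests and does not use Moto: it swaps in unittest.mock from the standard library so that it runs without AWS credentials or extra packages. The helper list_columns and the sample response are hypothetical.

```python
from unittest.mock import MagicMock


def list_columns(glue, database: str, table: str) -> list[tuple[str, str]]:
    """Return (name, type) pairs for a table, with partition keys last."""
    response = glue.get_table(DatabaseName=database, Name=table)
    definition = response["Table"]
    columns = definition["StorageDescriptor"]["Columns"] + definition.get(
        "PartitionKeys", []
    )
    return [(column["Name"], column["Type"]) for column in columns]


# Stand-in for boto3.client("glue"); only get_table is given a canned answer.
fake_glue = MagicMock()
fake_glue.get_table.return_value = {
    "Table": {
        "Name": "orders",
        "StorageDescriptor": {"Columns": [{"Name": "order_id", "Type": "bigint"}]},
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    }
}

columns = list_columns(fake_glue, "my_db", "orders")

# Verify that the helper asked Glue for the expected database and table.
fake_glue.get_table.assert_called_once_with(DatabaseName="my_db", Name="orders")
```

A hand-rolled fake only checks the calls you program into it; Moto emulates the service more fully, so it also covers creating databases and tables before reading them back.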

https://github.com/datacontract/datacontract-cli/pull/166

May 01 '24 20:05 jverhoeks

@saugerDecathlon Could you review #166 and check whether it fits your needs? I'd then close this issue. Feel free to open a new issue for details.

May 05 '24 08:05 jochenchrist

Thanks a lot @jverhoeks. I will test this today @jochenchrist

May 07 '24 07:05 SimonAuger

I have tested it and it works great (I just had to run "aws configure set default.region"). I think it would be great to add an input parameter to filter on one table or a list of tables for the contract generation:

datacontract import --format glue --source my_database --tables [table1, table2]

What do you think? Should I open another issue for that?
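
The proposed table filter could be sketched as follows. This is only an illustration of the request above: the function name select_tables and its error behavior are assumptions, and it says nothing about how the option was eventually implemented in the CLI.

```python
from typing import Optional


def select_tables(available: list[str], requested: Optional[list[str]] = None) -> list[str]:
    """Return the tables to import, keeping the order of `available`.

    With no requested tables, every available table is selected. Asking for
    a table that does not exist raises ValueError instead of being ignored.
    """
    if not requested:
        return list(available)

    missing = [name for name in requested if name not in available]
    if missing:
        raise ValueError(f"Tables not found in database: {', '.join(missing)}")

    wanted = set(requested)
    return [name for name in available if name in wanted]


# Hypothetical table names, as they might be listed for one Glue database.
all_tables = ["customers", "orders", "payments"]
selected = select_tables(all_tables, ["payments", "customers"])
```

Failing loudly on an unknown table name is a design choice for this sketch; silently skipping it would hide typos such as a misspelled table in the command line.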

May 07 '24 13:05 SimonAuger

OK, let's do it. Do you have some capacity to contribute? We can keep this ticket open.

May 07 '24 17:05 jochenchrist

Yes, I will try to find some time this week or next.

May 14 '24 13:05 SimonAuger

This feature was added in PR #230, thanks to @jpraetorius, and is included in v0.10.6.

@saugerDecathlon Could you test this feature, so we can close this issue then?

May 31 '24 05:05 jochenchrist

@saugerDecathlon Did you have a chance to take a look already?

Jun 12 '24 08:06 simonharrer

Closing as completed. Please open a new issue if there are any bugs.

Jul 01 '24 10:07 jochenchrist