datacontract import --format glue
Hi,
I am currently developing a function to init a contract from a Glue table using this boto3 API: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/get_table.html. I would like your opinion on adding this feature to the CLI. It would look something like
datacontract import --format glue --database my_db --table my_table
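A minimal sketch of what such an importer could do with the `get_table` response. The response shape follows the boto3 documentation; the `glue_to_fields` helper name and the sample data are purely illustrative, not part of any actual implementation:

```python
# Hypothetical helper that turns a boto3 glue.get_table() response
# into a list of (name, type) field tuples for a data contract draft.
def glue_to_fields(get_table_response: dict) -> list[tuple[str, str]]:
    table = get_table_response["Table"]
    columns = table["StorageDescriptor"]["Columns"]
    # Partition keys are stored separately from the regular columns
    partitions = table.get("PartitionKeys", [])
    return [(c["Name"], c["Type"]) for c in columns + partitions]


# Sample response shaped like the boto3 docs describe
sample = {
    "Table": {
        "Name": "my_table",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "decimal(10,2)"},
            ]
        },
        "PartitionKeys": [{"Name": "dt", "Type": "date"}],
    }
}

print(glue_to_fields(sample))
# → [('order_id', 'string'), ('amount', 'decimal(10,2)'), ('dt', 'date')]
```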
Interesting!
Supporting Data Catalogs (Dataplex, Glue Catalog, Purview, Unity Catalog, Hive, DataHub, Collibra) is definitely something on the roadmap.
Data contracts might become the source of metadata, and it will be natural to publish this information to technical data catalogs.
I am not quite sure, how important is import from catalogs vs. export/publish/push to catalogs and what will be the typical developer flows (contract-first vs. data-first). What do you think?
Yes, I agree for the export phase. However, in my view, numerous tables currently exist without a data contract, and introducing the import function would facilitate adopting data contracts on those 'legacy' tables.
I second this @saugerDecathlon and I'd love to see a PR to add this import. :-)
I can help with this. I'm working on a project where the source is provided via Lake Formation cross-account sharing to the ETL accounts. I already have some scripts to extract the table config from Glue for Parquet- and Iceberg-type tables.
This in combination with the export to dbt would speed up the pipeline development.
Great thank you @jverhoeks. This feature would be really helpful in my use cases
Did you have time to work on the feature @jverhoeks, by any chance? Otherwise I think I can push a PR on this topic by the end of the week.
Hi, I created the code to extract the info but was checking how to implement this in datacontract-cli. This is simple code to extract the columns and Hive partitions, tested on Hive with Parquet/JSON/CSV and Iceberg tables.
I don't know how you want to implement this. I was thinking of creating an engine "glue" and adding some common functions there, but if you have time you can implement it.
import re
import sys

import boto3

glue = boto3.client("glue")

# Get the parameters
database_name = sys.argv[1] if len(sys.argv) > 1 else None
table_regex = sys.argv[2] if len(sys.argv) > 2 else ".+"

# If no database name is provided, list all databases and exit
if database_name is None:
    response = glue.get_databases()
    for database in response["DatabaseList"]:
        print(database["Name"])
    sys.exit()

# Get the tables, failing gracefully if the database does not exist
try:
    response = glue.get_tables(DatabaseName=database_name)
except glue.exceptions.EntityNotFoundException:
    sys.exit(f"Database not found: {database_name}")

# Create a list of tables that match the regex
matching_tables = [
    table["Name"]
    for table in response["TableList"]
    if re.match(table_regex, table["Name"])
]

# Iterate through the tables
for table in matching_tables:
    # Get the table schema, skipping tables that cannot be read
    try:
        response = glue.get_table(DatabaseName=database_name, Name=table)
    except glue.exceptions.EntityNotFoundException:
        print(f"Table not found: {table}", file=sys.stderr)
        continue
    table_schema = response["Table"]["StorageDescriptor"]["Columns"]

    # Append the partition keys as extra columns
    for pk in response["Table"].get("PartitionKeys") or []:
        table_schema.append(
            {"Name": pk["Name"], "Type": pk["Type"], "Comment": "Partition Key"}
        )

    # Print one CSV line per column
    for column in table_schema:
        print(
            "{},{},{},{},{}".format(
                database_name,
                table,
                column["Name"],
                column["Type"],
                column.get("Comment"),
            )
        )
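One missing piece when wiring this into datacontract-cli is mapping the Glue/Hive type strings onto data contract field types. A rough sketch of how that could look; the mapping table and function name here are my assumptions, not what the importer necessarily does:

```python
import re

# Assumed mapping from common Glue/Hive types to data contract types;
# the actual importer may use different target names.
GLUE_TO_CONTRACT = {
    "string": "string",
    "int": "integer",
    "bigint": "long",
    "double": "double",
    "float": "float",
    "boolean": "boolean",
    "date": "date",
    "timestamp": "timestamp",
}


def map_glue_type(glue_type: str) -> str:
    # decimal(10,2), varchar(255), etc. -> strip the precision suffix
    base = re.sub(r"\(.*\)$", "", glue_type.strip().lower())
    if base == "decimal":
        return "decimal"
    # Fall back to string for unknown types
    return GLUE_TO_CONTRACT.get(base, "string")


print(map_glue_type("BIGINT"))         # → long
print(map_glue_type("decimal(10,2)"))  # → decimal
print(map_glue_type("varchar(255)"))   # → string
```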
First version: https://github.com/jverhoeks/datacontract-cli/blob/122-import-glue/datacontract/imports/glue_importer.py
I need to align the formatting; I use black by default.
CLI uses ruff
ruff format
I have created the PR with some test code using Moto to mock the AWS Glue API
https://github.com/datacontract/datacontract-cli/pull/166
@saugerDecathlon Could you review #166 if this fits your needs? I'd then close this issue. Feel free to open a new issue for details.
Thanks a lot @jverhoeks. I will test this today @jochenchrist
I have tested it. It works great (I just had to run "aws configure set default.region").
I think it would be great to add an input parameter to filter on one table or a list of tables for the contract generation.
datacontract import --format glue --source my_database --tables [table1, table2]
What do you think? Should I open another issue for that?
OK, let's do it. Do you have some capacity to contribute? We can keep this ticket open.
Yes I will try to find some time this week or the next one
With PR #230 this feature was added, thanks to @jpraetorius and is included in v0.10.6
@saugerDecathlon Could you test this feature, so we can close this issue then?
@saugerDecathlon did you have the chance to take a look already?
Close as completed. Please open a new issue if there are any bugs.