[FEATURE]: Regression testing harness for post-migration validation of managed tables (Data Assets)
Is there an existing issue for this?
- [X] I have searched the existing issues
Problem statement
For a customer, much of the risk and work of a large migration lies in testing and validation. There is currently no opinionated pattern from Databricks on how to do regression testing. UC migrations provide an impetus to build one, as this type of migration carries high risk.
Proposed Solution
The feature should:
- Account for any data assets being migrated
- When run, validate post-migration datasets against their pre-migration counterparts. This includes all data assets (managed tables, external tables, views, volumes, ...). See https://github.com/databrickslabs/ucx/issues/906
- Show a view of the outcome of the migration, including success metrics, e.g. "% of tables/views/etc successfully migrated", "# of tables migrated". A regression can be classified as either a dataset mismatch or a data asset migration failure. Migration failures can be detected by capturing failures from the data asset migration tooling in UCX (pending). Dataset mismatches can be detected by comparing each pre-migration table against its post-migration counterpart to ensure that data and metadata are identical.
- Provide a useful changelog of data and asset changes that can be analyzed further, similar to a system table. Example rough schema (includes many optional fields, can be split into denormalized asset-specific tables):
asset_type | hive_database_name | hive_table_name | dbfs_path | uc_metastore_name | uc_schema_name | uc_table_name | view_name | volume_path | migrated_flag | % data migrated
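To make the regression classification and the changelog above more concrete, here is a rough sketch in plain Python. It is not UCX code: `ChangelogRecord`, `compare_datasets`, `classify`, and `summarize` are hypothetical names, `pct_data_migrated` stands in for the "% data migrated" column, and rows are modeled as in-memory tuples rather than Spark DataFrames. It assumes "identical" means the same ordered schema and the same multiset of rows.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Schema = Sequence[Tuple[str, str]]  # (column name, type) pairs; an assumed shape
Row = Tuple                         # one row as a hashable tuple


@dataclass
class ChangelogRecord:
    """One changelog row; fields mirror a subset of the rough schema above."""
    asset_type: str
    hive_database_name: Optional[str] = None
    hive_table_name: Optional[str] = None
    uc_schema_name: Optional[str] = None
    uc_table_name: Optional[str] = None
    migrated_flag: bool = False
    pct_data_migrated: float = 0.0      # hypothetical name for "% data migrated"
    regression: Optional[str] = None    # "migration_failure", "dataset_mismatch", or None


def compare_datasets(pre_schema: Schema, pre_rows: Sequence[Row],
                     post_schema: Schema, post_rows: Sequence[Row]) -> Tuple[bool, float]:
    """Return (identical, percentage of pre-migration rows found post-migration)."""
    pre_counts, post_counts = Counter(pre_rows), Counter(post_rows)
    # Multiset intersection counts rows present on both sides, respecting duplicates.
    matched = sum((pre_counts & post_counts).values())
    total = sum(pre_counts.values())
    pct = 100.0 if total == 0 else 100.0 * matched / total
    identical = list(pre_schema) == list(post_schema) and pre_counts == post_counts
    return identical, pct


def classify(migration_failed: bool, identical: bool) -> Optional[str]:
    """Map an outcome onto the two regression classes named in this issue."""
    if migration_failed:
        return "migration_failure"
    if not identical:
        return "dataset_mismatch"
    return None


def summarize(records: List[ChangelogRecord]) -> Dict[str, float]:
    """Compute "# migrated" and "% successfully migrated" over all records."""
    total = len(records)
    ok = sum(1 for r in records if r.migrated_flag and r.regression is None)
    return {
        "assets_total": total,
        "assets_migrated": ok,
        "pct_migrated": 100.0 * ok / total if total else 0.0,
    }


# Example: one table that lost a row during migration.
schema = [("id", "int"), ("name", "string")]
identical, pct = compare_datasets(schema, [(1, "a"), (2, "b")], schema, [(1, "a")])
record = ChangelogRecord(
    asset_type="managed_table",
    hive_database_name="sales",      # made-up example names
    hive_table_name="orders",
    uc_schema_name="sales",
    uc_table_name="orders",
    migrated_flag=True,
    pct_data_migrated=pct,
    regression=classify(migration_failed=False, identical=identical),
)
summary = summarize([record])
```

In a real harness the row comparison would more likely run in Spark or via checksums rather than by collecting rows into memory. The point here is only the shape of the classification and the metrics.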
Additional Context
If any amount of rollback is possible in these kinds of migrations, that should also be considered in this feature's design.
@chase-edwards-db Please split this issue into data and code assets. Provide way deeper details of what constitutes a regression, how you think it's detected, how it's measured, etc
@nfx updated with specifics. Will break workloads out into distinct PRs. Let me know if more details are needed.