
[FEATURE]: Regression testing harness for post-migration validation for managed tables (Data Assets)

Open chase-edwards-db opened this issue 2 years ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Problem statement

As a customer, I find that much of the risk and work in a large migration lies in testing and validation. There is currently no opinionated pattern from Databricks on how to do regression testing. UC migrations are high-risk, which makes them a good impetus to build one.

Proposed Solution

The feature should:

  • Account for any data assets being migrated
  • When run, validate post-migration datasets against their pre-migration counterparts. This includes all data assets (managed tables, external tables, views, volumes, ...). See https://github.com/databrickslabs/ucx/issues/906
  • Show a view of the outcome of the migration, including success metrics, e.g. "% of tables/views/etc. successfully migrated" and "# of tables migrated". A regression can be classified as either a dataset mismatch or a data asset migration failure. Migration failures can be detected by capturing failures from the data asset migration tooling in UCX (pending). Dataset mismatches can be detected by comparing each migrated table against its source to ensure that data and metadata are identical.
  • Provide a useful changelog of the data and asset changes made, similar to a system table, that can be analyzed further. Example rough schema (includes many optional fields, can be split into denormalized asset-specific tables):

asset_type | hive_database_name | hive_table_name | dbfs_path | uc_metastore_name | uc_schema_name | uc_table_name | view_name | volume_path | migrated_flag | % data migrated
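
To make the proposal above more concrete, here is a minimal Python sketch of how one changelog row and the summary metrics might be computed. Everything in it is an assumption for illustration, not existing UCX code: the names `AssetResult`, `compare_asset`, and `summarize` are hypothetical, rows are modeled as in-memory tuples, and "% data migrated" is interpreted as the share of source rows that also appear in the target. In a real workspace the comparison would more likely run in Spark (for example with `DataFrame.exceptAll`) rather than on Python lists.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Row = Tuple[Hashable, ...]
Schema = Sequence[Tuple[str, str]]  # (column name, type name) pairs


@dataclass
class AssetResult:
    """One changelog row (hypothetical subset of the rough schema above)."""
    asset_type: str
    source_name: str
    target_name: str
    migrated_flag: bool
    pct_data_migrated: float
    status: str  # "ok", "dataset_mismatch" or "migration_failure"


def compare_asset(
    asset_type: str,
    source_name: str,
    target_name: str,
    source_schema: Schema,
    source_rows: Sequence[Row],
    target_schema: Optional[Schema] = None,
    target_rows: Optional[Sequence[Row]] = None,
) -> AssetResult:
    """Classify one asset by comparing metadata and data, ignoring row order."""
    # Assumption: a missing target means the migration tooling failed.
    if target_schema is None or target_rows is None:
        return AssetResult(asset_type, source_name, target_name,
                           False, 0.0, "migration_failure")

    # Multiset comparison so duplicate rows are counted, not collapsed.
    src, tgt = Counter(source_rows), Counter(target_rows)
    total = sum(src.values())
    matched = sum((src & tgt).values())  # rows present on both sides
    pct = 100.0 if total == 0 else 100.0 * matched / total

    identical = list(source_schema) == list(target_schema) and src == tgt
    status = "ok" if identical else "dataset_mismatch"
    return AssetResult(asset_type, source_name, target_name, True, pct, status)


def summarize(results: List[AssetResult]) -> Dict[str, object]:
    """Aggregate per-asset results into migration success metrics."""
    total = len(results)
    migrated = sum(r.migrated_flag for r in results)
    return {
        "assets_total": total,
        "assets_migrated": migrated,
        "pct_assets_migrated": 100.0 * migrated / total if total else 0.0,
        "by_status": dict(Counter(r.status for r in results)),
    }
```

Keeping "migrated" and "ok" as separate notions mirrors the two regression classes in the proposal: an asset can be migrated and still count as a dataset mismatch.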

Additional Context

If any amount of rollback is possible in these kinds of migrations, that should also be considered in this feature's design.

chase-edwards-db • Feb 08 '24 16:02

@chase-edwards-db Please split this issue into data and code assets. Provide way deeper details of what constitutes a regression, how you think it's detected, how it's measured, etc

nfx • Feb 08 '24 16:02

@nfx updated with specifics. Will break workloads out into distinct PR. Let me know if more details are needed.

chase-edwards-db • Mar 26 '24 11:03