
[BUG]: Glue table size estimates not showing for estimate_table_size_for_migration task

[Open] bwmann89 opened this issue 1 year ago · 5 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Table sizes for tables in the local Hive database are displayed instead of those for the external Glue catalog.

When checking the following table, only sizes for the databases under the root bucket are displayed:

select * from ucx_7050934800642846.table_size

Expected Behavior

The table sizes for all tables under the external Glue catalog should be displayed.

When running the following query, details for the Glue external tables are displayed, including their S3 bucket locations; however, the tables listed under ucx_7050934800642846.tables and ucx_7050934800642846.table_size do not match up:

select * from ucx_7050934800642846.tables
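To see the mismatch directly, a query along the following lines (a sketch; it assumes both inventory tables key on catalog, database, and name columns, as the UCX schema does) lists the tables that have an inventory row but no size estimate:

-- Tables inventoried by UCX that have no row in table_size.
-- Assumes catalog/database/name key columns exist in both tables.
SELECT t.catalog, t.database, t.name, t.location
FROM ucx_7050934800642846.tables AS t
LEFT ANTI JOIN ucx_7050934800642846.table_size AS s
  ON t.catalog = s.catalog
 AND t.database = s.database
 AND t.name = s.name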

Steps To Reproduce

Databricks E2 environment running UCX version 0.53.1. Run the UCX [assessment] job in an environment with many managed tables that live in a separate Glue catalog.

Cluster configuration:

{
  "cluster_name": "job-684732593059343-run-616459509104165-main",
  "spark_version": "15.4.x-scala2.12",
  "spark_conf": {
    "spark.hadoop.hive.metastore.glue.catalogid": "772614087260",
    "spark.databricks.hive.metastore.glueCatalog.enabled": "true",
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "aws_attributes": {
    "first_on_demand": 0,
    "availability": "ON_DEMAND",
    "zone_id": "auto",
    "instance_profile_arn": "arn:aws:iam::506921813928:instance-profile/delegatedadmin/developer/idrc-om-sysadmin-role",
    "spot_bid_price_percent": 100,
    "ebs_volume_count": 0
  },
  "node_type_id": "r5dn.xlarge",
  "driver_node_type_id": "r5dn.xlarge",
  "custom_tags": {
    "JobId": "684732593059343",
    "RunName": "UCX assessment",
    "ResourceClass": "SingleNode",
    "version": "v0.53.1"
  },
  "autotermination_minutes": 0,
  "enable_elastic_disk": false,
  "policy_id": "001E82A97513496A",
  "data_security_mode": "SINGLE_USER",
  "runtime_engine": "STANDARD",
  "num_workers": 0
}

Cloud

AWS

Operating System

Linux

Version

latest via Databricks CLI

Relevant log output

Beginning of output:
22:20:06  INFO [d.labs.ucx] UCX v0.53.1 After job finishes, see debug logs at /Workspace/Applications/ucx/logs/assessment/run-616459509104165-0/estimate_table_size_for_migration.log
22:20:06 DEBUG [d.l.u.framework.crawlers] [hive_metastore.ucx_7050934800642846.table_size] fetching table_size inventory
22:20:06 DEBUG [d.l.lsql.backends] [spark][fetch] SELECT * FROM `hive_metastore`.`ucx_7050934800642846`.`table_size`
22:20:07 DEBUG [d.l.u.framework.crawlers] [hive_metastore.ucx_7050934800642846.table_size] crawling new set of snapshot data for table_size
22:20:07 DEBUG [d.l.u.framework.crawlers] [hive_metastore.ucx_7050934800642846.tables] fetching tables inventory
22:20:07 DEBUG [d.l.lsql.backends] [spark][fetch] SELECT * FROM `hive_metastore`.`ucx_7050934800642846`.`tables`
22:20:09 DEBUG [d.l.blueprint.parallel] Starting 4 tasks in 8 threads
22:20:09 DEBUG [d.l.u.hive_metastore.table_size][DBFS_root_table_sizes_0] Evaluating hive_metastore.demo_db.any_status_table table size.
22:20:09 DEBUG [d.l.lsql.backends][DBFS_root_table_sizes_0] [spark][execute] ANALYZE table `hive_metastore`.`demo_db`.`any_status_table` compute STATISTICS NOSCAN
22:20:09 DEBUG [d.l.u.hive_metastore.table_size][DBFS_root_table_sizes_2] Evaluating hive_metastore.demo_db.any_new_cntl_table table size.
22:20:09 DEBUG [d.l.u.hive_metastore.table_size][DBFS_root_table_sizes_3] Evaluating hive_metastore.demo_db.any_cntl_table table size.
22:20:09 DEBUG [d.l.u.hive_metastore.table_size][DBFS_root_table_sizes_1] Evaluating hive_metastore.demo_db.any_new_status_table table size.
22:20:09 DEBUG [d.l.lsql.backends][DBFS_root_table_sizes_2] [spark][execute] ANALYZE table `hive_metastore`.`demo_db`.`any_new_cntl_table` compute STATISTICS NOSCAN
22:20:09 DEBUG [d.l.lsql.backends][DBFS_root_table_sizes_3] [spark][execute] ANALYZE table `hive_metastore`.`demo_db`.`any_cntl_table` compute STATISTICS NOSCAN
22:20:09 DEBUG [d.l.lsql.backends][DBFS_root_table_sizes_1] [spark][execute] ANALYZE table `hive_metastore`.`demo_db`.`any_new_status_table` compute STATISTICS NOSCAN
22:20:14  WARN [d.l.u.hive_metastore.table_size][DBFS_root_table_sizes_0] Failed to evaluate hive_metastore.demo_db.any_status_table table size. Table not found.
22:20:15  WARN [d.l.u.hive_metastore.table_size][DBFS_root_table_sizes_1] Failed to evaluate hive_metastore.demo_db.any_new_status_table table size. Table not found.
22:20:15  WARN [d.l.u.hive_metastore.table_size][DBFS_root_table_sizes_3] Failed to evaluate hive_metastore.demo_db.any_cntl_table table size. Table not found.
22:20:15  WARN [d.l.u.hive_metastore.table_size][DBFS_root_table_sizes_2] Failed to evaluate hive_metastore.demo_db.any_new_cntl_table table size. Table not found.
22:20:15  INFO [d.l.blueprint.parallel][DBFS_root_table_sizes_2] DBFS root table sizes 4/4, rps: 0.639/sec
22:20:15  INFO [d.l.blueprint.parallel] Finished 'DBFS root table sizes' tasks: 0% results available (0/4). Took 0:00:06.262416
22:20:15 DEBUG [d.l.u.framework.crawlers] [hive_metastore.ucx_7050934800642846.table_size] found 0 new records for table_size

bwmann89 · Jan 02 '25

table_size.csv

bwmann89 · Jan 02 '25

Attached is the output of the following queries:

select * from ucx_7050934800642846.tables
select * from ucx_7050934800642846.table_size

bwmann89 · Jan 02 '25

Attached are just a few of the DEBUG logs under /Applications/ucx/logs for the crawl_tables task, showing how many tables we have in our environment. Also attached is the estimate_table_size_for_migration.log file, which shows only the tables under the Hive metastore.

crawl_tables.log
crawl_tables.log.2025-01-13_21-01.log
crawl_tables.log.2025-01-13_21-02.log
crawl_tables.log.2025-01-13_21-03.log
estimate_table_size_for_migration.log

bwmann89 · Jan 13 '25

I'm missing the output for crawl_tables. In any case, to be clear: estimate_table_size_for_migration lists only the sizes of the tables that need to be physically copied (e.g., tables on the DBFS root, which are not supported by UC). Its purpose is to estimate the extra storage required, as well as how long copying those tables will take.

FastLee · Jan 21 '25
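For context, a query along these lines (a sketch, assuming the tables inventory populates location and object_type columns, as the UCX schema does) approximates which inventoried tables sit under the DBFS root and would therefore be candidates for this size estimation:

-- Managed tables whose data lives under the DBFS root (excluding mounts),
-- i.e. roughly the set estimate_table_size_for_migration would size.
-- Assumes the location and object_type columns are populated.
SELECT catalog, database, name, location
FROM ucx_7050934800642846.tables
WHERE object_type = 'MANAGED'
  AND location LIKE 'dbfs:/%'
  AND location NOT LIKE 'dbfs:/mnt/%'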

Not all log files for crawl_tables were attached, since there were over 20 of them covering all of our tables. Thank you for clarifying the purpose of estimate_table_size_for_migration. We do not plan to move or copy any of those tables from their current locations in S3; we just want them managed by Unity Catalog. Given that, would no copy or transfer of data be required, just a metadata change so that the tables are managed by UC?

bwmann89 · Jan 23 '25