[Hive | Profiling] The Hive Profiling workflow throws some errors due to how the queries are formatted

Open IceS2 opened this issue 1 year ago • 0 comments

Affected module Ingestion Framework Profiler

Describe the bug While running a Profiler workflow on Hive I got thrown this error:

[SQL: WITH `test_rnd` AS 
(SELECT `default`.`test`.`id` AS `id`, ABS(RAND()) * 100 %% %(param_1)s AS `random` 
FROM `default`.`test` ), 
`test_sample` AS 
(SELECT `test_rnd`.`id` AS `id`, `test_rnd`.`random` AS `random` 
FROM `test_rnd` 
WHERE `test_rnd`.`random` <= %(random_1)s)
 SELECT count(*) AS `uniqueCount` 
FROM (SELECT count(`id`) AS `count_1` 
FROM `test_sample` GROUP BY `id` 
HAVING count(`id`) = %(count_2)s) AS `only_once`
 LIMIT %(param_2)s]
[parameters: {'param_1': 100, 'random_1': 5.0, 'count_2': 1, 'param_2': 1}]

It looks like hive doesn't like the default.test.id column name. The following query gives the same error on a SQL Client:

SELECT `default`.`test`.`id` from `default`.`test`;

But the following one works fine:

SELECT `test`.`id` from `default`.`test`;

To Reproduce

Spin up Openmetadata from the 1.4 branch
Spin up a Hive environment

docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --name metastore-standalone apache/hive:3.1.3
docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:3.1.3

Create a Table with the following DDL

docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/'

CREATE TABLE test (id int);

Register and Run a Metadata ingestion
Register and Run a Profiling Ingestion
Check the Workflow and HiveServer2 Logs

Expected behavior No errors

Version:

OS: N/A
Python version: N/A
OpenMetadata version: 1.4.0 branch
OpenMetadata Ingestion package version: 1.4.0 branch

Additional context N/A

May 02 '24 15:05 IceS2