[Bug][Lineage] Collect tables referenced in filter conditions for lineage analysis
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Search before asking
- [x] I have searched in the issues and found no similar issues.
Describe the bug
When a SQL query contains a subquery in the WHERE clause, the tables referenced within the subquery are not included in the extracted upstream table lineage.
For example,
insert overwrite v2_catalog.db.tb3
select *
from v2_catalog.db.tb1 t1
where exists (select 1 from v2_catalog.db.tb2 t2 where t2.col1 = t1.col1);
The current result is:
Lineage(
List("v2_catalog.db.tb1"),
List("v2_catalog.db.tb3"),
List(
("v2_catalog.db.tb3.col1", Set("v2_catalog.db.tb1.col1")),
("v2_catalog.db.tb3.col2", Set("v2_catalog.db.tb1.col2")),
("v2_catalog.db.tb3.col3", Set("v2_catalog.db.tb1.col3")))))
The output omits table v2_catalog.db.tb2, which is referenced in the filter condition.
So I propose adding a new configuration to control whether tables referenced in filter conditions are collected as lineage input tables.
Affects Version(s)
1.11.0
Kyuubi Server Log Output
Kyuubi Engine Log Output
Kyuubi Server Configurations
Kyuubi Engine Configurations
Additional context
No response
Are you willing to submit PR?
- [x] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- [ ] No. I cannot submit a PR at this time.
Hello @lyne7-sc, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.
After a quick look, it seems that we do not yet support the Filter operator, which may be why the input tables cannot be extracted.
The following code is the fallback case. It only recurses into child plans, but the condition of a Filter is an expression, not a child plan, so a subquery inside the condition is never visited.
case p =>
p.children.map(extractColumnsLineage(
_,
parentColumnsLineage,
inputTablesByPlan)).reduce(mergeColumnsLineage)
In addition, based on https://github.com/apache/kyuubi/pull/7184, I think this can be fixed by adding a Filter operator case.
I tested on Databricks, and db.tb2 really should be an upstream table.
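To illustrate the point about the fallback case, here is a minimal sketch using a toy plan model. It is not Spark's Catalyst classes and not the actual Kyuubi lineage code: `Plan`, `Expr`, `Exists`, `tablesViaChildren`, and `tablesWithFilter` are all hypothetical names made up for this example. It only shows that walking `children` skips a subquery held in a Filter condition, and that a dedicated Filter case can pick it up.

```scala
// Toy model for illustration only; all names here are hypothetical.
sealed trait Expr
case class ColumnRef(name: String) extends Expr
case class Exists(subquery: Plan) extends Expr

sealed trait Plan { def children: Seq[Plan] }
case class Relation(table: String) extends Plan {
  val children: Seq[Plan] = Nil
}
case class Project(columns: Seq[String], child: Plan) extends Plan {
  val children: Seq[Plan] = Seq(child)
}
case class Filter(condition: Expr, child: Plan) extends Plan {
  // The condition is an expression, so its subquery plan is NOT a child plan.
  val children: Seq[Plan] = Seq(child)
}

object LineageSketch {
  // Mirrors the fallback case: only child plans are visited.
  def tablesViaChildren(plan: Plan): Set[String] = plan match {
    case Relation(t) => Set(t)
    case p => p.children.flatMap(tablesViaChildren).toSet
  }

  // Adds a Filter case that also descends into a subquery in the condition.
  def tablesWithFilter(plan: Plan): Set[String] = plan match {
    case Relation(t) => Set(t)
    case Filter(cond, child) =>
      val fromCondition = cond match {
        case Exists(sub) => tablesWithFilter(sub)
        case _ => Set.empty[String]
      }
      fromCondition ++ tablesWithFilter(child)
    case p => p.children.flatMap(tablesWithFilter).toSet
  }

  // Shape of: select * from tb1 t1 where exists (select 1 from tb2 t2 where ...)
  val query: Plan = Project(
    Seq("*"),
    Filter(
      Exists(Filter(ColumnRef("t2.col1"), Relation("v2_catalog.db.tb2"))),
      Relation("v2_catalog.db.tb1")))

  def main(args: Array[String]): Unit = {
    println(tablesViaChildren(query).contains("v2_catalog.db.tb2")) // false
    println(tablesWithFilter(query).contains("v2_catalog.db.tb2"))  // true
  }
}
```

In the real code the traversal would presumably go through the subquery expressions in the Filter condition, and whether the extra tables are reported could be gated by the configuration proposed in the issue.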
Thanks for your comment and testing, I've submitted a PR to address the issue based on https://github.com/apache/kyuubi/pull/7184