[Bug][Lineage] Collect tables referenced in filter conditions for lineage analysis

Open lyne7-sc opened this issue 5 months ago • 4 comments

Code of Conduct

Search before asking

  • [x] I have searched in the issues and found no similar issues.

Describe the bug

When a SQL query contains a subquery in the WHERE clause, the tables referenced within the subquery are not included in the extracted upstream table lineage.

For example,

insert overwrite v2_catalog.db.tb3
select *
from v2_catalog.db.tb1 t1
where exists (select 1 from v2_catalog.db.tb2 t2 where t2.col1 = t1.col1);

the current result is:

Lineage(
        List("v2_catalog.db.tb1"),
        List("v2_catalog.db.tb3"),
        List(
          ("v2_catalog.db.tb3.col1", Set("v2_catalog.db.tb1.col1")),
          ("v2_catalog.db.tb3.col2", Set("v2_catalog.db.tb1.col2")),
          ("v2_catalog.db.tb3.col3", Set("v2_catalog.db.tb1.col3"))))

The output omits table v2_catalog.db.tb2, which is referenced in the filter condition.

So I propose adding a new configuration to control whether tables referenced in filter conditions are collected as lineage input tables.

Affects Version(s)

1.11.0

Kyuubi Server Log Output


Kyuubi Engine Log Output


Kyuubi Server Configurations


Kyuubi Engine Configurations


Additional context

No response

Are you willing to submit PR?

  • [x] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • [ ] No. I cannot submit a PR at this time.

lyne7-sc avatar Sep 17 '25 13:09 lyne7-sc

Hello @lyne7-sc, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.

github-actions[bot] avatar Sep 17 '25 13:09 github-actions[bot]

After a quick look, it seems that the Filter operator is not yet supported, which would explain why the input tables can't be extracted. The following code is the fallback logic for operators without a dedicated case. It only recurses into child plans, but the condition of a Filter is an expression, not a child plan.

case p =>
        p.children.map(extractColumnsLineage(
          _,
          parentColumnsLineage,
          inputTablesByPlan)).reduce(mergeColumnsLineage)
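
To illustrate why a children-only traversal misses tb2, here is a minimal, self-contained sketch. It uses toy stand-ins for plans and expressions (`Plan`, `Expr`, `Exists`, `Filter`, and `Relation` are hypothetical simplifications, not the real Spark Catalyst or Kyuubi classes), and it only collects table names, not column lineage. It is an assumption about the shape of a possible fix, not the actual change.

```scala
// Toy stand-ins for Catalyst-style trees; NOT the real Spark or Kyuubi API.
sealed trait Expr { def children: Seq[Expr] = Nil }
case class Col(name: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr {
  override def children: Seq[Expr] = Seq(left, right)
}
// A subquery expression holds a plan, but that plan is not a plan child.
case class Exists(plan: Plan) extends Expr

sealed trait Plan { def children: Seq[Plan] }
case class Relation(table: String) extends Plan {
  val children: Seq[Plan] = Nil
}
case class Filter(condition: Expr, child: Plan) extends Plan {
  val children: Seq[Plan] = Seq(child)
}

object LineageSketch {
  // Mirrors the fallback case: only descends into plan children.
  def tablesViaChildren(plan: Plan): Set[String] = plan match {
    case Relation(t) => Set(t)
    case p => p.children.flatMap(tablesViaChildren).toSet
  }

  // Finds the plans nested inside subquery expressions of a condition.
  def subqueryPlans(e: Expr): Seq[Plan] = e match {
    case Exists(p) => Seq(p)
    case other => other.children.flatMap(subqueryPlans)
  }

  // Adds a dedicated Filter case that also visits subquery plans.
  def tablesWithFilter(plan: Plan): Set[String] = plan match {
    case Relation(t) => Set(t)
    case Filter(cond, child) =>
      tablesWithFilter(child) ++ subqueryPlans(cond).flatMap(tablesWithFilter)
    case p => p.children.flatMap(tablesWithFilter).toSet
  }

  // Toy version of the query from the report:
  // select * from tb1 t1 where exists (select 1 from tb2 t2 where t2.col1 = t1.col1)
  val subquery: Plan =
    Filter(EqualTo(Col("t2.col1"), Col("t1.col1")), Relation("v2_catalog.db.tb2"))
  val query: Plan = Filter(Exists(subquery), Relation("v2_catalog.db.tb1"))
}
```

In this sketch, `tablesViaChildren(query)` sees only tb1 because the subquery is reachable only through the Filter condition, while `tablesWithFilter(query)` also picks up tb2.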

yabola avatar Sep 17 '25 17:09 yabola

In addition, based on https://github.com/apache/kyuubi/pull/7184, I think this can be fixed after adding a case for the Filter operator. I tested on Databricks, and db.tb2 really should be an upstream table.

Image

yabola avatar Sep 18 '25 02:09 yabola

Thanks for your comment and testing. I've submitted a PR to address the issue based on https://github.com/apache/kyuubi/pull/7184

lyne7-sc avatar Sep 18 '25 14:09 lyne7-sc