druid icon indicating copy to clipboard operation
druid copied to clipboard

Incorrect virtual column reference in `UNNEST` native query plan

Open abhishekrb19 opened this issue 9 months ago • 5 comments

I have the following SQL query involving unnest and two CTEs that doesn't yield correct results. The query appears to be generating an incorrect native query plan:

WITH cte1 AS (
  SELECT 
    c1, 
    TIME_FLOOR(__time, 'PT5M') AS floored_time,
    STRLEN(c1) AS c1_len
  FROM unnest_ds
  GROUP BY 1, 2
),
cte2 AS (
  SELECT 
    c1,
    ARRAY_AGG(ARRAY[floored_time, c1_len]) AS pair
  FROM cte1
  GROUP BY 1
)

SELECT 
  tr.c1, 
  tv
FROM cte2 AS tr
CROSS JOIN UNNEST(pair) AS tv

The table unnest_ds can be created with this simple DML query with an inline table:

REPLACE INTO unnest_ds
OVERWRITE ALL
SELECT TIME_PARSE(ts) AS __time, c1
FROM (VALUES('2025-01-01 10:00:00', 'ABD'), ('2025-01-02 10:00:00', 'AX'), ('2025-02-01 10:00:00', 'BYOD')) AS t(ts, c1) 
PARTITIONED BY HOUR

The above SQL query results in a native plan:

Native JSON query
{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
      "type": "query",
      "query": {
        "queryType": "groupBy",
        "dataSource": {
          "type": "query",
          "query": {
            "queryType": "groupBy",
            "dataSource": {
              "type": "table",
              "name": "unnest_ds"
            },
            "intervals": {
              "type": "intervals",
              "intervals": [
                "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
              ]
            },
            "virtualColumns": [
              {
                "type": "expression",
                "name": "v0",
                "expression": "timestamp_floor(\"__time\",'PT5M',null,'UTC')",
                "outputType": "LONG"
              }
            ],
            "granularity": {
              "type": "all"
            },
            "dimensions": [
              {
                "type": "default",
                "dimension": "c1",
                "outputName": "d0",
                "outputType": "STRING"
              },
              {
                "type": "default",
                "dimension": "v0",
                "outputName": "d1",
                "outputType": "LONG"
              }
            ],
            "limitSpec": {
              "type": "NoopLimitSpec"
            },
            "context": {
              "queryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
              "sqlOuterLimit": 1001,
              "sqlQueryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
              "sqlStringifyArrays": false,
              "timestampResultField": "d1",
              "timestampResultFieldGranularity": "\"FIVE_MINUTE\"",
              "timestampResultFieldInOriginalDimensions": 1,
              "useNativeQueryExplain": true
            }
          }
        },
        "intervals": {
          "type": "intervals",
          "intervals": [
            "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
          ]
        },
        "virtualColumns": [
          {
            "type": "expression",
            "name": "v0",
            "expression": "array(\"d1\",strlen(\"d0\"))",
            "outputType": "ARRAY<LONG>"
          }
        ],
        "granularity": {
          "type": "all"
        },
        "dimensions": [
          {
            "type": "default",
            "dimension": "d0",
            "outputName": "_d0",
            "outputType": "STRING"
          }
        ],
        "aggregations": [
          {
            "type": "expression",
            "name": "a0",
            "fields": [
              "v0"
            ],
            "accumulatorIdentifier": "__acc",
            "initialValue": "ARRAY<ARRAY<LONG>>[]",
            "initialCombineValue": "ARRAY<ARRAY<LONG>>[]",
            "isNullUnlessAggregated": true,
            "shouldAggregateNullInputs": true,
            "shouldCombineAggregateNullInputs": false,
            "fold": "array_append(\"__acc\", \"v0\")",
            "combine": "array_concat(\"__acc\", \"a0\")",
            "maxSizeBytes": 1024
          }
        ],
        "limitSpec": {
          "type": "NoopLimitSpec"
        },
        "context": {
          "queryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
          "sqlOuterLimit": 1001,
          "sqlQueryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
          "sqlStringifyArrays": false,
          "useNativeQueryExplain": true
        }
      }
    },
    "virtualColumn": {
      "type": "expression",
      "name": "j0.unnest",
      "expression": "timestamp_floor(\"__time\",'PT5M',null,'UTC')",
      "outputType": "LONG"
    },
    "unnestFilter": null
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "resultFormat": "compactedList",
  "limit": 1001,
  "columns": [
    "_d0",
    "j0.unnest"
  ],
  "context": {
    "queryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
    "sqlOuterLimit": 1001,
    "sqlQueryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
    "sqlStringifyArrays": false,
    "useNativeQueryExplain": true
  },
  "columnTypes": [
    "STRING",
    "ARRAY<LONG>"
  ],
  "granularity": {
    "type": "all"
  },
  "legacy": false
}

The problematic part of the plan is:

    "virtualColumn": {
      "type": "expression",
      "name": "j0.unnest",
      "expression": "timestamp_floor(\"__time\",'PT5M',null,'UTC')",
      "outputType": "LONG"
    },

The unnest column is using the timestamp_floor expression from the first CTE instead of the a0 expression from the second CTE whose output type is "ARRAY<ARRAY<LONG>>".

Updating that in the plan and running it yields correct results.

Updated plan with the correct unnest expression
{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
      "type": "query",
      "query": {
        "queryType": "groupBy",
        "dataSource": {
          "type": "query",
          "query": {
            "queryType": "groupBy",
            "dataSource": {
              "type": "table",
              "name": "unnest_ds"
            },
            "intervals": {
              "type": "intervals",
              "intervals": [
                "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
              ]
            },
            "virtualColumns": [
              {
                "type": "expression",
                "name": "v0",
                "expression": "timestamp_floor(\"__time\",'PT5M',null,'UTC')",
                "outputType": "LONG"
              }
            ],
            "granularity": {
              "type": "all"
            },
            "dimensions": [
              {
                "type": "default",
                "dimension": "c1",
                "outputName": "d0",
                "outputType": "STRING"
              },
              {
                "type": "default",
                "dimension": "v0",
                "outputName": "d1",
                "outputType": "LONG"
              }
            ],
            "limitSpec": {
              "type": "NoopLimitSpec"
            },
            "context": {
              "queryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
              "sqlOuterLimit": 1001,
              "sqlQueryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
              "sqlStringifyArrays": false,
              "timestampResultField": "d1",
              "timestampResultFieldGranularity": "\"FIVE_MINUTE\"",
              "timestampResultFieldInOriginalDimensions": 1,
              "useNativeQueryExplain": true
            }
          }
        },
        "intervals": {
          "type": "intervals",
          "intervals": [
            "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
          ]
        },
        "virtualColumns": [
          {
            "type": "expression",
            "name": "v0",
            "expression": "array(\"d1\",strlen(\"d0\"))",
            "outputType": "ARRAY<LONG>"
          }
        ],
        "granularity": {
          "type": "all"
        },
        "dimensions": [
          {
            "type": "default",
            "dimension": "d0",
            "outputName": "_d0",
            "outputType": "STRING"
          }
        ],
        "aggregations": [
          {
            "type": "expression",
            "name": "a0",
            "fields": [
              "v0"
            ],
            "accumulatorIdentifier": "__acc",
            "initialValue": "ARRAY<ARRAY<LONG>>[]",
            "initialCombineValue": "ARRAY<ARRAY<LONG>>[]",
            "isNullUnlessAggregated": true,
            "shouldAggregateNullInputs": true,
            "shouldCombineAggregateNullInputs": false,
            "fold": "array_append(\"__acc\", \"v0\")",
            "combine": "array_concat(\"__acc\", \"a0\")",
            "maxSizeBytes": 1024
          }
        ],
        "limitSpec": {
          "type": "NoopLimitSpec"
        },
        "context": {
          "queryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
          "sqlOuterLimit": 1001,
          "sqlQueryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
          "sqlStringifyArrays": false,
          "useNativeQueryExplain": true
        }
      }
    },
    "virtualColumn": {
      "type": "expression",
      "name": "j0.unnest",
      "expression": "a0",
      "outputType": "ARRAY<ARRAY<LONG>>"
    },
    "unnestFilter": null
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "resultFormat": "compactedList",
  "limit": 1001,
  "columns": [
    "_d0",
    "j0.unnest"
  ],
  "context": {
    "queryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
    "sqlOuterLimit": 1001,
    "sqlQueryId": "cde89494-705c-4edd-a8a1-dda7cc7f5457",
    "sqlStringifyArrays": false,
    "useNativeQueryExplain": true
  },
  "columnTypes": [
    "STRING",
    "ARRAY<LONG>"
  ],
  "granularity": {
    "type": "all"
  },
  "legacy": false
}
<       "expression": "timestamp_floor(\"__time\",'PT5M',null,'UTC')",
<       "outputType": "LONG"
---
>       "expression": "a0",
>       "outputType": "ARRAY<ARRAY<LONG>>"

Actual resultset:

c1 tv
ABD [0]
AX [0]
BYOD [0]

Expected resultset:

c1 tv
ABD [1735725600000,3]
AX [1735812000000,2]
BYOD [1738404000000,4]

Affected Version

32.0.0 & 2c46787d8c399b96f88dcf3a14c68b6753b12464 (tip of master)

Description

The issue appears to be related to how virtual columns are generated and handled during the native query planning stage. Specifically, when a GROUP BY clause is present in the first CTE, the planner incorrectly references virtual columns from the first native query in the unnest phase, instead of those generated by the second native query. Removing the GROUP BY from the first CTE and modifying the structure of the second CTE resolves the issue, but this approach loses the original grouping context.

abhishekrb19 avatar Apr 28 '25 19:04 abhishekrb19

Linking https://github.com/apache/druid/pull/17952 that adds a unit test to help reproduce the issue.

abhishekrb19 avatar Apr 28 '25 21:04 abhishekrb19

Seems like a genuine issue. My initial guess would be look into : https://github.com/apache/druid/blob/6d5b88c8b5b9358c97e71302de8787a44408809e/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidCorrelateUnnestRel.java#L79

cryptoe avatar Apr 30 '25 13:04 cryptoe

Possibly the same issue as #17949?

Fwiw, I tried this query with "plannerStrategy": "DECOUPLED" and it generated the correct plan. So the issue must be with the coupled planner rules, possibly the file @cryptoe pointed at. There were some changes to that file in #14510 that look possibly related. A bunch of code was added in that patch that unwraps DruidOuterQueryRel in various ways; perhaps one of these sections is doing an incorrect operation. If so, that would explain why the expression is seemingly drawn from the wrong subquery.

gianm avatar May 08 '25 04:05 gianm

Thanks @gianm @cryptoe. Yes, setting "plannerStrategy": "DECOUPLED" is the current workaround we have that produces correct results.

Related, some unnest queries in the decoupled planning mode generated sub-optimal plans taking up too much time and/or resources. I pulled in this change from 33.0.0: https://github.com/apache/druid/pull/17550, and after that, the plans looked much better. Leaving this comment here in case someone using an older Druid version encounters a similar issue with unnest.

abhishekrb19 avatar May 08 '25 05:05 abhishekrb19

Good to know! Fwiw, there is a current effort led by @kgyrtkirk to push towards decoupled planning being the primary planning method, which should be helpful.

gianm avatar May 08 '25 05:05 gianm