datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Fix panic for `GROUPING SETS(())` and handle empty-grouping aggregates

Open kosiew opened this issue 2 months ago • 0 comments

Which issue does this PR close?

  • Closes #18974.

Rationale for this change

The DataFusion CLI currently panics with an "index out of bounds" error when executing queries that use GROUP BY GROUPING SETS(()), such as:

SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())

This panic originates in the physical aggregation code, which assumes that an empty list of grouping expressions always corresponds to "no grouping". That assumption breaks down in the presence of GROUPING SETS, where an empty set is a valid grouping set that should still produce a result row (and __grouping_id) rather than crashing.

This PR fixes the panic by explicitly distinguishing:

  • true "no GROUP BY" aggregations, and
  • GROUPING SETS/CUBE/ROLLUP plans that may have empty grouping expressions but still require grouping-set semantics and a valid __grouping_id.

The change restores robustness of the CLI and ensures standards-compliant behavior for grouping sets with empty sets.

What changes are included in this PR?

Summary of the main changes:

  • Track grouping-set usage explicitly in PhysicalGroupBy:

    • Add a has_grouping_set: bool field to PhysicalGroupBy.

    • Extend PhysicalGroupBy::new to accept the has_grouping_set flag.

    • Add helper methods:

      • has_grouping_set(&self) -> bool to expose the flag, and
      • is_true_no_grouping(&self) -> bool to represent the case of genuinely no grouping (no GROUP BY and no grouping sets).
  • Correct group state construction for empty grouping with grouping sets:

    • Update PhysicalGroupBy::from_pre_group so that it only treats expr.is_empty() as "no groups" when has_grouping_set is false.
    • For GROUPING SETS(()), we now build at least one group, avoiding the previous out-of-bounds access on groups[0].
  • Clarify when __grouping_id should be present:

    • Replace the previous is_single logic with a clearer distinction based on has_grouping_set.
    • num_output_exprs, output_exprs, num_group_exprs, and group_schema now add the __grouping_id column only when has_grouping_set is true.
    • is_single is redefined as "simple GROUP BY" (no grouping sets), i.e. !self.has_grouping_set.
  • Integrate the new semantics into AggregateExec:

    • Use group_by.is_true_no_grouping() instead of group_by.expr.is_empty() when choosing between the specialized no-grouping aggregation path and grouped aggregation.
    • Ensure that is_unordered_unfiltered_group_by_distinct only treats plans as grouped when there are grouping expressions and no grouping sets (!has_grouping_set).
    • Preserve existing behavior for regular GROUP BY while correctly handling GROUPING SETS and related constructs.
  • Support __grouping_id with the no-grouping aggregation stream:

    • Extend AggregateStreamInner with an optional grouping_id: Option<ScalarValue> field.
    • Change AggregateStream::new to accept a grouping_id argument.
    • Introduce prepend_grouping_id_column to prepend a __grouping_id column to the finalized accumulator output when needed.
    • Wire this up so that no-grouping aggregations can still match a schema that includes __grouping_id in grouping-set scenarios.
  • Planner and execution wiring updates:

    • Update all PhysicalGroupBy::new call sites to pass the correct has_grouping_set value:

      • false for:

        • ordinary GROUP BY or truly no-grouping aggregates.
      • true for:

        • GROUPING SETS,
        • CUBE, and
        • ROLLUP physical planning paths.
    • Ensure merge_grouping_set_physical_expr, create_cube_physical_expr, and create_rollup_physical_expr correctly mark grouping-set plans.

  • Protobuf / physical plan round-trip support:

    • Extend AggregateExecNode in datafusion.proto with a new bool has_grouping_set = 12; field.
    • Update the generated pbjson and prost code to serialize and deserialize the new field.
    • When constructing AggregateExec from protobuf, pass the decoded has_grouping_set into PhysicalGroupBy::new.
    • When serializing an AggregateExec back to protobuf, set has_grouping_set based on exec.group_expr().has_grouping_set().
    • Update round-trip physical plan tests to include the new field in their expectations.
  • Tests and SQL logic coverage:

    • Add sqllogictests for the previously failing cases in grouping.slt:

      • SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (());
      • SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(()) (the original panic case).
    • Extend or adjust unit tests in aggregates, physical_planner, filter_pushdown, and coop modules to account for the has_grouping_set flag in PhysicalGroupBy and expected debug output.

    • Update proto round-trip tests to validate has_grouping_set is preserved.

Are these changes tested?

Yes.

  • New sqllogictests covering GROUPING SETS(()) for both a regular table and generate_series(10):

    • grouping.slt now asserts the expected scalar results (e.g. 2 and 55), preventing regressions on this edge case.
  • Updated and existing Rust unit tests:

    • physical-plan/src/aggregates tests updated to include has_grouping_set in PhysicalGroupBy expectations.
    • Planner and optimizer tests (e.g. physical_planner.rs, filter_pushdown) updated to construct PhysicalGroupBy with the new flag.
    • Execution tests in core/tests/execution/coop.rs updated to reflect the new constructor and continue to exercise the no-grouping aggregation path.
    • Protobuf round-trip tests extended to verify that has_grouping_set is correctly serialized and deserialized.

These tests collectively ensure that:

  • the panic is fixed,
  • the aggregation semantics for GROUPING SETS(()) are correct, and
  • existing aggregate behavior remains unchanged for non-grouping-set queries.

Are there any user-facing changes?

Yes, but they are bug fixes and behavior clarifications rather than breaking changes:

  • Queries using GROUP BY GROUPING SETS(()) no longer cause a runtime panic in the DataFusion CLI.

    • Instead, they return the expected single aggregate row (e.g. COUNT(*) or SUM(v1)), consistent with SQL semantics.
  • For plans using GROUPING SETS, CUBE, or ROLLUP, the internal __grouping_id column is now present consistently whenever grouping sets are in use, even when the grouping expressions are empty.

  • For ordinary GROUP BY queries that do not use grouping sets, behavior is unchanged: no unexpected __grouping_id column is added.

No API signatures were changed in a breaking way for downstream users; the additions are internal flags and protobuf fields to accurately represent the physical plan.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew avatar Dec 10 '25 09:12 kosiew