Fix panic for `GROUPING SETS(())` and handle empty-grouping aggregates
Which issue does this PR close?
- Closes #18974.
Rationale for this change
The DataFusion CLI currently panics with an "index out of bounds" error when executing queries that use GROUP BY GROUPING SETS(()), such as:
SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())
This panic originates in the physical aggregation code, which assumes that an empty list of grouping expressions always corresponds to "no grouping". That assumption breaks down in the presence of GROUPING SETS, where an empty set is a valid grouping set that should still produce a result row (and __grouping_id) rather than crashing.
This PR fixes the panic by explicitly distinguishing:
- true "no GROUP BY" aggregations, and
-
GROUPING SETS/CUBE/ROLLUPplans that may have empty grouping expressions but still require grouping-set semantics and a valid__grouping_id.
The change restores robustness of the CLI and ensures standards-compliant behavior for grouping sets with empty sets.
What changes are included in this PR?
Summary of the main changes:
-
Track grouping-set usage explicitly in
PhysicalGroupBy:-
Add a
has_grouping_set: boolfield toPhysicalGroupBy. -
Extend
PhysicalGroupBy::newto accept thehas_grouping_setflag. -
Add helper methods:
-
has_grouping_set(&self) -> boolto expose the flag, and -
is_true_no_grouping(&self) -> boolto represent the case of genuinely no grouping (no GROUP BY and no grouping sets).
-
-
-
Correct group state construction for empty grouping with grouping sets:
- Update
PhysicalGroupBy::from_pre_groupso that it only treatsexpr.is_empty()as "no groups" whenhas_grouping_setisfalse. - For
GROUPING SETS(()), we now build at least one group, avoiding the previous out-of-bounds access ongroups[0].
- Update
-
Clarify when
__grouping_idshould be present:- Replace the previous
is_singlelogic with a clearer distinction based onhas_grouping_set. -
num_output_exprs,output_exprs,num_group_exprs, andgroup_schemanow add the__grouping_idcolumn only whenhas_grouping_setistrue. -
is_singleis redefined as "simple GROUP BY" (no grouping sets), i.e.!self.has_grouping_set.
- Replace the previous
-
Integrate the new semantics into
AggregateExec:- Use
group_by.is_true_no_grouping()instead ofgroup_by.expr.is_empty()when choosing between the specialized no-grouping aggregation path and grouped aggregation. - Ensure that
is_unordered_unfiltered_group_by_distinctonly treats plans as grouped when there are grouping expressions and no grouping sets (!has_grouping_set). - Preserve existing behavior for regular
GROUP BYwhile correctly handlingGROUPING SETSand related constructs.
- Use
-
Support
__grouping_idwith the no-grouping aggregation stream:- Extend
AggregateStreamInnerwith an optionalgrouping_id: Option<ScalarValue>field. - Change
AggregateStream::newto accept agrouping_idargument. - Introduce
prepend_grouping_id_columnto prepend a__grouping_idcolumn to the finalized accumulator output when needed. - Wire this up so that no-grouping aggregations can still match a schema that includes
__grouping_idin grouping-set scenarios.
- Extend
-
Planner and execution wiring updates:
-
Update all
PhysicalGroupBy::newcall sites to pass the correcthas_grouping_setvalue:-
falsefor:- ordinary
GROUP BYor truly no-grouping aggregates.
- ordinary
-
truefor:-
GROUPING SETS, -
CUBE, and -
ROLLUPphysical planning paths.
-
-
-
Ensure
merge_grouping_set_physical_expr,create_cube_physical_expr, andcreate_rollup_physical_exprcorrectly mark grouping-set plans.
-
-
Protobuf / physical plan round-trip support:
- Extend
AggregateExecNodeindatafusion.protowith a newbool has_grouping_set = 12;field. - Update the generated
pbjsonandprostcode to serialize and deserialize the new field. - When constructing
AggregateExecfrom protobuf, pass the decodedhas_grouping_setintoPhysicalGroupBy::new. - When serializing an
AggregateExecback to protobuf, sethas_grouping_setbased onexec.group_expr().has_grouping_set(). - Update round-trip physical plan tests to include the new field in their expectations.
- Extend
-
Tests and SQL logic coverage:
-
Add sqllogictests for the previously failing cases in
grouping.slt:-
SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (()); -
SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())(the original panic case).
-
-
Extend or adjust unit tests in
aggregates,physical_planner,filter_pushdown, andcoopmodules to account for thehas_grouping_setflag inPhysicalGroupByand expected debug output. -
Update proto round-trip tests to validate
has_grouping_setis preserved.
-
Are these changes tested?
Yes.
-
New sqllogictests covering
GROUPING SETS(())for both a regular table andgenerate_series(10):-
grouping.sltnow asserts the expected scalar results (e.g.2and55), preventing regressions on this edge case.
-
-
Updated and existing Rust unit tests:
-
physical-plan/src/aggregatestests updated to includehas_grouping_setinPhysicalGroupByexpectations. - Planner and optimizer tests (e.g.
physical_planner.rs,filter_pushdown) updated to constructPhysicalGroupBywith the new flag. - Execution tests in
core/tests/execution/coop.rsupdated to reflect the new constructor and continue to exercise the no-grouping aggregation path. - Protobuf round-trip tests extended to verify that
has_grouping_setis correctly serialized and deserialized.
-
These tests collectively ensure that:
- the panic is fixed,
- the aggregation semantics for
GROUPING SETS(())are correct, and - existing aggregate behavior remains unchanged for non-grouping-set queries.
Are there any user-facing changes?
Yes, but they are bug fixes and behavior clarifications rather than breaking changes:
-
Queries using
GROUP BY GROUPING SETS(())no longer cause a runtime panic in the DataFusion CLI.- Instead, they return the expected single aggregate row (e.g.
COUNT(*)orSUM(v1)), consistent with SQL semantics.
- Instead, they return the expected single aggregate row (e.g.
-
For plans using
GROUPING SETS,CUBE, orROLLUP, the internal__grouping_idcolumn is now present consistently whenever grouping sets are in use, even when the grouping expressions are empty. -
For ordinary
GROUP BYqueries that do not use grouping sets, behavior is unchanged: no unexpected__grouping_idcolumn is added.
No API signatures were changed in a breaking way for downstream users; the additions are internal flags and protobuf fields to accurately represent the physical plan.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.