Aggregate doesn't respect union's names
What happened?
Group aggregation after append doesn't preserve column alias (nor order) which leads to broken aggregations
PRQL input
prql target:sql.snowflake
let shape = p -> (
p
select {
interaction_day = s"DATE_TRUNC('day', ts)",
user_id = s"substr(user, length('u:') + 1)",
}
)
from table1
shape
append (from table2 | shape)
group { user_id } (
aggregate {
daily_interaction = count interaction_day
}
)
SQL output
WITH table_0 AS (
SELECT
DATE_TRUNC('day', ts) AS interaction_day,
substr(user, length('u:') + 1) AS user_id
FROM
table2
),
table_1 AS (
SELECT
substr(user, length('u:') + 1) AS user_id,
DATE_TRUNC('day', ts) AS _expr_0
FROM
table1
UNION
ALL
SELECT
*
FROM
table_0
)
SELECT
user_id,
COUNT(*) AS daily_interaction
FROM
table_1
GROUP BY
user_id
-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)
Expected SQL output
WITH table_0 AS (
SELECT
DATE_TRUNC('day', ts) AS interaction_day,
substr(user, length('u:') + 1) AS user_id
FROM
table2
),
table_1 AS (
SELECT
substr(user, length('u:') + 1) AS user_id,
DATE_TRUNC('day', ts) AS interaction_day
FROM
table1
UNION
ALL
SELECT
*
FROM
table_0
)
SELECT
user_id,
COUNT(*) AS daily_interaction
FROM
table_1
GROUP BY
user_id
-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)
MVCE confirmation
- [X] Minimal example
- [ ] New issue
Anything else?
No response
Thanks @avikam .
(For others looking — the error is DATE_TRUNC('day', ts) AS _expr_0)
The naming of the column as _expr_0 is one error. The other is that the order of the columns in the first part of table_1 before the UNION (as was pointed out in the original report).
I believe the Expected SQL Output should actually be:
WITH table_0 AS (
SELECT
DATE_TRUNC('day', ts) AS interaction_day,
substr(user, length('u:') + 1) AS user_id
FROM
table2
),
table_1 AS (
SELECT
DATE_TRUNC('day', ts) AS interaction_day,
substr(user, length('u:') + 1) AS user_id
FROM
table1
UNION
ALL
SELECT
*
FROM
table_0
)
SELECT
user_id,
COUNT(*) AS daily_interaction
FROM
table_1
GROUP BY
user_id
-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)
yeah it might be two separate issues.
the order issue can be reproduced with:
from tbl1
select { y, x }
append (
from tbl2
select { y, x }
)
group { x } (
aggregate {
tot = count y
}
)
which currently transpiles as
WITH table_0 AS (
SELECT
y,
x
FROM
tbl2
),
table_1 AS (
SELECT
x,
y
FROM
tbl1
UNION
ALL
SELECT
*
FROM
table_0
)
SELECT
x,
COUNT(*) AS tot
FROM
table_1
GROUP BY
x
looking more into it, i think the ordering issue can be reproduced even without an aggregation. Consider this example:
from tbl1
select { x, y }
append (from tbl2 | select { x, y } )
select { y, x}
This pipeline sequence is not split, and the last select dominates the projection of tbl1
select y, x from tbl1 union all select * from table_0 -- table_0 is a correct CTE
the reason i think it is related to the first issue, is that in that case there is a Union followed by Select, Aggregate which does introduce a split. Then anchor_split appends a projection transform, with columns that their order is determined by the Aggregate's partition and compute arguments. When these are not in the same order as the first select, we get a similar behavior.
all of that should be taken with a grain of salt, as this all is very speculative.