prql Aggregate doesn't respect union's names

What happened?

Group aggregation after append doesn't preserve column alias (nor order) which leads to broken aggregations

PRQL input

prql target:sql.snowflake

let shape = p -> (
  p
  select {
    interaction_day = s"DATE_TRUNC('day', ts)",
    user_id = s"substr(user, length('u:') + 1)",
  }
)

from table1
shape
append (from  table2 | shape)
group { user_id } (
  aggregate {
    daily_interaction = count interaction_day
  }
)

SQL output

WITH table_0 AS (
  SELECT
    DATE_TRUNC('day', ts) AS interaction_day,
    substr(user, length('u:') + 1) AS user_id
  FROM
    table2
),
table_1 AS (
  SELECT
    substr(user, length('u:') + 1) AS user_id,
    DATE_TRUNC('day', ts) AS _expr_0
  FROM
    table1
  UNION
  ALL
  SELECT
    *
  FROM
    table_0
)
SELECT
  user_id,
  COUNT(*) AS daily_interaction
FROM
  table_1
GROUP BY
  user_id

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)

Expected SQL output

WITH table_0 AS (
  SELECT
    DATE_TRUNC('day', ts) AS interaction_day,
    substr(user, length('u:') + 1) AS user_id
  FROM
    table2
),
table_1 AS (
  SELECT
    substr(user, length('u:') + 1) AS user_id,
    DATE_TRUNC('day', ts) AS interaction_day
  FROM
    table1
  UNION
  ALL
  SELECT
    *
  FROM
    table_0
)
SELECT
  user_id,
  COUNT(*) AS daily_interaction
FROM
  table_1
GROUP BY
  user_id

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)

MVCE confirmation

[X] Minimal example
[ ] New issue

Anything else?

No response

Sep 27 '23 16:09 avikam

Thanks @avikam .

(For others looking — the error is DATE_TRUNC('day', ts) AS _expr_0)

Sep 27 '23 17:09 max-sixty

The naming of the column as _expr_0 is one error. The other is that the order of the columns in the first part of table_1 before the UNION (as was pointed out in the original report).

I believe the Expected SQL Output should actually be:

WITH table_0 AS (
  SELECT
    DATE_TRUNC('day', ts) AS interaction_day,
    substr(user, length('u:') + 1) AS user_id
  FROM
    table2
),
table_1 AS (
  SELECT
    DATE_TRUNC('day', ts) AS interaction_day,
    substr(user, length('u:') + 1) AS user_id
  FROM
    table1
  UNION
  ALL
  SELECT
    *
  FROM
    table_0
)
SELECT
  user_id,
  COUNT(*) AS daily_interaction
FROM
  table_1
GROUP BY
  user_id

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)

Sep 28 '23 19:09 snth

yeah it might be two separate issues.

the order issue can be reproduced with:

from tbl1
select { y, x }
append (
    from tbl2
    select { y, x }
)
group { x } (
  aggregate {
    tot = count y
  }
)

which currently transpiles as

WITH table_0 AS (
  SELECT
    y,
    x
  FROM
    tbl2
),
table_1 AS (
  SELECT
    x,
    y
  FROM
    tbl1
  UNION
  ALL
  SELECT
    *
  FROM
    table_0
)
SELECT
  x,
  COUNT(*) AS tot
FROM
  table_1
GROUP BY
  x

Sep 30 '23 14:09 avikam

looking more into it, i think the ordering issue can be reproduced even without an aggregation. Consider this example:

from tbl1
select { x, y }
append (from tbl2 | select { x, y } )
select { y, x}

This pipeline sequence is not split, and the last select dominates the projection of tbl1

select y, x from tbl1 union all select * from table_0 -- table_0 is a correct CTE

the reason i think it is related to the first issue, is that in that case there is a Union followed by Select, Aggregate which does introduce a split. Then anchor_split appends a projection transform, with columns that their order is determined by the Aggregate's partition and compute arguments. When these are not in the same order as the first select, we get a similar behavior.

all of that should be taken with a grain of salt, as this all is very speculative.

Oct 02 '23 01:10 avikam