SNOW-781480: group_by and groupBy methods add undesired cols to select
Expected behaviour:
The following Python code
from logging import getLogger, INFO, FileHandler
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import col, count, to_date
# set logger to info, and save queries to log-file
getLogger("snowflake.connector.cursor").setLevel(INFO)
getLogger("snowflake.connector.cursor").addHandler(FileHandler("cursor.log"))
# start session, change query length such that full query is displayed
session = Session.builder.configs(
dict(
authenticator="externalbrowser",
account="",
user="",
role="",
warehouse="",
database="",
schema="",
log_max_query_length=1000000,
)
).create()
# run query
session.table("table").group_by(col("col_a")).agg(count("*").as_("num")).collect()
Should result in the first query shown below, but actually results in the second (see generated log-file), note how col_a is added to the select, which is undesired.
SELECT count(1) AS "NUM" FROM ( SELECT * FROM table) GROUP BY "COL_A"
SELECT "COL_A", count(1) AS "NUM" FROM ( SELECT * FROM table) GROUP BY "COL_A"
Why this is a problem
Because of this, it is not possible to rename a column. The following Python code, for example:
session.table("table").group_by(col("event_day")).agg(
to_date(col("event_date")).as_("event_day"), count("*")
).collect()
Would ideally result in the first query shown below, but actually results in the second query, which gives an error, because the first select statement (EVENT_DAY) is undefined. It is also not possible to rename the column in the group_by as this is not allowed.
SELECT to_date("EVENT_DATE") AS "EVENT_DAY", count(1) AS "COUNT(LITERAL())" FROM ( SELECT * FROM table) GROUP BY "EVENT_DAY"
SELECT "EVENT_DAY", to_date("EVENT_DATE") AS "EVENT_DAY", count(1) AS "COUNT(LITERAL())" FROM ( SELECT * FROM table) GROUP BY "EVENT_DAY"
Work-around
The way to circumvent this is to start with a select statement:
session.table("table").select(
to_date(col("event_date")).as_("event_day")
).group_by(col("event_day")).agg(count("*").as_("num")).collect()
but this results in a nested select statement, which is slower
SELECT "EVENT_DAY", count(1) AS "NUM" FROM ( SELECT to_date("EVENT_DATE") AS "EVENT_DAY" FROM table) GROUP BY "EVENT_DAY"
thanks for your feedback, at a glance the field EVENT_DAY is not needed in the generated sql.
@sfc-gh-yixie @sfc-gh-jdu do you think it's a case that we can improve?
My understanding is usually when we group by a column the result has the group by column and the aggregate column.
Your workaround is a good workaround. Another way to do it is:
session.table("table").group_by(col("event_day")).agg(
to_date(col("event_date")).as_("event_day"), count("*")
)
# rename
session.table("table").group_by(col("event_date")).agg(
to_date(col("event_date")), count("*")
).with_column_renamed("event_date", "event_day")