snowpark-python SNOW-781480: group_by and groupBy methods add undesired cols to select

Expected behaviour:

The following Python code

from logging import getLogger, INFO, FileHandler
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import col, count, to_date

# set logger to info, and save queries to log-file
getLogger("snowflake.connector.cursor").setLevel(INFO)
getLogger("snowflake.connector.cursor").addHandler(FileHandler("cursor.log"))

# start session, change query length such that full query is displayed
session = Session.builder.configs(
    dict(
        authenticator="externalbrowser",
        account="",
        user="",
        role="",
        warehouse="",
        database="",
        schema="",
        log_max_query_length=1000000,
    )
).create()

# run query
session.table("table").group_by(col("col_a")).agg(count("*").as_("num")).collect()

Should result in the first query shown below, but actually results in the second (see generated log-file), note how col_a is added to the select, which is undesired.

SELECT count(1) AS "NUM" FROM ( SELECT  *  FROM table) GROUP BY "COL_A"
SELECT "COL_A", count(1) AS "NUM" FROM ( SELECT  *  FROM table) GROUP BY "COL_A"

Why this is a problem

Because of this, it is not possible to rename a column. The following Python code, for example:

session.table("table").group_by(col("event_day")).agg(
    to_date(col("event_date")).as_("event_day"), count("*")
).collect()

Would ideally result in the first query shown below, but actually results in the second query, which gives an error, because the first select statement (EVENT_DAY) is undefined. It is also not possible to rename the column in the group_by as this is not allowed.

SELECT to_date("EVENT_DATE") AS "EVENT_DAY", count(1) AS "COUNT(LITERAL())" FROM ( SELECT  *  FROM table) GROUP BY "EVENT_DAY"
SELECT "EVENT_DAY", to_date("EVENT_DATE") AS "EVENT_DAY", count(1) AS "COUNT(LITERAL())" FROM ( SELECT  *  FROM table) GROUP BY "EVENT_DAY"

Work-around

The way to circumvent this is to start with a select statement:

session.table("table").select(
    to_date(col("event_date")).as_("event_day")
).group_by(col("event_day")).agg(count("*").as_("num")).collect()

but this results in a nested select statement, which is slower

SELECT "EVENT_DAY", count(1) AS "NUM" FROM ( SELECT to_date("EVENT_DATE") AS "EVENT_DAY" FROM table) GROUP BY "EVENT_DAY"

Apr 06 '23 14:04 dvorst

thanks for your feedback, at a glance the field EVENT_DAY is not needed in the generated sql.

@sfc-gh-yixie @sfc-gh-jdu do you think it's a case that we can improve?

Apr 07 '23 20:04 sfc-gh-aling

My understanding is usually when we group by a column the result has the group by column and the aggregate column. Your workaround is a good workaround. Another way to do it is:

session.table("table").group_by(col("event_day")).agg(
    to_date(col("event_date")).as_("event_day"), count("*")
)

# rename
session.table("table").group_by(col("event_date")).agg(
    to_date(col("event_date")), count("*")
).with_column_renamed("event_date", "event_day")

May 11 '23 04:05 sfc-gh-yixie