snowpark-python icon indicating copy to clipboard operation
snowpark-python copied to clipboard

SNOW-646407: Referencing columns per source table/dataframe after join.

Open frsann opened this issue 3 years ago • 1 comments

What is the current behavior?

When we do a join on two dataframes that happen to have columns that are named the same, Snowpark adds a seemingly random string prefix to either of the column names ( for example "r_gxpp_TIMEZONE"). There is no mention in docs on how to get a reference to this random prefix programmatically, so referencing this column after the join is cumbersome, to say the least.

The workaround to rename the columns before the join df_joined = df1.join(df2.select(df2.col('timezone').as_('timezone_df2')), df1.col('id') == df2.col('id')) can get convoluted quickly when working with multiple joins.

What is the desired behavior?

I would like to be able to reference the columns like

df_joined = df1.join(df2, df1.col('id') == df2.col('id')) 
df_joined.select(df1.col('timezone').as_('timezone1'), df2.col('timezone').as_('timezone2'))

The bare minimum would be to document this behaviour (and the workaround) as well as provide a way of deducing the prefix added to the conflicting columns.

How would this improve snowflake-snowpark-python?

This would make the syntax match closer SQL where you can do something like

select
  df1.timezone as timezone1,
  df2.timezone as timezone2
from table1 as df1
  join table2 as df2 on df1.id = df2.id 

References, Other Background

frsann avatar Aug 16 '22 08:08 frsann

Hey @frsann, this is on our radar and we plan to remove this random prefix.

sfc-gh-jdu avatar Aug 16 '22 19:08 sfc-gh-jdu

@frsann this already works today

    df1 = session.sql("select 1 as a, 2 as b")
    df2 = session.sql("select 1 as a, 2 as b")
    df3 = df1.join(df2, df1["a"] == df2["a"])
    df3 = df3.select(df1["a"].alias("a1"), df2["a"].alias("a2"))
    df3.show()

We'll add more information in the docs. Plus, we're adding a parameter to easily add a suffix to overlapping column names so it's not random.

sfc-gh-yixie avatar Oct 07 '22 16:10 sfc-gh-yixie