SNOW-1622441: Calling UDFs recreates deleted rows as `nan` and shuffles row values
Please answer these questions before submitting your issue. Thanks!
- What version of Python are you using?
Python 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]
- What are the Snowpark Python and pandas versions in the environment?
pandas==2.2.2 snowflake-snowpark-python==1.20.0
- What did you do?
Deleted a row from a table. When I select remaining rows from that table the deleted row gets recreated, and the values are shuffled.
>>> from snowflake.snowpark.functions import call_udf, col, lit
>>> from snowflake.snowpark.session import Session
>>>
>>>
>>> def add_one(val: int) -> int:
...     return val + 1
...
>>>
>>> session = Session.builder.config("local_testing", True).create()
>>> session.udf.register(add_one, name="add_one")
<snowflake.snowpark.mock._udf.MockUserDefinedFunction object at 0x149035d90>
>>>
>>> df = session.create_dataframe([(1),(2),(3)], schema=["a"])
>>> df.write.save_as_table("my_table", table_type="temporary")
>>>
>>> t = session.table("my_table")
>>> t.show()
-------
|"A"  |
-------
|1    |
|2    |
|3    |
-------
# row is correctly deleted
>>> t.delete(t["a"] == 1)
DeleteResult(rows_deleted=1)
>>> t.show()
-------
|"A"  |
-------
|2    |
|3    |
-------
# calling a udf recreates the deleted row with nan and shuffles the remaining values
>>> t.with_column("added", call_udf("add_one", col("a"))).show()
-----------------
|"A"  |"ADDED"  |
-----------------
|2    |4        |
|3    |nan      |
|nan  |3        |
-----------------
# `select` has the same result as `with_column`
>>> t.select(col("a"), call_udf("add_one", col("a")).alias("added")).show()
-----------------
|"A"  |"ADDED"  |
-----------------
|2    |4        |
|3    |nan      |
|nan  |3        |
-----------------
# `alias` is not the issue
>>> t.select(col("a"), call_udf("add_one", col("a"))).show()
--------------------------
|"A"  |"ADD_ONE(""A"")"  |
--------------------------
|2    |4                 |
|3    |nan               |
|nan  |3                 |
--------------------------
# udf is the issue because using a lit works
>>> t.select(col("a"), lit("blah").alias("added")).show()
-----------------
|"A"  |"ADDED"  |
-----------------
|2    |blah     |
|3    |blah     |
-----------------
- What did you expect to see?
Deleted row should not have been recreated with nan and rows should not be shuffled.
Hey @sfc-gh-jrose, I know you are working on this. I am new to the snowpark-python open source community and would like to fix this bug. I went through the dataframe classes but wasn't able to find where the data gets mixed up when the values are returned. If you could share some insight into what to look for, that would be great. Thanks in advance.
There have been a number of recent changes to fix bugs related to NaNs. This issue does not appear in the latest release, likely because of one of those changes. Closing this issue since it can no longer be reproduced.
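For anyone who hits similar symptoms on an older release: the nan pattern in the outputs above looks like what pandas does when two objects are combined by index label rather than by position. The sketch below is plain pandas, not Snowpark code, and it rests on two assumptions that this thread does not confirm: that the rows remaining after the delete keep their original index labels, and that the UDF result comes back with a fresh 0-based index. The names `table`, `added`, `misaligned`, and `aligned` are made up for illustration.

```python
import pandas as pd

# Rows left after deleting a == 1. Filtering keeps the original index
# labels (1 and 2). ASSUMPTION: this mimics how the surviving rows might
# be stored; it is not taken from Snowpark's internals.
table = pd.DataFrame({"A": [1, 2, 3]})
table = table[table["A"] != 1]

# Hypothetical UDF output built from the raw values, so it gets a fresh
# 0..n-1 index instead of the table's index labels.
added = pd.Series([val + 1 for val in table["A"]], name="ADDED")

# Combining by label uses the union of labels {0, 1, 2}: three rows,
# with NaN wherever a label exists on only one side.
misaligned = pd.concat([table, added], axis=1)
print(misaligned)

# Combining by position (reset the index first) keeps the two real rows.
aligned = pd.concat([table.reset_index(drop=True), added], axis=1)
print(aligned)
```

Under these assumptions, `misaligned` has three rows and two NaN cells, the same shape as the bad output in the report, while `aligned` has the expected two rows. If the cause really was along these lines, the place to look would be wherever a computed column is attached to a table whose index has gaps.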