Sorting an empty DataFrame results in a runtime Polars error

Open eldano opened this issue 1 year ago • 3 comments

Attempting to sort a grouped dataframe that has no rows results in a runtime error:

alias Explorer.DataFrame
require Explorer.DataFrame

dataframe = DataFrame.new(a: ["a", "b", "c"])

dataframe
|> DataFrame.group_by("a")
|> DataFrame.filter(a == "d")
|> DataFrame.sort_by(a)

Output:

** (RuntimeError) Polars Error: cannot group_by + apply on empty 'DataFrame'
    (explorer 0.8.2) lib/explorer/polars_backend/shared.ex:79: Explorer.PolarsBackend.Shared.apply_dataframe/4
    #cell:2m6ajrb7ypgepmrw:3: (file)

eldano avatar Jun 07 '24 15:06 eldano

Thanks for the issue!

It appears this may have been an issue on the Polars side that they addressed:

  • https://github.com/pola-rs/polars/issues/12194

But that fix was released as part of Polars 0.35 (PR 12269):

  • https://github.com/pola-rs/polars/releases/tag/rs-0.35.0

We've got a later version of Polars, so I'll have to do some more digging later.

billylanchantin avatar Jun 07 '24 15:06 billylanchantin

It could be related to the order of the chained expressions:

require Explorer.DataFrame, as: DF

df = DF.new(a: ["a", "b", "c"])

# ❗ Doesn't work, as mentioned in the issue.
df
|> DF.group_by("a")
|> DF.filter(a == "d")
|> DF.sort_by(a)

# ✔️ This one works: filter and sort first, then group.
df |> DF.filter(a == "d") |> DF.sort_by(a) |> DF.group_by("a")

I usually cross-check against the Python API, and in its latest version this doesn't work either:

df.group_by("a").filter(pl.col("a").eq("d")).sort("a")

So my conclusion is that the order of the expressions is important.

ceyhunkerti avatar Sep 25 '24 22:09 ceyhunkerti

@ceyhunkerti I think this should still be permitted. For example:

import Explorer.DataFrame
require Explorer.DataFrame

df = new(a: ["a", "a", "b"])

# Broken
df |> group_by("a") |> filter(a == "d") |> sort_by(a)

# Works
df |> lazy() |> group_by("a") |> filter(a == "d") |> sort_by(a) |> compute()
# #Explorer.DataFrame<
#   Polars[0 x 1]
#   Groups: ["a"]
#   a string []
# >

AFAICT Polars group_by works a little differently. I believe they require aggregating before continuing work in most cases:

df.group_by("a").filter(False)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# AttributeError: 'GroupBy' object has no attribute 'filter'

billylanchantin avatar Sep 26 '24 03:09 billylanchantin