Transform is not working on 0.3.x

Open petrkozorezov opened this issue 3 years ago • 7 comments

Mix.install([{:explorer, "0.3.1"}])
Explorer.DataFrame.new(a: ["a", "b"], b: [1, 2])
|> Explorer.DataFrame.mutate_with(&[c: Explorer.Series.transform(&1[:b], fn p -> p end)])
** (RuntimeError) cannot perform operation on an Explorer.Backend.LazySeries
    (explorer 0.3.1) lib/explorer/backend/lazy_series.ex:451: Explorer.Backend.LazySeries.transform/2
    (explorer 0.3.1) lib/explorer/series.ex:2093: Explorer.Series.transform/2
    (stdlib 3.17) erl_eval.erl:685: :erl_eval.do_apply/6
    (stdlib 3.17) erl_eval.erl:893: :erl_eval.expr_list/6
    (stdlib 3.17) erl_eval.erl:237: :erl_eval.expr/5
    (stdlib 3.17) erl_eval.erl:229: :erl_eval.expr/5
    (explorer 0.3.1) lib/explorer/data_frame.ex:1398: Explorer.DataFrame.mutate_with/2

petrkozorezov avatar Sep 18 '22 10:09 petrkozorezov

Correct. mutate_with now performs a lazy operation and we cannot perform a transform lazily. I think this may work:

df = Explorer.DataFrame.new(a: ["a", "b"], b: [1, 2])
c = Explorer.Series.transform(df[:b], fn p -> p end)
Explorer.DataFrame.mutate(df, c: c)

However, this will stop working on v0.4. We could allow it to work but it means people can accidentally write eager operations when they should be lazy. I think we need to introduce a specific API for replacing one or more columns in a dataframe. @cigrainger, do you have any suggestions? I can think of two:

  1. Explorer.DataFrame.replace(df, c: Explorer.Series.transform(df[:b], fn p -> p end)) - it works pretty much as mutate does today, but it is eager
  2. Explorer.DataFrame.put(df, :c, Explorer.Series.transform(df[:b], fn p -> p end)) - since we already implement the Access protocol

@cigrainger / @philss / @kimjoaoun thoughts?

josevalim avatar Sep 18 '22 10:09 josevalim

Similar problem.

How can I know which operations are available to be performed lazily?

Following https://hexdocs.pm/explorer/Explorer.DataFrame.html#mutate_with/2,

"This function is similar to mutate/2, but allows complex operations to be performed, since it uses a virtual representation of the dataframe. The only requirement is that a series operation is returned."

I don't understand what "The only requirement is that a series operation is returned" means.

df = Explorer.DataFrame.new(%{a: [1, 2], b: [3, 4]})

df
|> Explorer.DataFrame.mutate_with(&%{
  ab: Explorer.Series.concat(&1[:a], &1[:b])
})
** (RuntimeError) cannot perform operation on an Explorer.Backend.LazySeries
    (explorer 0.3.1) lib/explorer/backend/lazy_series.ex:451: Explorer.Backend.LazySeries.concat/2
    (elixir 1.14.0) lib/enum.ex:2468: Enum."-reduce/3-lists^foldl/2-0-"/3
    /Users/json/workspace/project/unus/carrier_umbrella/notebooks/data_transform_poc.livemd#cell:2fjiiwzf3zehfshdx7ui2o7pdb6bthrq:5: (file)
    /Users/json/workspace/project/unus/carrier_umbrella/notebooks/data_transform_poc.livemd#cell:2fjiiwzf3zehfshdx7ui2o7pdb6bthrq:4: (file)

nallwhy avatar Oct 08 '22 14:10 nallwhy

Oh, it is just not implemented!

https://github.com/elixir-nx/explorer/blob/main/lib/explorer/backend/lazy_series.ex#L450

  # The following functions are not implemented yet and should raise if used.
  funs = [
    {:concat, 2},
    {:fetch!, 2},
    {:mask, 2},
    {:from_list, 2},
    {:sample, 4},
    {:size, 1},
    {:slice, 2},
    {:take_every, 2},
    {:to_enum, 1},
    {:to_list, 1},
    {:transform, 2}
  ]
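
A list like the one above can drive compile-time generation of stub functions that raise when called. The following is a minimal sketch of that pattern, not Explorer's actual source: the module name `ToyLazyBackend`, the shortened list, and the error message are all made up for illustration.

```elixir
defmodule ToyLazyBackend do
  # Hypothetical module for illustration; not Explorer's actual source.
  # Each {name, arity} pair becomes a function that raises when called.
  unimplemented = [concat: 2, to_list: 1, transform: 2]

  for {name, arity} <- unimplemented do
    # Build `arity` placeholder arguments for the generated function head.
    args = Macro.generate_arguments(arity, __MODULE__)

    def unquote(name)(unquote_splicing(args)) do
      # Reference the arguments so the compiler emits no unused-variable warnings.
      _ = unquote(args)

      raise "cannot perform #{unquote(name)}/#{unquote(arity)} on a lazy series"
    end
  end
end

# Calling any generated stub raises a RuntimeError with the message above.
message =
  try do
    ToyLazyBackend.transform(:some_series, fn x -> x end)
  rescue
    e in RuntimeError -> e.message
  end

IO.puts(message)
```

Including the function name and arity in the message is one way to make such an error easier to act on than a generic "cannot perform operation" text.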

nallwhy avatar Oct 08 '22 14:10 nallwhy

Yes, I improved the error message. It is not implemented yet, a PR is welcome!

josevalim avatar Oct 08 '22 15:10 josevalim

@josevalim I'm trying to implement Explorer.Series.concat/2, using Explorer.Series.coalesce/2 as a reference.

#366

I have one question.

Should a LazySeries only be able to operate with another LazySeries? (Is Series + LazySeries => Series (eager) not allowed?)

For example, Explorer.Series.coalesce/2 does not allow mixing a Series and a LazySeries.

df = Explorer.DataFrame.new(%{a: [1, nil, 3]})

df
|> Explorer.DataFrame.mutate_with(&%{b: 
  Explorer.Series.coalesce(Explorer.Series.from_list([1, nil, 3]), &1[:a])
})
** (ErlangError) Erlang error: :invalid_struct
    (explorer 0.3.1) Explorer.PolarsBackend.Native.s_coalesce(shape: (3,)
Series: '' [i64]
[
	1
	null
	3
], %Explorer.Backend.LazySeries{op: :column, args: ["a"], aggregation: false, window: false})
    (explorer 0.3.1) lib/explorer/polars_backend/shared.ex:17: Explorer.PolarsBackend.Shared.apply_series/3
    #cell:pfpfgga4zpyttonfkjhmgxyguftlfsbl:4: (file)
    #cell:pfpfgga4zpyttonfkjhmgxyguftlfsbl:3: (file)

nallwhy avatar Oct 09 '22 04:10 nallwhy

I think concat can work on non-lazy series too, similar to how addition works, but the result must always be a lazy series.
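
As a rough illustration of that rule, here is a toy model in which an operation accepts both eager values and lazy nodes but always returns a lazy node. The `ToyLazy` module and its functions are hypothetical. Explorer's real LazySeries carries more fields and is compiled to a Polars expression rather than evaluated in Elixir like this.

```elixir
defmodule ToyLazy do
  # Hypothetical toy model; Explorer's real LazySeries has more fields.
  defstruct [:op, :args]

  # A lazy reference to a column, resolved only at evaluation time.
  def col(name), do: %__MODULE__{op: :column, args: [name]}

  # Operands may be lazy nodes or eager lists; the result is always lazy.
  def concat(left, right), do: %__MODULE__{op: :concat, args: [left, right]}

  # Walk the expression tree against concrete columns.
  def eval(%__MODULE__{op: :column, args: [name]}, columns),
    do: Map.fetch!(columns, name)

  def eval(%__MODULE__{op: :concat, args: [left, right]}, columns),
    do: eval(left, columns) ++ eval(right, columns)

  # Eager values evaluate to themselves.
  def eval(list, _columns) when is_list(list), do: list
end

# Mixing an eager list with a lazy column reference still yields a lazy node.
expr = ToyLazy.concat([1, nil, 3], ToyLazy.col("a"))

# Nothing is computed until the expression is evaluated against real data.
result = ToyLazy.eval(expr, %{"a" => [4, 5]})
IO.inspect(result)
```

An arbitrary Elixir function, as passed to transform, has no such node representation that a backend could translate, which is one way to see why transform cannot be made lazy.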

josevalim avatar Oct 09 '22 07:10 josevalim

Oh, I'm starting to understand the concept of lazy series. Thanks!

nallwhy avatar Oct 09 '22 08:10 nallwhy

How about closing this issue and following up in #381?

nallwhy avatar Oct 21 '22 14:10 nallwhy

I think this is a separate problem. We can't really support transform for lazy series. :)

josevalim avatar Oct 21 '22 14:10 josevalim

Just to answer the question, I think the second option looks better:

Explorer.DataFrame.put(df, :c, Explorer.Series.transform(df[:b], fn p -> p end))

With put you can add or replace a column.
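
A fuller sketch of that eager workflow follows. It assumes an Explorer release that ships `Explorer.DataFrame.put/3`; the version requirement below is an assumption, and the transform functions are arbitrary examples.

```elixir
# Assumption: this version requirement resolves to a release that includes put/3.
Mix.install([{:explorer, "~> 0.5"}])

alias Explorer.{DataFrame, Series}

df = DataFrame.new(a: ["a", "b"], b: [1, 2])

# transform runs eagerly on a concrete series, outside any lazy callback.
c = Series.transform(df[:b], fn p -> p * 10 end)

# Add a new column :c.
df = DataFrame.put(df, :c, c)

# Replace the existing column :b.
df = DataFrame.put(df, :b, Series.transform(df[:b], fn p -> p + 1 end))

IO.inspect(Series.to_list(df[:b]))
IO.inspect(Series.to_list(df[:c]))
```

Because the series is computed before it is passed to put, no lazy series is ever involved in the transform call.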

philss avatar Oct 21 '22 17:10 philss

Closing in favor of #414.

josevalim avatar Nov 18 '22 06:11 josevalim