get_group returns unexpected rows after sorting is applied

Open vanitu opened this issue 5 years ago • 2 comments

When group_by is applied to a sorted DataFrame, get_group returns the wrong rows.

require 'daru'

df = Daru::DataFrame.new(
  [
    10.times.collect { |i| i },                       # a: 0..9
    10.times.collect { |_i| "b" },                    # b: constant "b"
    10.times.collect { |i| i % 2 == 0 ? "c" : "d" },  # c: "c" on even rows, "d" on odd rows
  ],
  order: [:a, :b, :c]
)


# Works properly on the unsorted DataFrame
grouped = df.group_by([:b, :c])
grouped.get_group(["b", "c"])

=> #<Daru::DataFrame(5x3)>
       a   b   c
   0   0   b   c
   2   2   b   c
   4   4   b   c
   6   6   b   c
   8   8   b   c 

# Wrong rows are returned after the DataFrame is sorted
df.sort!([:c])
grouped = df.group_by([:b, :c])
grouped.get_group(["b", "c"])

=> #<Daru::DataFrame(5x3)>
       a   b   c
   0   0   b   c
   2   4   b   c
   4   8   b   c
   6   3   b   d
   8   7   b   d 

vanitu, Jun 16 '20 08:06

As I understand it, reindexing after sorting may help: df.index = Daru::Index.new(Array.new(df.size) { |i| i })
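
Why a reset could help can be seen in a small plain-Ruby model. This is not Daru code: `positional_lookup` and the arrays are hypothetical names, and the model assumes that get_group treats index labels as row positions. Under that assumption, lookups go wrong whenever labels and positions differ, and resetting the labels to 0, 1, 2, ... makes them agree again.

```ruby
# Plain-Ruby model, not Daru code. Assumption: the group lookup uses
# index labels as if they were row positions.
def positional_lookup(values, labels, &in_group)
  # Labels recorded for the rows that belong to the group.
  group = labels.each_index
                .select { |pos| in_group.call(values[pos]) }
                .map { |pos| labels[pos] }
  # Labels misused as positions into the values array.
  group.map { |label| values[label] }
end

# Column c after sorting: the rows moved and the labels travelled with them.
values = %w[c c c c c d d d d d]
stale  = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9]
fresh  = Array.new(values.size) { |i| i } # same reset as in the comment above

with_stale = positional_lookup(values, stale) { |v| v == "c" }
with_fresh = positional_lookup(values, fresh) { |v| v == "c" }

p with_stale # prints ["c", "c", "c", "d", "d"]
p with_fresh # prints ["c", "c", "c", "c", "c"]
```

The reset discards the original labels, so it only suits frames where those labels carry no meaning of their own.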

vanitu, Jun 16 '20 08:06

I'm running into a similar issue when rows are removed from a dataset with filter before calling group_by. It looks like get_group does not respect non-default row indices, so grouping only works if the rows carry the default index (zero-based, consecutive integers). I don't know the Daru internals well, but the issue appears to be here: https://github.com/SciRuby/daru/blob/v0.2.2/lib/daru/core/group_by.rb#L258-L267

The conversion of @context to elements throws away @context's original indices, and the references into elements.transpose assume that the indices are the defaults (i.e. 0, 1, 2, 3, ...).
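
That reading is consistent with the numbers in the original report. The sketch below is plain Ruby, not Daru's implementation, and every method name in it is made up for illustration. It models a frame as an array of rows plus a parallel array of index labels, then deliberately uses a group's labels as array positions. Doing so reproduces the reported column a values (0, 4, 8, 3, 7), while a label-aware lookup returns the expected rows.

```ruby
# Plain-Ruby model, not Daru code. All method names are hypothetical.

# Rows from the report: columns a, b, c with default labels 0..9.
def build_frame
  rows = Array.new(10) { |i| [i, "b", i.even? ? "c" : "d"] }
  [rows, (0...10).to_a]
end

# Model of df.sort!([:c]): rows are reordered and each row keeps its label.
def sort_by_c(rows, labels)
  order = (0...rows.size).sort_by { |pos| [rows[pos][2], pos] }
  [order.map { |pos| rows[pos] }, order.map { |pos| labels[pos] }]
end

# Model of group_by([:b, :c]): each group key maps to the labels of its rows.
def group_labels(rows, labels)
  rows.each_with_index
      .group_by { |row, _pos| [row[1], row[2]] }
      .transform_values { |pairs| pairs.map { |_row, pos| labels[pos] } }
end

# Suspected bug: labels are used directly as positions into the rows.
def rows_by_position(rows, wanted)
  wanted.map { |label| rows[label] }
end

# Label-aware lookup: translate each label to its current position first.
def rows_by_label(rows, labels, wanted)
  position_of = labels.each_with_index.to_h
  wanted.map { |label| rows[position_of[label]] }
end

rows, labels = sort_by_c(*build_frame)
wanted = group_labels(rows, labels)[["b", "c"]]

p wanted                                            # prints [0, 2, 4, 6, 8]
p rows_by_position(rows, wanted).map(&:first)       # prints [0, 4, 8, 3, 7]
p rows_by_label(rows, labels, wanted).map(&:first)  # prints [0, 2, 4, 6, 8]
```

The same model covers the filter case: once rows are dropped, the surviving labels are no longer consecutive, so using them as positions points at the wrong rows or past the end of the array.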

bradleybuda, Apr 30 '21 20:04