arkouda icon indicating copy to clipboard operation
arkouda copied to clipboard

Different results from pandas when groupby `prod` aggregation overflows

Open stress-tess opened this issue 1 year ago • 4 comments

@drculhane found a place where we differ from pandas when doing a groupby prod aggregation that results in values larger than can be held in an int64. Pandas seems to wrap around and we convert to a float

>>> pddf = pd.DataFrame({'col1': pd.Series(np.full(10, 10, dtype='int')), 'col2': np.full(10,2**32, dtype='int')})
>>> pddf.groupby(['col1']).prod()
      col2
col1
10       0

>>> pddf = pd.DataFrame({'col1': pd.Series(np.full(10, 10, dtype='int')), 'col2': np.full(10,2**32 + 1, dtype='int')})
>>> pddf.groupby(['col1']).prod()
             col2
col1
10    42949672961

>>> akdf = ak.DataFrame({'col1': ak.full(10, 10, dtype='int'), 'col2': ak.full(10,2**32, dtype='int')})
>>> akdf.groupby('col1').prod()
              col2
col1
10    2.135987e+96 (1 rows x 1 columns)

>>> akdf = ak.DataFrame({'col1': ak.full(10, 10, dtype='int'), 'col2': ak.full(10,2**32 + 1, dtype='int')})
>>> akdf.groupby('col1').prod()
              col2
col1
10    2.135987e+96 (1 rows x 1 columns)

stress-tess avatar Apr 02 '24 19:04 stress-tess

I did a tiny bit of digging and this seems intentional... the question is why did we make this decision? https://github.com/Bears-R-Us/arkouda/blob/a84a35f5a734e85d36dd68f27b33c6c983d5dc02/arkouda/groupbyclass.py#L750-L752

https://github.com/Bears-R-Us/arkouda/blob/a84a35f5a734e85d36dd68f27b33c6c983d5dc02/src/ReductionMsg.chpl#L995-L1007

I don't really know why we always make the return type for segProd to be real? perhaps it's for this very reason to avoid overflows since we work with very large arrays?

stress-tess avatar Apr 02 '24 20:04 stress-tess

It seems it got changed back in PR https://github.com/Bears-R-Us/arkouda/pull/59, but there's no note of it in the description

stress-tess avatar Apr 02 '24 21:04 stress-tess

If casting to float is just to use the log trick, can we just cast back to int at the last step?

ajpotts avatar Apr 24 '24 22:04 ajpotts

Good thought, and I’m a bit unhappy with the log trick anyway. I don’t see why that’s a good thing, and I can picture situations where it’s a bad idea.

—- Andy

Andrew D. Culhane, Ph.D. Sagecor Solutions LLC

On Wed, Apr 24, 2024 at 6:14 PM ajpotts @.***> wrote:

If casting to float is just to use the log trick, can we just cast back to int at the last step?

— Reply to this email directly, view it on GitHub https://github.com/Bears-R-Us/arkouda/issues/3073#issuecomment-2075941639, or unsubscribe https://github.com/notifications/unsubscribe-auth/BGL3SKS57WFNYDAEIXBOBMDY7AVD3AVCNFSM6AAAAABFT76FGOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANZVHE2DCNRTHE . You are receiving this because you were mentioned.Message ID: @.***>

drculhane avatar Apr 24 '24 22:04 drculhane