fix or clarify general/widespread typing issues with inclusion of akutil DataFrame and Series objects

Open kellyjoy15 opened this issue 3 years ago • 4 comments

Ideally, all functions should be able to take Series objects as input and apply the function to the underlying arkouda array (pdarray, categorical, strings, etc.)

Right now, it seems there are 3 kinds of functions:

- arkouda-style functions/methods that only take "raw" arkouda arrays as input
- pandas-style functions/methods that take DataFrame or Series objects as input
- hybrids that take either one?

If there is going to be a split between arkouda-style functions and pandas-style functions/methods, it needs to be obvious which is which. Otherwise, arkouda should be able to work out that it needs to look at the values attribute of the Series I give it as input.
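To make the "look at the values attribute" idea concrete, here is a minimal sketch of a decorator that unwraps Series-like inputs before calling an array-only function. It is plain Python and does not use arkouda: `FakeSeries`, `unwrap_series`, and `total` are hypothetical names made up for illustration, not part of the arkouda API.

```python
import functools


class FakeSeries:
    """Hypothetical stand-in for a Series: wraps raw data in `.values`."""

    def __init__(self, values):
        self.values = values


def unwrap_series(func):
    """Pass the underlying `.values` to `func` whenever a FakeSeries is given."""

    def unwrap(arg):
        # Only unwrap our Series type; everything else passes through untouched.
        return arg.values if isinstance(arg, FakeSeries) else arg

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        new_args = [unwrap(a) for a in args]
        new_kwargs = {k: unwrap(v) for k, v in kwargs.items()}
        return func(*new_args, **new_kwargs)

    return wrapper


@unwrap_series
def total(arr):
    # Stand-in for an "arkouda-style" function that only understands raw arrays.
    return sum(arr)
```

With something like this in place, `total([2, 2, 3, 5])` and `total(FakeSeries([2, 2, 3, 5]))` would behave the same way, so the caller would not need to know which style a function was written in.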

kellyjoy15 avatar May 03 '22 21:05 kellyjoy15

In some instances having the varying styles is intentional, but I would venture to say not all. At a minimum, I agree that it should be clear, and we should document which functions are designed to work in which way if this variation is going to continue. I am also of the opinion that a review of the functionality in question, and possibly updates, may be warranted in some situations.

@kellyjoy15 - can you give a specific function that you are working with that you would like to pass in a Series and have the computation performed on the values?

Ethan-DeBandi99 avatar May 05 '22 13:05 Ethan-DeBandi99

I'd also refer you to my reproducer for #1347 to show the general confusion over various options, noting different error messages for almost every attempt.

Here's another thing to try:

a = ak.array([True, False, True, True])
d = ak.array([2,2,3,5])
df = ak.DataFrame({'a':a,'d':d})

# In pandas, I would do the following:
df.d.value_counts()
# AttributeError: 'pdarray' object has no attribute 'a'

df['d'].value_counts()
# AttributeError: 'pdarray' object has no attribute 'value_counts'

ak.Series(df['d']).value_counts()
# TypeError: type of argument 'pda' must be arkouda.pdarrayclass.pdarray; got numpy.int64 instead

ak.value_counts(ak.Series(df['d']))
# TypeError: type of argument "pda" must be arkouda.pdarrayclass.pdarray; got arkouda.series.Series instead

ak.value_counts(df['d'])
# got something finally, but it's not the format I wanted

ak.Series(ak.value_counts(df['d']))

I might also try this with an IPv4 column:

df['ip'].value_counts()
# AttributeError: 'IPv4' has no attribute 'value_counts'

ak.Series(df['ip']).value_counts()
# AttributeError: 'str' object has no attribute 'size'

ak.value_counts(df['ip'])
# OK. I got something, but not in the format I wanted

ak.Series(ak.value_counts(df['ip']))
# ugh. Now it's a series, but it's forgotten that it was an IPv4

s = ak.Series(ak.value_counts(df['ip']))
idx, counts = s.index, s.values
new_index = ak.IPv4(idx)
# TypeError: Argument must be int64 pdarray

idx.dtype
# AttributeError: 'Index' object has no attribute 'dtype'

idx.values.dtype
# AttributeError: 'Index' object has no attribute 'values'

idx.index.dtype
# dtype('uint64')

idx_vals = ak.cast(idx.index, dtype='int64')
# TypeError: missing required argument: 'dt'

idx_vals = ak.cast(idx.index, dt='int64')
new_index = ak.IPv4(idx_vals)
ak.Series((new_index, counts))
# finally

kellyjoy15 avatar May 05 '22 15:05 kellyjoy15

I don't know what the solution is, but I am trying to illustrate that it has become incredibly painful to do most things. It seemed much better (to me) when DataFrames and Series were logically separate from arkouda proper.

One potential solution would be for df['columnname'] to return a Series object, as in pandas. Then, I could just work in DataFrame/Series land instead of having to go back and forth all of the time. Special types (IPs, bitvectors, etc.) could be dealt with at the Series level. Right now, there are (at least) three levels of things: the underlying numeric data (dtype = int64, float, uint64, bool, etc.), the "special" types like bitvector and IPv4, and the Series/DataFrame interfaces. Perhaps some rethinking and refactoring is in order... I see so many unique error messages coming (I assume) from varying levels of data representation that something needs to be done.
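As a rough illustration of that proposal, the sketch below shows a frame whose column lookup returns a Series-like object, so that `value_counts()` is available directly on `df['d']` and `df.d`. It is pure Python over lists, not arkouda code: `MiniSeries` and `MiniFrame` are hypothetical classes invented for this example, and the special-type handling (IPv4, bitvectors) is left out.

```python
from collections import Counter


class MiniSeries:
    """Hypothetical Series: an index plus values, both plain lists here."""

    def __init__(self, values, index=None):
        self.values = list(values)
        if index is None:
            index = range(len(self.values))
        self.index = list(index)

    def value_counts(self):
        # Count occurrences and return another MiniSeries keyed by distinct value.
        counts = Counter(self.values)
        keys = sorted(counts)
        return MiniSeries([counts[k] for k in keys], index=keys)


class MiniFrame:
    """Hypothetical DataFrame whose column lookup returns a MiniSeries."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        return MiniSeries(self._columns[name])

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so df.d maps to df['d'].
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None
```

Under these assumptions, building `MiniFrame({'a': [True, False, True, True], 'd': [2, 2, 3, 5]})` and calling `df['d'].value_counts()` or `df.d.value_counts()` would stay in Series land the whole way, with no detour through the raw array.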

kellyjoy15 avatar May 05 '22 15:05 kellyjoy15

Hey @kellyjoy15, I'm sorry this has been so frustrating! I'm hoping this will work more the way you expect once #1363 is resolved.

Reading through your example, it does seem like some of the imported ak.util functionality (Series, Index) could use some attention. Maybe we could update more functions to allow Series as input (adding it to groupable?).

But for now, two ways to get the information you want are:

>>> a = ak.array([True, False, True, True])
>>> d = ak.array([2,2,3,5])
>>> df = ak.DataFrame({'a':a,'d':d})
>>> ak.Series((ak.arange(df['d'].size), df['d'])).value_counts()
2    2
3    1
5    1
dtype: int64

>>> ak.GroupBy(df['d']).count()
(array([2 3 5]), array([2 1 1]))

I believe something similar would work for the second example, but it's hard to say without having df['ip'] to verify.

Related to #1347 and #1363

stress-tess avatar May 05 '22 22:05 stress-tess