Improve DataFrame Access Performance

Open brandon-neth opened this issue 1 year ago • 0 comments

With the PR to match Arkouda DataFrame indexing to Pandas (#3109), the indexing methods now return a Series (the data plus its row index in the DataFrame) rather than just the data. This has led to performance regressions both in the indexing methods themselves and in other methods that rely on them.

I'd like to open a discussion of how to address this performance problem. Some of the possible directions this could go:

Diverge from the Pandas API and return the data without the associated index
Make the return type configurable on the existing indexing methods
Add additional indexing methods that only return the data
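
To make the directions above concrete, here is a minimal sketch using a toy, pure-Python stand-in. None of these names come from Arkouda: `ToyFrame`, `ToySeries`, `get_column`, `as_series`, and `data` are all hypothetical and exist only to show the shape each option could take. It does not model the actual cost of building the index.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class ToySeries:
    """Hypothetical stand-in for a Series: the data plus its row index."""
    values: List[float]
    index: List[int]


class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not Arkouda's implementation."""

    def __init__(self, columns: Dict[str, List[float]]):
        self._columns = columns
        n = len(next(iter(columns.values()))) if columns else 0
        self._index = list(range(n))

    def __getitem__(self, key: str) -> ToySeries:
        # Pandas-style behavior: indexing returns data together with its index.
        # Option 1 would change this method to return self._columns[key] directly.
        return ToySeries(self._columns[key], self._index)

    def get_column(self, key: str, as_series: bool = True) -> Union[ToySeries, List[float]]:
        # Option 2: the caller chooses the return type on an existing-style method.
        if as_series:
            return self[key]
        return self._columns[key]

    def data(self, key: str) -> List[float]:
        # Option 3: a separate accessor that only ever returns the data.
        return self._columns[key]


frame = ToyFrame({"a": [1.0, 2.0, 3.0]})
series = frame["a"]                               # data plus index
raw_flag = frame.get_column("a", as_series=False) # data only, via a flag
raw_method = frame.data("a")                      # data only, via a new method
```

Option 2 keeps a single entry point but makes the return type depend on an argument, while option 3 keeps each method's return type fixed at the cost of a larger API surface.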

Some questions I have:

Do current Arkouda users have any use cases where the row indices are an important part of their application's functionality?
What criteria should we use to evaluate whether it's acceptable to deviate from the Pandas (or NumPy, or any other existing library) API?

Whichever direction we choose, I think it would also help with an unfortunate pattern that the indexing PR introduced: callers now have to access the .values field of the indexing result to get to the data.
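
For readers unfamiliar with the pattern, the sketch below shows it in Pandas itself, which is the behavior the indexing PR was matching. It is an illustration using the Pandas API only; the assumption is that the Arkouda equivalent looks similar, and the column names and values are made up.

```python
import pandas as pd

# Made-up example data with a non-default row index.
df = pd.DataFrame({"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]}, index=[100, 101, 102])

col = df["a"]      # indexing returns a Series: the data plus the row index
data = col.values  # the extra step needed to reach the underlying data alone

row_labels = list(col.index)
```

Code that only needs the data pays for the index on every access and then discards it through `.values`.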

Jun 06 '24 17:06 brandon-neth