`loc` on MultiIndex row & column DataFrame returns wrong view of data

Open pyrito opened this issue 3 years ago • 1 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 12.2.1
  • Modin version (modin.__version__): 0.15.2
  • Python version: 3.9.12
  • Code we can use to reproduce:
import modin.pandas as pd
import pandas

multi_index = pd.MultiIndex.from_tuples(
    [("r0", "rA"), ("r1", "rB")], names=["Courses", "Fee"]
)
cols = pd.MultiIndex.from_tuples(
    [
        ("Gasoline", "Toyota"),
        ("Gasoline", "Ford"),
        ("Electric", "Tesla"),
        ("Electric", "Nio"),
    ]
)
data = [[100, 300, 900, 400], [200, 500, 300, 600]]
df = pd.DataFrame(data, columns=cols, index=multi_index)
pdf = pandas.DataFrame(data, columns=cols, index=multi_index)

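# ("r0") is just the string "r0", so this is a partial key on the row MultiIndex.
pdf.loc[("r0"), ("Gasoline", "Toyota")]  # pandas: Series indexed by the remaining "Fee" level
df.loc[("r0"), ("Gasoline", "Toyota")]  # Modin: returns the scalar 100 instead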

Describe the problem

pandas returns this output:

Fee
rA    100
Name: (Gasoline, Toyota), dtype: int64

whereas Modin gives us this:

100

The root cause seems to be how we handle `loc` lookups for MultiIndex DataFrames in the first place, and it appears to be different from the other MultiIndex issues we've had to deal with.
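To pin down the expected behaviour, here is a minimal pandas-only sketch of the lookup from the reproducer. It assumes plain pandas as the reference and does not exercise Modin; the variable names `rows` and `expected` are illustrative only.

```python
import pandas as pd

rows = pd.MultiIndex.from_tuples(
    [("r0", "rA"), ("r1", "rB")], names=["Courses", "Fee"]
)
cols = pd.MultiIndex.from_tuples(
    [
        ("Gasoline", "Toyota"),
        ("Gasoline", "Ford"),
        ("Electric", "Tesla"),
        ("Electric", "Nio"),
    ]
)
data = [[100, 300, 900, 400], [200, 500, 300, 600]]
pdf = pd.DataFrame(data, columns=cols, index=rows)

# "r0" names only the first row level ("Courses"), so pandas keeps the
# remaining level ("Fee") as the index of the result.
expected = pdf.loc["r0", ("Gasoline", "Toyota")]

print(type(expected).__name__)  # Series
print(expected.index.name)      # Fee
```

A fix on the Modin side could use a check like this as a regression test, comparing the type and index of its result against the pandas one.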

Source code / logs

pyrito avatar Jul 19 '22 14:07 pyrito

I did a bit of investigation here, and it looks like `ndim` is computed as 0 for this case, which appears to be incorrect. We may have the wrong abstraction for the MultiIndex case.
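As a rough illustration of why 0 is the wrong dimensionality here, the sketch below compares a partial row key with a full one in plain pandas. `expected_ndim` is a hypothetical helper written for this comment, not Modin code, and it only covers scalar labels (no slices or lists).

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_tuples(
    [("r0", "rA"), ("r1", "rB")], names=["Courses", "Fee"]
)
cols = pd.MultiIndex.from_tuples(
    [("Gasoline", "Toyota"), ("Gasoline", "Ford"),
     ("Electric", "Tesla"), ("Electric", "Nio")]
)
pdf = pd.DataFrame(
    [[100, 300, 900, 400], [200, 500, 300, 600]], columns=cols, index=rows
)
col = ("Gasoline", "Toyota")  # full column key: selects exactly one column


def expected_ndim(row_key, index):
    """Hypothetical helper: ndim of the result for a scalar-label row key,
    assuming the column key already selects a single column."""
    n_labels = len(row_key) if isinstance(row_key, tuple) else 1
    # Only a key that names every row level collapses to a scalar.
    return 0 if n_labels == index.nlevels else 1


partial = pdf.loc["r0", col]       # one of two row levels given
full = pdf.loc[("r0", "rA"), col]  # both row levels given

print(np.ndim(partial), expected_ndim("r0", pdf.index))
print(np.ndim(full), expected_ndim(("r0", "rA"), pdf.index))
```

In this sketch the result only drops to `ndim == 0` when the key covers every level of the row index, which the key `"r0"` in the reproducer does not.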

pyrito avatar Jul 19 '22 14:07 pyrito

I ran into the same problem.

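import numpy as np
import modin.pandas as pd

# Two-level row index; unstacking level 0 ("qq") produces MultiIndex columns.
index = pd.MultiIndex.from_tuples(zip(list('zxzx'), [0,1,2,4]), names=['qq','ww'])
df2 = pd.DataFrame(np.arange(16).reshape(-1, 4), index=index, columns=list('abcd'))
df3 = df2.unstack(0)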

# modin
df3.loc[:, ['a']]
# pandas
df3._to_pandas().loc[:, ['a']]

# modin
df3.loc[:, [('a', 'x'), ('a', 'z')]]

# pandas
df3._to_pandas().loc[:, [('a', 'x'), ('a', 'z')]]

# modin
df3[['a']]

# pandas
df3._to_pandas()[['a']]
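For reference, here is a minimal pandas-only sketch of what these three selections are expected to return. It assumes plain pandas as the baseline and does not exercise Modin; the names `by_label`, `by_tuples` and `by_getitem` are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    list(zip("zxzx", [0, 1, 2, 4])), names=["qq", "ww"]
)
df2 = pd.DataFrame(np.arange(16).reshape(-1, 4), index=index, columns=list("abcd"))
df3 = df2.unstack(0)  # columns become a MultiIndex: (letter, qq value)

# A partial label on the first column level selects every column under it.
by_label = df3.loc[:, ["a"]]
# The same two columns, spelled out as full tuples.
by_tuples = df3.loc[:, [("a", "x"), ("a", "z")]]
# Plain __getitem__ with a list behaves like the partial-label loc lookup.
by_getitem = df3[["a"]]

print(by_label.columns.tolist())
print(by_label.equals(by_tuples), by_label.equals(by_getitem))
```

In pandas all three return the same two-column frame, so any difference in the Modin results for these calls points to the column lookup on MultiIndex columns.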

eromoe avatar Dec 08 '22 03:12 eromoe