`loc` on MultiIndex row & column DataFrame returns wrong view of data
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 12.2.1
- Modin version (modin.__version__): 0.15.2
- Python version: 3.9.12
- Code we can use to reproduce:
import modin.pandas as pd
import pandas
multi_index = pd.MultiIndex.from_tuples(
    [("r0", "rA"), ("r1", "rB")], names=["Courses", "Fee"]
)
cols = pd.MultiIndex.from_tuples(
    [
        ("Gasoline", "Toyota"),
        ("Gasoline", "Ford"),
        ("Electric", "Tesla"),
        ("Electric", "Nio"),
    ]
)
data = [[100, 300, 900, 400], [200, 500, 300, 600]]
df = pd.DataFrame(data, columns=cols, index=multi_index)
pdf = pandas.DataFrame(data, columns=cols, index=multi_index)
pdf.loc[("r0"), ("Gasoline", "Toyota")]
df.loc[("r0"), ("Gasoline", "Toyota")] # Returns wrong value
Describe the problem
pandas returns this output:
Fee
rA 100
Name: (Gasoline, Toyota), dtype: int64
whereas Modin gives us this:
100
The root of the issue seems to lie in how we handle loc lookups for MultiIndex DataFrames in the first place, and it appears to be different from the other MultiIndex issues we've had to deal with.
Source code / logs
I did a bit of investigation here and it looks like ndim is computed as 0 for this case, which appears to be incorrect: pandas returns a Series here, so the result should be 1-dimensional. We may have the wrong abstraction for the MultiIndex case.
I ran into the same problem.
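To make the expected dimensionality concrete, here is a minimal pandas-only sketch built from the reproducer above. It assumes nothing about Modin internals; the variable names `partial` and `full` are just illustrative. A row key that fixes only the first index level should leave the remaining level in place (ndim 1), while a key that fixes both levels should yield a scalar (ndim 0).

```python
import numpy as np
import pandas

rows = pandas.MultiIndex.from_tuples(
    [("r0", "rA"), ("r1", "rB")], names=["Courses", "Fee"]
)
cols = pandas.MultiIndex.from_tuples(
    [
        ("Gasoline", "Toyota"),
        ("Gasoline", "Ford"),
        ("Electric", "Tesla"),
        ("Electric", "Nio"),
    ]
)
data = [[100, 300, 900, 400], [200, 500, 300, 600]]
pdf = pandas.DataFrame(data, columns=cols, index=rows)

# Partial row key: only the "Courses" level is fixed, so the "Fee" level
# survives as the index of a Series. Note that ("r0") is just the string "r0".
partial = pdf.loc["r0", ("Gasoline", "Toyota")]
print(type(partial).__name__, partial.ndim)

# Full row key: both row levels and both column levels are fixed,
# so pandas returns a single scalar.
full = pdf.loc[("r0", "rA"), ("Gasoline", "Toyota")]
print(np.ndim(full))
```

If this reading is right, the ndim calculation would need to depend on how many index levels the key actually covers, not only on whether the key looks scalar-like.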
import numpy as np
import modin.pandas as pd

df2 = pd.DataFrame(
    np.arange(16).reshape(-1, 4),
    index=pd.MultiIndex.from_tuples(zip(list('zxzx'), [0, 1, 2, 4]), names=['qq', 'ww']),
    columns=list('abcd'),
)
df3 = df2.unstack(0)
# modin
df3.loc[:, ['a']]
# pandas
df3._to_pandas().loc[:, ['a']]
# modin
df3.loc[:, [('a', 'x'), ('a', 'z')]]
# pandas
df3._to_pandas().loc[:, [('a', 'x'), ('a', 'z')]]
# modin
df3[['a']]
# pandas
df3._to_pandas()[['a']]
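For reference, a pandas-only sketch of what the three selections above are expected to return. This is a baseline under the assumption that plain pandas is the ground truth; the names `by_label`, `by_tuples`, and `by_getitem` are illustrative. In pandas all three forms pick the same two columns, ('a', 'x') and ('a', 'z').

```python
import numpy as np
import pandas

pdf2 = pandas.DataFrame(
    np.arange(16).reshape(-1, 4),
    index=pandas.MultiIndex.from_tuples(
        zip(list("zxzx"), [0, 1, 2, 4]), names=["qq", "ww"]
    ),
    columns=list("abcd"),
)
# Move the "qq" level into the columns, giving MultiIndex columns like ('a', 'x').
pdf3 = pdf2.unstack(0)

# Select by top-level column label via loc.
by_label = pdf3.loc[:, ["a"]]
# Select the same columns by their full tuples.
by_tuples = pdf3.loc[:, [("a", "x"), ("a", "z")]]
# Select by top-level label via __getitem__.
by_getitem = pdf3[["a"]]

print(list(by_label.columns))
print(by_label.equals(by_tuples), by_label.equals(by_getitem))
```

Comparing each Modin result (after `_to_pandas()`) against these frames with `equals` is one way to pin down which of the three access paths diverges.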


