Azure wrongly reads Parquet

Open maciejskorski opened this issue 3 years ago • 6 comments

Setup

Python=3.8 + azureml-core=1.36.0 + azureml-dataprep=2.26.0 + pyarrow=7.0.0 + pandas=1.4

Summary

to_pandas_dataframe reads certain Parquet datasets incorrectly: the data in some columns appears to be internally shuffled. This was already reported but closed without a fix, because the data could not be shared publicly. I share a reproducible example below.

How to reproduce

from azureml.core import Workspace, Dataset
import tempfile
import pandas as pd

# prepare data: list of sha-values with some None values
df = pd.read_csv('error_data.csv')

# configure Azure storage
ws = Workspace.from_config()
dstore = ws.datastores.get('your datastore')
dstore_path = 'relative datastore path'
target = (dstore,dstore_path)

# write to Azure storage
with tempfile.TemporaryDirectory() as tmpdir:
    df.to_parquet(f'{tmpdir}/df.parquet')
    ds=Dataset.File.upload_directory(tmpdir,target,overwrite=True)

# read in two ways: download and open in pandas, or use the Azure connector
with tempfile.TemporaryDirectory() as tmpdir:
    ds=Dataset.File.from_files(target)
    ds.download(tmpdir)
    df1 = pd.read_parquet(tmpdir)
    ds = Dataset.Tabular.from_parquet_files(target)
    df2 = ds.to_pandas_dataframe()

# comparison fails, the data seems displaced :-(
pd.testing.assert_frame_equal(df1,df2)

error_data.csv
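
When the comparison above fails, it can help to see which columns differ and whether the values are merely reordered rather than lost. The following is a minimal sketch in plain pandas, not part of the Azure SDK; the helper name `summarize_mismatches` and its output column names are hypothetical, and it assumes both frames have the same number of rows.

```python
import pandas as pd


def summarize_mismatches(expected: pd.DataFrame, actual: pd.DataFrame) -> pd.DataFrame:
    """Per shared column: count row-wise mismatches and check for pure reordering."""
    if len(expected) != len(actual):
        raise ValueError("frames must have the same number of rows")
    rows = []
    for col in expected.columns.intersection(actual.columns):
        left = expected[col].reset_index(drop=True)
        right = actual[col].reset_index(drop=True)
        # Two missing values at the same position count as equal.
        equal = (left == right) | (left.isna() & right.isna())
        # Same values and same number of missing entries, ignoring row order.
        same_multiset = (
            sorted(left.dropna().astype(str)) == sorted(right.dropna().astype(str))
            and int(left.isna().sum()) == int(right.isna().sum())
        )
        rows.append(
            {
                "column": col,
                "mismatched_rows": int((~equal).sum()),
                "same_multiset": same_multiset,
            }
        )
    return pd.DataFrame(rows).set_index("column")
```

Calling `summarize_mismatches(df1, df2)` on the two frames from the reproduction would return one row per column. A column with a nonzero `mismatched_rows` and `same_multiset` equal to True would be consistent with the displacement described here, although the check by itself does not prove the cause.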

maciejskorski avatar Mar 14 '22 15:03 maciejskorski

I'm encountering the same issue :((( I registered multiple parquet files as a Dataset in the workspace. When they are loaded as a dataframe using to_pandas_dataframe(), the values are displaced.

Li0425 avatar Mar 18 '22 09:03 Li0425

@Li0425 do you also have an example to reproduce?

maciejskorski avatar Mar 18 '22 17:03 maciejskorski

@maciejskorski I attempted to run your code with error_data.csv in AML studio but the issue cannot be reproduced (the two dataframes being compared are the same). The version of azureml.core that I'm using is 1.21.0

I created a dataset using AML's to_pandas_dataframe() in Oct 2021 - this dataset has displaced values. When I attempted again today, the output is actually correct. Maybe updating the version of the SDK that you are using could resolve the issue

Li0425 avatar Mar 21 '22 02:03 Li0425

@Li0425 so you have a different configuration, and an older one than mine (1.21 vs 1.36). You could try to reproduce in a tailored virtual env within a compute instance. Or could you share your own example along with a precise description of your Azure libraries?

maciejskorski avatar Mar 21 '22 05:03 maciejskorski

Thanks @Li0425 and @maciejskorski. This is a known issue in the arrow-rs crate that we depend on for Parquet reading. The good news is that it was recently fixed, and we have integrated the fix in azureml-dataprep==3.0.0. You can upgrade this package in your current environment, but until azureml-core==1.40.0 is released, a warning about incompatibility will be printed. As long as your azureml-core version is upgraded to 1.39.*, the warning can be safely ignored. An azureml-core release with a compatible version range for azureml-dataprep==3.0.0 should be out next week.
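
To check whether an environment already has the version mentioned above, something like the following sketch could be used. It relies only on the standard library (`importlib.metadata`, available from Python 3.8). The helper names `release_tuple`, `has_fix` and `installed_version` are hypothetical, and the 3.0.0 threshold is taken from this comment rather than verified independently.

```python
import re
from importlib import metadata

# Threshold taken from the comment above (azureml-dataprep==3.0.0); an assumption.
FIXED_DATAPREP = (3, 0, 0)


def release_tuple(version: str) -> tuple:
    """Turn '2.26.0' into (2, 26, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '3.0' compares equal to '3.0.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def has_fix(dataprep_version: str) -> bool:
    """True if the given azureml-dataprep version is at least 3.0.0."""
    return release_tuple(dataprep_version) >= FIXED_DATAPREP


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("azureml-core", "azureml-dataprep"):
    print(name, installed_version(name))
```

With the setup from the original report, `has_fix("2.26.0")` evaluates to False, which matches the versions where the problem was observed.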

anliakho2 avatar Mar 23 '22 18:03 anliakho2

@anliakho2 good news indeed. May I ask you to point us to the technical discussion of the root cause of this bug, such as the relevant arrow-rs docs and issues? I think it would be good to be able to see, or in case of doubt run, more precise tests before adopting the fix or claiming that it works.

maciejskorski avatar Mar 24 '22 08:03 maciejskorski