Azure wrongly reads Parquet
Setup
Python=3.8 + azureml-core=1.36.0 + azureml-dataprep=2.26.0 + pyarrow=7.0.0 + pandas=1.4
Summary
to_pandas_dataframe reads certain Parquet datasets incorrectly: the data in some columns appears to be shuffled internally.
This was already reported, but the report was closed without a fix because of issues with sharing the data publicly.
I share a reproducible example below.
How to reproduce
from azureml.core import Workspace, Dataset
import tempfile
import pandas as pd
# prepare data: list of sha-values with some None values
df = pd.read_csv('error_data.csv')
# configure Azure storage
ws = Workspace.from_config()
dstore = ws.datastores.get('your datastore')
dstore_path = 'relative datastore path'
target = (dstore,dstore_path)
# write to Azure storage
with tempfile.TemporaryDirectory() as tmpdir:
    df.to_parquet(f'{tmpdir}/df.parquet')
    ds = Dataset.File.upload_directory(tmpdir, target, overwrite=True)
# read in two ways: download and open with pandas, or use the Azure connector
with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df1 = pd.read_parquet(tmpdir)
ds = Dataset.Tabular.from_parquet_files(target)
df2 = ds.to_pandas_dataframe()
# comparison fails, the data seems displaced :-(
pd.testing.assert_frame_equal(df1,df2)
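To narrow down which columns are affected, and whether they look merely reordered rather than corrupted, a per-column comparison can help. The following is only a sketch and not part of the original report: the helper name describe_mismatch is hypothetical, and it assumes both frames have the same shape, the same column names, and sortable values within each column.

```python
import pandas as pd


def describe_mismatch(expected: pd.DataFrame, actual: pd.DataFrame) -> dict:
    """Return {column: (rows_that_differ, looks_reordered)}.

    Hypothetical helper: assumes both frames have the same shape and columns.
    """
    report = {}
    for col in expected.columns:
        a = expected[col].reset_index(drop=True)
        b = actual[col].reset_index(drop=True)
        # Positions where both sides are missing count as equal.
        both_missing = a.isna() & b.isna()
        n_diff = int(((a != b) & ~both_missing).sum())
        # Same values overall (ignoring order) and same number of missing values?
        same_values = (
            sorted(a.dropna().tolist()) == sorted(b.dropna().tolist())
            and int(a.isna().sum()) == int(b.isna().sum())
        )
        report[col] = (n_diff, n_diff > 0 and same_values)
    return report


if __name__ == "__main__":
    # Toy frames standing in for df1 (pandas) and df2 (Azure connector).
    good = pd.DataFrame({"id": [1, 2, 3], "sha": ["a", None, "c"]})
    bad = pd.DataFrame({"id": [1, 2, 3], "sha": ["c", None, "a"]})
    print(describe_mismatch(good, bad))
```

With the frames from the example above, it would be called as describe_mismatch(df1, df2). A column flagged as reordered holds the same values in different positions, which matches the "displaced" symptom.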
I'm encountering the same issue :((( I registered multiple Parquet files as a Dataset in the workspace. When they are loaded as a dataframe using to_pandas_dataframe(), the values are displaced.
@Li0425 do you also have an example to reproduce?
@maciejskorski I attempted to run your code with error_data.csv in AML studio but the issue cannot be reproduced (the two dataframes being compared are the same). The version of azureml.core that I'm using is 1.21.0
I created a dataset using AML's to_pandas_dataframe() in Oct 2021, and that dataset has displaced values. When I tried again today, the output was actually correct. Maybe updating the version of the SDK that you are using could resolve the issue.
@Li0425 you have a different config then, and older than mine (1.21 vs 1.36). You could try to reproduce in a tailored virtual env within a compute instance. Or share your own example along with a precise description of your azure libraries?
Thanks @Li0425 and @maciejskorski. This is a known issue in the arrow-rs crate that we depend on for Parquet reading. The good news is that it was recently fixed, and we have integrated the fix in azureml-dataprep==3.0.0. You can proceed with upgrading this package in your current environment, but until azureml-core==1.40.0 is released, a warning about incompatibility will be printed. As long as your azureml-core version is upgraded to 1.39.*, the warning can be safely ignored. An azureml-core release with a compatible version range for azureml-dataprep==3.0.0 should come out next week.
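For anyone who wants to check their environment against the version mentioned above, here is a minimal sketch using only the standard library. The helper names are hypothetical, and the only threshold it encodes is the azureml-dataprep 3.0.0 release named in this comment; it does not verify the fix itself.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Version named in this thread as containing the fix (taken from the comment above).
FIXED_IN = (3, 0, 0)


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric part of a version string, padded to 3 parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    return parts + (0,) * max(0, 3 - len(parts))


def installed_version(package: str = "azureml-dataprep") -> Optional[str]:
    """Return the installed version of the package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_parquet_fix(version: Optional[str]) -> bool:
    """True if the given version is at least the release said to carry the fix."""
    return version is not None and release_tuple(version) >= FIXED_IN


if __name__ == "__main__":
    found = installed_version()
    print(f"azureml-dataprep: {found}, has fix: {has_parquet_fix(found)}")
```

This only compares version numbers; rerunning the reproduction script after upgrading remains the real test.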
@anliakho2 good news indeed. May I ask you to point us to the technical discussion of the root cause of this bug, such as the relevant arrow-rs docs and issues? I think it would be good to see, or in case of doubt to run, more precise tests before adopting the fix or claiming that it works.