
pandas DataFrame.to_parquet gets killed while writing to hdfs3

Open eromoe opened this issue 7 years ago • 5 comments

My code is like below:


from os.path import dirname

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host=HDFS_HOST, port=HDFS_PORT)
df = hdfs.read_stockquantitylogs(input_path)  # helper defined elsewhere
df = process_df(df, process_stock_quantity_log, stack_hour=False)

output_path = input_path.replace('/arch', '/clean', 1)

hdfs.makedirs(dirname(output_path))
with hdfs.open(output_path, 'wb') as f:
    df.to_parquet(f)

I'm not using dask for now, just pandas. Here df is [31909929 rows x 3 columns]. I found that if I write 1000 rows, it works, but it prints Killed when writing the whole df.

eromoe avatar Aug 23 '18 02:08 eromoe
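Since writing 1000 rows at a time worked, one workaround is to stream the frame in slices as parquet row groups, so only one slice is buffered at a time. A minimal sketch, assuming pyarrow is installed and reusing hdfs, df, and output_path from the snippet above; CHUNK_ROWS is an illustrative name, and whether this actually sidesteps the hdfs3 crash is untested:

import pyarrow as pa
import pyarrow.parquet as pq

CHUNK_ROWS = 1_000_000  # tune to available memory

with hdfs.open(output_path, 'wb') as f:
    writer = None
    for start in range(0, len(df), CHUNK_ROWS):
        # each slice becomes one parquet row group in the same file
        table = pa.Table.from_pandas(df.iloc[start:start + CHUNK_ROWS])
        if writer is None:
            writer = pq.ParquetWriter(f, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()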

I changed to put, still got killed.

eromoe avatar Aug 23 '18 06:08 eromoe

I changed to https://hdfscli.readthedocs.io/ and the error is gone, though I have to save the file locally and then upload it to HDFS.

eromoe avatar Aug 23 '18 08:08 eromoe
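A hedged sketch of that hdfscli route, writing locally and then uploading; the WebHDFS URL, user, and local staging path are placeholders:

from hdfs import InsecureClient  # the hdfscli package

client = InsecureClient('http://namenode:50070', user='hadoop')  # placeholder URL and user

local_path = '/tmp/output.parquet'  # placeholder staging path
df.to_parquet(local_path)           # write the parquet file locally first
client.upload(output_path, local_path, overwrite=True)  # then push it to HDFS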

It is hard to diagnose what "Killed" might mean - presumably some buffer overrun in the C layer. It would be interesting to know the data size of the file you are trying to write. You might want to try arrow's hdfs interface, which seems to be less error-prone.

martindurant avatar Aug 23 '18 13:08 martindurant
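For reference, arrow's hdfs interface at the time was pyarrow.hdfs.connect (since deprecated in favor of pyarrow.fs.HadoopFileSystem). A sketch reusing the names from the original snippet:

import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect(host=HDFS_HOST, port=HDFS_PORT)

# convert once, then let pyarrow handle the write through its own HDFS driver
table = pa.Table.from_pandas(df)
with fs.open(output_path, 'wb') as f:
    pq.write_table(table, f)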

Can't hdfs3 catch such errors?

eromoe avatar Aug 24 '18 05:08 eromoe

This message is not being created by hdfs3, but by the OS when the process does something illegal. It will be happening in the C layer (Python itself produces nicer messages), so there is no opportunity for Python to catch it. You could perhaps invoke gdb to find out what happened, but such investigations are very hard.

martindurant avatar Aug 24 '18 12:08 martindurant
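One hedged way to confirm the process is being SIGKILLed (for example by the kernel's OOM killer) rather than failing inside Python is to run the write in a child process and inspect its return code; write_parquet.py is a hypothetical script containing the failing write:

import subprocess
import sys

result = subprocess.run([sys.executable, 'write_parquet.py'])  # hypothetical script
if result.returncode == -9:
    # on POSIX a negative return code is the signal number; -9 is SIGKILL,
    # which is what the OOM killer sends (check dmesg for oom-killer entries)
    print('child was killed by SIGKILL')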