PyExcelerate icon indicating copy to clipboard operation
PyExcelerate copied to clipboard

Handling NaN values

Open ghost opened this issue 5 years ago • 10 comments

Thank you for this great project!

Would it be better to replace NaN values with empty string instead of #NUM! ?

ghost avatar Apr 15 '20 16:04 ghost

Thanks for the ticket. Do you have an example/use case in mind?

I'm inclined to say no, because NaN is a number type in python, thus we should translate to a number type in Excel. Additionally I'm somewhat against silently cleaning up data as the user might not expect it, but a #NUM! is rather annoying to see so I might be able to be convinced otherwise.

kevmo314 avatar Apr 15 '20 16:04 kevmo314

I have a dataset holding numeric data is some fields (unknown in advance) with some NaN values. I need to apply pd.to_numeric(data, errors='ignore') before writing to excel. If I do not do so, excel does not treat them as numeric and sometime the data is too big to format manually in excel. I cannot get rid of NaN before passing to PyExcelerate, because empty string resets dtype to object. Would be nice to have an option to choose what to do with NaN in PyExcelerate. That would fulfill needs of different scenarios.

ghost avatar Apr 15 '20 19:04 ghost

Ah I see, thanks for the information. I will think of a good API to do this. We've received a couple similar feature requests in the past, but yours lays out explicitly why the current solution isn't great.

I'm still not quite a fan of making the output directly configurable, but I think there might be a way to do this with a "data view" type of object that transforms NaN to empty string before PyExcelerate sees it. I'll think about this a little more and get a solution for you.

kevmo314 avatar Apr 15 '20 19:04 kevmo314

Thank you!

ghost avatar Apr 15 '20 19:04 ghost

Ok I've thought about this issue more and I've struggled to come up with an ergonomic API as there are a lot of edge cases and unusual behaviors. I think this might be better done through a user-implemented transform.

For example, taking a pandas.DataFrame df, we can strip out the NaN's by enumerating it into a list

data = df.values.tolist()
nans_stripped = [['' if math.isnan(value) for value in row] for row in data]

This can then be passed to PyExcelerate:

wb.new_sheet("sheet name", data=nans_stripped)

This actually might end up being faster than directly passing df, as PyExcelerate has some optimizations around list objects. We don't have any pandas-specific optimizations.

kevmo314 avatar Jul 18 '20 03:07 kevmo314

Adding to the answer by @kevmo314:

The following adaptation solved the #NUM! issue for my dataframe (which is an idiosycratic blend of numeric, NaN, and string elements). If the above solution doesn't work for the reader (as it didn't in my case), perhaps this one will.

data = df.values.tolist()
strippedNans = [['' if str(val) == 'nan' else val for val in row] for row in data]
wb.new_sheet("sheet name", data=strippedNans)

arbreazeale avatar Jul 23 '20 21:07 arbreazeale

Here is my way. What I do is, get the null columns first then use the .fillna() function from Pandas.

null_cols = latest_df.columns[latest_df.isna().any()]
latest_df[null_cols] = df[null_cols].fillna("")

amzar96 avatar Oct 19 '20 12:10 amzar96