
performance when serializing pandas DataFrames

Open · sdementen opened this issue 4 years ago · 11 comments

The function __get_cell_data (https://github.com/kz26/PyExcelerate/blob/dev/pyexcelerate/Worksheet.py#L227) operates on each cell individually. When serializing a pandas.DataFrame, the columns are most of the time of a single dtype and could benefit from a "columnar" approach (instead of the row-by-row, cell-by-cell approach) to speed things up:

  • the `if` statements could be evaluated only once per column
  • the conversion to string/XML could leverage pandas `apply`/`applymap`

Have you already thought about ways to improve this by keeping the "columnar" information further down the pipeline (instead of transforming everything into cells) for DataFrames? It is quite specific, yet it is a case many pandas users are hitting (slowness when exporting to Excel). A minimal sketch of what I mean is below.
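
Something like this (a hypothetical helper, not PyExcelerate's actual code):

```python
import pandas as pd

# Resolve the type check once per column, then convert the whole column
# in one vectorized pass instead of dispatching per cell.
def column_to_strings(series: pd.Series) -> list:
    if pd.api.types.is_datetime64_any_dtype(series):
        return series.dt.strftime("%Y-%m-%dT%H:%M:%S").tolist()
    if pd.api.types.is_numeric_dtype(series):
        # Numeric column: no per-cell isinstance() dispatch needed.
        return series.astype(str).tolist()
    # Mixed/object column: fall back to a per-cell conversion.
    return series.map(str).tolist()

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
columns = {name: column_to_strings(df[name]) for name in df.columns}
```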

sdementen · Mar 11 '21 05:03

Can you link to some discussions showing that this is something pandas users are hitting? I would like to see use cases where users are directly exporting data (i.e. not touching the pandas DataFrame) rather than modifying the DataFrame before exporting. I have seen this come up before, so it makes me wonder whether it's worth having a "direct to xlsx" function that encourages not touching the data.

Superficially, though, the feature sounds reasonable and a nice pinhole optimization. We have something similar where a bulk import stores the data as a dense table and future edits are layered on top of this initial table. I could see the feature here being: if the passed data table is typed, we store the column types, null them out if cells are edited to a different type, and use this as a lookup table.
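
A rough sketch of what that could look like (all names illustrative, not PyExcelerate's actual internals):

```python
# Illustrative sketch of the column-type-hint idea (hypothetical class,
# not PyExcelerate's internals).
class DenseTable:
    def __init__(self, data):
        self.data = [list(row) for row in data]
        # One type hint per column, computed once at bulk-import time;
        # None means the column is mixed and needs per-cell handling.
        self.column_types = []
        for col in zip(*self.data):
            first = type(col[0])
            self.column_types.append(
                first if all(isinstance(v, first) for v in col) else None
            )

    def set_cell(self, row, col, value):
        self.data[row][col] = value
        # Editing a cell to a different type nulls out the hint, sending
        # this column back to the slower per-cell path on export.
        hint = self.column_types[col]
        if hint is not None and not isinstance(value, hint):
            self.column_types[col] = None
```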

I don't think we would be able to leverage apply/applymap, though; that is too pandas-specific and would dramatically affect how PyExcelerate stores data.

kevmo314 · Mar 11 '21 06:03

You can Google it, e.g. https://www.google.com/search?safe=active&client=ms-android-samsung-ss&sxsrf=ALeKk0148RZ-MDnFz2BlbdOKsijf-Lnbbg%3A1615445522653&ei=Er5JYOqaJ5D7kwXBs6ToAw&q=pandas+to_excel+performance&oq=pandas+to_excel+performance

What do you mean by "modifying the data frame before exporting"? Typically, a user (like me) works with DataFrames and wants to save the final result to Excel so that other users, or the user themselves, can simply open the export in Excel (or so an app/webapp/dashboard can offer a "download Excel" button).

sdementen · Mar 11 '21 06:03

If what you mean is that the user does not export a DataFrame directly to Excel, but writes it to Excel and then modifies the file before saving, that is indeed not the case here. The user simply calls DataFrame.to_excel, which writes the Excel file directly, and then opens it in Excel.

sdementen · Mar 11 '21 07:03

For an approach based on a dense data table plus a layer that captures changes (which is what you do currently, if I read correctly between your lines), it would be great to also allow a DataFrame as the "dense table"; yet for the specific case of offering a faster alternative to to_excel, that is not required from day one. I like your thinking in terms of fastest/faster/fast/slow tiers, with a graceful degradation of performance depending on which features the user needs. Here it would look more like an "ultrafast" tier (works only on DataFrames) that falls back to the fastest track if the user needs more than the standard Excel export for DataFrames. In that case, the DataFrames could simply be converted to the current dense table and use the current system.

sdementen · Mar 11 '21 08:03

Some discussion on the topic of Excel/Python performance: https://github.com/tfussell/xlnt/issues/184

sdementen · Mar 11 '21 08:03

> If what you mean is that the user does not export a DataFrame directly to Excel, but writes it to Excel and then modifies the file before saving, that is indeed not the case here. The user simply calls DataFrame.to_excel, which writes the Excel file directly, and then opens it in Excel.

My understanding is that pandas doesn't support PyExcelerate as a writing engine?

kevmo314 · Mar 11 '21 16:03

Indeed, pandas does not support PyExcelerate as an engine. I have written a simple to_excel-like function based on PyExcelerate that is more than twice as fast as pandas' best engine, and I would hope that with some tweaking of PyExcelerate we could speed this up by another factor of 2. Preparing an Excel file in 2 seconds vs. 8 seconds makes a huge difference in user experience (for instance, when the user clicks a "download Excel" button in an app or webapp).
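
For reference, a minimal sketch of such a to_excel-like helper (a hypothetical function built on PyExcelerate's bulk new_sheet(data=...) path; the actual implementation may differ):

```python
import pandas as pd
from pyexcelerate import Workbook

def df_to_excel(df: pd.DataFrame, filename: str, sheet_name: str = "Sheet1"):
    """Write a DataFrame to .xlsx via PyExcelerate's bulk data path."""
    # Header row first, then the raw values; PyExcelerate serializes
    # the whole dense table in one pass on save().
    rows = [list(df.columns)] + df.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=rows)
    wb.save(filename)

df_to_excel(pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}), "out.xlsx")
```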

sdementen · Mar 11 '21 18:03

I am not too keen on adding pandas-specific functionality because PyExcelerate isn't an officially supported engine. Ultimately, PyExcelerate is used a lot outside pandas as well, and pandas is a very heavy dependency to integrate for a use case that they do not even directly support.

I think adding type hints for columns is appropriate; I'll look into a way to add those soon. But given that pandas does not support PyExcelerate, going in that direction would pigeonhole us too much into a specific use case.

kevmo314 · Mar 11 '21 18:03

OK. In parallel, I will try a proof of concept with pandas to see whether it brings the speed benefits I expect, and I will keep you updated.

sdementen · Mar 12 '21 05:03

A quick test (not 100% bulletproof) shows a 7.5x speedup (the final number may be a bit lower) when taking a columnar approach and leveraging pandas.apply. Very promising, and a game changer for pandas Excel exports!
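
For illustration, a rough micro-benchmark in the spirit of that test (a reconstruction under stated assumptions, not the code behind the 7.5x figure):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 20))

# Cell-by-cell: one conversion call per cell, as in the current path.
start = time.perf_counter()
cells = [[str(v) for v in row] for row in df.itertuples(index=False)]
t_cells = time.perf_counter() - start

# Columnar: one vectorized conversion per column, then transpose back
# to rows (roughly what a per-column pandas approach enables).
start = time.perf_counter()
cols = list(zip(*(df[c].astype(str).tolist() for c in df.columns)))
t_cols = time.perf_counter() - start

print(f"cell-by-cell: {t_cells:.2f}s  columnar: {t_cols:.2f}s")
```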

sdementen · Mar 13 '21 18:03

Right, I understand, but PyExcelerate is not a pandas-to-Excel library. Additionally, the pandas team has indicated they don't wish to add PyExcelerate support. I am not particularly inclined to couple PyExcelerate to pandas for a very specific use case, especially without any official pandas integration.

As I mentioned above, we can add column type hints, but the more I think about it, the less useful it seems: column headers will ruin the type consistency of the data table. Additionally, the type hint has to be computed at data-insertion time anyway, so it wouldn't yield much speedup.
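
For instance (illustrative snippet):

```python
rows = [["price", "qty"], [9.99, 3], [4.50, 7]]  # header row + data rows
col0 = [row[0] for row in rows]                  # ['price', 9.99, 4.5]
# The string header alone defeats a numeric type hint for this column.
```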

I'm more than happy to work with the pandas team to add integration, but without their support this isn't a good idea for the library despite the promising speedup.

kevmo314 · Mar 13 '21 18:03