EPPlus icon indicating copy to clipboard operation
EPPlus copied to clipboard

Asynchronously stream worksheet rows with IAsyncEnumerable without loading the entire worksheet into memory

Open virzak opened this issue 11 months ago • 8 comments

Is your feature request related to a problem? Please describe. We have multiple, potentially huge files with multiple worksheets that we need to process. Rows are sorted in the order that we need. Currently the only way to read a row is to load the entire file with:

using (var package = new ExcelPackage());
await package.LoadAsync(file);

or

using (var package = new ExcelPackage(file));

This operation can take a long time. It doesn't have System.IProgress argument, so the user is left hanging. It can also take a huge amount of memory.

Describe the solution you'd like

We need a solution which allows the user to read a single row without loading the entire file into the memory. In our case, we don't even need to jump between arbitrary rows, we simply need the rows consecutively. A signature could look like:

async IAsyncEnumerable<ExcelRange[]> ExcelWorksheet.GetAllRowsAsync()

This operation should return start returning rows immediately without loading nothing into except only the buffer needed to stream rows. Merged cells would only be returned once.

virzak avatar Feb 14 '25 01:02 virzak

This is currently not supported, but we might look at a new stream based solution that gives faster access. However, a solution like this will be limited, as there are a lot of dependencies on individual cells, with styles, formulas and and more, that requires the entire workbook to be loaded. I'll add this request as an enhancement.

JanKallman avatar Feb 14 '25 07:02 JanKallman

One way to look at this ask is to ensure the developer doesn't have to resort to csv file format just because its rows can be easily streamed. We would be using the streaming solution when dependencies aren't present or could be ignored.

virzak avatar Feb 14 '25 12:02 virzak

I did a little bit of research into other excel libraries. It seems like all other .NET libraries load the entire file into memory before allowing retrieval.

Python library openpyxl does provide a way to retrieve rows right away.

import openpyxl

wb = openpyxl.load_workbook("C:/mypath.xlsx", read_only=True)
sheet = wb.active
for row in sheet.iter_rows(values_only=True):
    print(row)

@JanKallman, just wanted to get a feeling for when you think this might be realistically implemented. No pressure, but if you think this is a couple of years away, we do have at least this alternative with Python.

virzak avatar Feb 24 '25 14:02 virzak

@virzak, We have this on our enhancement list, but it has not been planned or estimated yet. We will start to plan new features after EPPlus 8 has been release, so if we decide to start developing this I can give you a better estimate for when it will be available and what features it will contain then.

JanKallman avatar Mar 05 '25 08:03 JanKallman

This library can achieve reading range rows without the need to load them into memory as a whole: https://github.com/mini-software/MiniExcel

zhuangState avatar Jun 08 '25 04:06 zhuangState

目前不支持此功能,但我们可能会考虑一种新的基于流的解决方案,以提供更快的访问。但是,像这样的解决方案将是有限的,因为对单个单元格有很多依赖性,包括样式、公式等,这需要加载整个工作簿。我将添加此请求作为增强功能。

Hi, we may need to only read data to improve performance

zhuangState avatar Jun 08 '25 04:06 zhuangState

Thanks @zhuangState. MiniExcel is fast querying the first row, but querying subsequent rows gets slower and slower.

IAsyncEnumerable is not currently supported either: https://github.com/mini-software/MiniExcel/issues/529

virzak avatar Jun 09 '25 14:06 virzak

IAsyncEnumerable work has been merged and should be available in the next major version. If anyone wants to implement IAsyncEnumerable in this library, linking this here: https://github.com/mini-software/MiniExcel/pull/799

virzak avatar Jun 18 '25 15:06 virzak