umya-spreadsheet icon indicating copy to clipboard operation
umya-spreadsheet copied to clipboard

[discussion] lazy load spreadsheet feature(s) for speed and memory footprint?

Open agentjill opened this issue 2 years ago • 11 comments

As much as I have seen I think that the entire Excel file props are loaded into memory to read and write. To optimize performance, I suggest selectively loading and manipulating data in memory rather than the entire dataset. This could boost efficiency, especially for larger files and scenarios where only specific data is modified. I think most of the workloads are based on using a limited feature set anyway.

I'd appreciate a good discussion on this proposal.

agentjill avatar Dec 04 '23 08:12 agentjill

For systems that support it (I suppose systems that work with Excel files should), the entire file can simply be mem-mapped into memory and accessed like it has been loaded. That might be easier to handle than to convert into a stream-based API...

schungx avatar Dec 04 '23 08:12 schungx

@schungx you raise a very valid point, but i think this library is not implemented in that way. The spreadsheet is decoded into memory props which is then manipulated as needed. And on write, the props stored are written to file.

agentjill avatar Dec 05 '23 11:12 agentjill

True. I am not knowledgeable enough about the Excel file format to know how it is done, except that it is XML-based and zipped (probably compressed). With that, it may not be possible to stream it without reconstructing the whole thing in memory first, because essential data structures may reside at the very end...

schungx avatar Dec 05 '23 12:12 schungx

@schungx Yes but take in the use case of just reading the excel range into memory as vec of strings. For that it would be waste of resource to load in the styles, protections, tables etc. of that cell in the memory as it was not requested by the user. Also, the said are located in another xml file. Therefore, just retrieving the data by decoding in time is better for memory in my opinion. Entire loading is also a use case which user may specify for repeated file editing.

agentjill avatar Dec 06 '23 00:12 agentjill

Perhaps flags to control what info should be loaded. That would give more flexibility.

schungx avatar Dec 06 '23 03:12 schungx

It implements lazy_read(path) in case a large file is to be edited. This implementation allows the user to specify when to deserialize on a per-sheet basis.

// Read raw data (not deserialized)
let path = std::path::Path::new("./tests/test_files/aaa.xlsx");
let mut book = umya_spreadsheet::reader::xlsx::lazy_read(path).unwrap();

// Deserialize the first sheet
let mut sheet = book.get_sheet_mut(&0).unwrap();

A first stepping stone could be, for example, the following renovations

  1. reading of Excel files is done by lazy_read(path).
  2. The system automatically deserializes each portion when needed.
  3. a method to deserialize all parts is also provided.

MathNya avatar Dec 06 '23 03:12 MathNya

@MathNya, i felt lazy_read(path) is lacking sufficient documentation for exact use case. I was using book.get_sheet(&0). unwrap() and getting error.

Also, once the sheet is loaded that would mean things like table, protection etc are also loaded into memory props. I was meaning for that to be loaded lazily.

I am open for discussion on this issue. Thanks for responding

agentjill avatar Dec 06 '23 03:12 agentjill

Or some form of a builder API:

let excel_reader = ExcelReaderBuilder::new().with_values().with_formulas().with_pictures().build();

// The following reads the Excel file Book.xlsx, but only with values, formulas and pictures.
// Other stuff like formatting, drawings, etc. are skipped (i.e. default values).
// It would be like the sheet is copied and "Paste-Special" onto a blank new sheet with only values and formulas.
let file = excel_reader.read("Book.xlsx").unwrap();

schungx avatar Dec 06 '23 04:12 schungx

@agentjill You are right. lazy_read(path) is a bit quirky. I have felt the need to improve the documentation.

Also, once the sheet is loaded that would mean things like table, protection etc are also loaded into memory props. I was meaning for that to be loaded lazily.

It is wonderful. I agree with the lazy load subdivision.

MathNya avatar Dec 06 '23 06:12 MathNya

[You can] use the parallel programming, Luke! worksheets are almost independent

pub fn read_reader<R: io::Read + io::Seek>( reader: R, with_sheet_read: bool, ) -> Result<Spreadsheet, XlsxError> { ... for (_, type_value, rel_target) in &workbook_rel { // can it be parallel? ... }

mjhmjhgkmj avatar Apr 16 '24 19:04 mjhmjhgkmj

@mjhmjhgkmj Very nice idea. The sections you mentioned can be processed in parallel. (Perhaps there are other areas where parallel processing is possible.)

MathNya avatar Apr 17 '24 02:04 MathNya