python-calamine icon indicating copy to clipboard operation
python-calamine copied to clipboard

feat: Add Opt-in Formula Reading Support

Open PSU3D0 opened this issue 6 months ago • 4 comments

Overview

This PR introduces optional formula reading capabilities to python-calamine, allowing users to access the underlying formulas in spreadsheet cells rather than just their calculated values.

Motivation

Previously, python-calamine only provided access to cell values (the calculated results of formulas). Users had no way to access the actual formula expressions, which is essential for:

  • Spreadsheet analysis and auditing
  • Formula debugging and validation
  • Converting between different spreadsheet formats while preserving formulas
  • Building tools that need to understand the calculation logic

Implementation

Core Changes

  1. Opt-in Design: Formula reading is disabled by default (read_formulas=False) to maintain backward compatibility and avoid performance overhead for users who don't need formulas.

  2. New API Parameter: Added read_formulas parameter to all workbook creation methods:

    • CalamineWorkbook.from_object(path_or_filelike, read_formulas=False)
    • CalamineWorkbook.from_path(path, read_formulas=False)
    • CalamineWorkbook.from_filelike(filelike, read_formulas=False)
    • load_workbook(path_or_filelike, read_formulas=False)
  3. Formula Iterator: Added CalamineSheet.iter_formulas() method that returns a CalamineFormulaIterator with consistent dimensions matching the data iterator.

  4. Performance Optimization: Implemented on-demand coordinate mapping instead of pre-allocating expanded ranges, ensuring minimal memory overhead.

Technical Details

  • Coordinate Consistency: Formula iterator returns the same dimensions as data iterator, with empty strings for cells without formulas
  • Graceful Error Handling: Clear error message when attempting to access formulas without enabling the feature
  • Memory Efficient: Uses get_value() lookups rather than expanding formula ranges upfront
  • Iterator Properties: Both CalamineCellIterator and CalamineFormulaIterator expose position, start, width, and height properties for coordinate recovery

File Format Support

Supports formula reading across all major spreadsheet formats:

  • Excel (.xlsx, .xls, .xlsb)
  • OpenDocument (.ods)

Formula syntax varies by format (e.g., ODS uses of:=SUM([.A1:.B1]) vs Excel's SUM(A1:B1)).

API Examples

# Default behavior - formulas disabled
wb = load_workbook("spreadsheet.xlsx")
sheet = wb.get_sheet_by_name("Sheet1")
# sheet.iter_formulas()  # Raises ValueError

# Enable formula reading
wb = load_workbook("spreadsheet.xlsx", read_formulas=True)
sheet = wb.get_sheet_by_name("Sheet1")

# Access both data and formulas
data = list(sheet.iter_rows())      # [[10.0, 15.0, 25.0]]
formulas = list(sheet.iter_formulas())  # [["", "", "SUM(A1:B1)"]]

# Iterator properties for coordinate mapping
formula_iter = sheet.iter_formulas()
print(f"Start: {formula_iter.start}, Size: {formula_iter.width}x{formula_iter.height}")

Backward Compatibility

  • Default behavior unchanged (read_formulas=False)
  • No breaking changes to existing APIs
  • Existing code continues to work without modification

Testing

Comprehensive test suite covering:

  • All supported file formats (.xlsx, .xls, .ods)
  • All workbook creation methods
  • Error handling for disabled formulas
  • Iterator property validation
  • Coordinate consistency between data and formula iterators

Performance Impact

  • Zero overhead when formulas are disabled (default)
  • Minimal memory usage when formulas are enabled due to on-demand lookup strategy
  • No upfront range expansion or pre-allocation

PSU3D0 avatar Jul 07 '25 01:07 PSU3D0

A human addendum to the AI's diff summary ^:

This lib is much faster than openpyxl but remained unusable for my use-case due to:

  • Lack of formula support
  • Lack of coordinate recovery

This PR should be a big step forward for broader usability and more use-cases. The next thing I'd hit would be a unified CellData iterator that exposes value, coordinate, optional formula, and potentially style information (still need to check Calamine source on this).

Cheers!

PSU3D0 avatar Jul 07 '25 01:07 PSU3D0

@dimastbk bumping this

PSU3D0 avatar Jul 12 '25 03:07 PSU3D0

This would be awesome! Is there any way i can help?

benemanu avatar Aug 21 '25 13:08 benemanu

This is something we urgently need, let me know if we can help!

simx11 avatar Aug 21 '25 16:08 simx11