feat: Add Opt-in Formula Reading Support
Overview
This PR introduces optional formula reading capabilities to python-calamine, allowing users to access the underlying formulas in spreadsheet cells rather than just their calculated values.
Motivation
Previously, python-calamine only provided access to cell values (the calculated results of formulas). Users had no way to access the actual formula expressions, which is essential for:
- Spreadsheet analysis and auditing
- Formula debugging and validation
- Converting between different spreadsheet formats while preserving formulas
- Building tools that need to understand the calculation logic
Implementation
Core Changes
-
Opt-in Design: Formula reading is disabled by default (
read_formulas=False) to maintain backward compatibility and avoid performance overhead for users who don't need formulas. -
New API Parameter: Added
read_formulasparameter to all workbook creation methods:-
CalamineWorkbook.from_object(path_or_filelike, read_formulas=False) -
CalamineWorkbook.from_path(path, read_formulas=False) -
CalamineWorkbook.from_filelike(filelike, read_formulas=False) -
load_workbook(path_or_filelike, read_formulas=False)
-
-
Formula Iterator: Added
CalamineSheet.iter_formulas()method that returns aCalamineFormulaIteratorwith consistent dimensions matching the data iterator. -
Performance Optimization: Implemented on-demand coordinate mapping instead of pre-allocating expanded ranges, ensuring minimal memory overhead.
Technical Details
- Coordinate Consistency: Formula iterator returns the same dimensions as data iterator, with empty strings for cells without formulas
- Graceful Error Handling: Clear error message when attempting to access formulas without enabling the feature
-
Memory Efficient: Uses
get_value()lookups rather than expanding formula ranges upfront -
Iterator Properties: Both
CalamineCellIteratorandCalamineFormulaIteratorexposeposition,start,width, andheightproperties for coordinate recovery
File Format Support
Supports formula reading across all major spreadsheet formats:
- Excel (.xlsx, .xls, .xlsb)
- OpenDocument (.ods)
Formula syntax varies by format (e.g., ODS uses of:=SUM([.A1:.B1]) vs Excel's SUM(A1:B1)).
API Examples
# Default behavior - formulas disabled
wb = load_workbook("spreadsheet.xlsx")
sheet = wb.get_sheet_by_name("Sheet1")
# sheet.iter_formulas() # Raises ValueError
# Enable formula reading
wb = load_workbook("spreadsheet.xlsx", read_formulas=True)
sheet = wb.get_sheet_by_name("Sheet1")
# Access both data and formulas
data = list(sheet.iter_rows()) # [[10.0, 15.0, 25.0]]
formulas = list(sheet.iter_formulas()) # [["", "", "SUM(A1:B1)"]]
# Iterator properties for coordinate mapping
formula_iter = sheet.iter_formulas()
print(f"Start: {formula_iter.start}, Size: {formula_iter.width}x{formula_iter.height}")
Backward Compatibility
- Default behavior unchanged (
read_formulas=False) - No breaking changes to existing APIs
- Existing code continues to work without modification
Testing
Comprehensive test suite covering:
- All supported file formats (.xlsx, .xls, .ods)
- All workbook creation methods
- Error handling for disabled formulas
- Iterator property validation
- Coordinate consistency between data and formula iterators
Performance Impact
- Zero overhead when formulas are disabled (default)
- Minimal memory usage when formulas are enabled due to on-demand lookup strategy
- No upfront range expansion or pre-allocation
A human addendum to the AI's diff summary ^:
This lib is much faster than openpyxl but remained unusable for my use-case due to:
- Lack of formula support
- Lack of coordinate recovery
This PR should be a big step forward for broader usability and more use-cases. The next thing I'd hit would be a unified CellData iterator that exposes value, coordinate, optional formula, and potentially style information (still need to check Calamine source on this).
Cheers!
@dimastbk bumping this
This would be awesome! Is there any way i can help?
This is something we urgently need, let me know if we can help!