refactor: make convert_proto_to_parquet_flatten more memory-efficient
What was changed
Refactored convert_proto_to_parquet_flatten to reduce memory usage and speed up execution, and added test coverage:
Implementation changes:
- Replaced MessageToJson → MessageToDict to avoid JSON serialization overhead
- Eliminated intermediate DataFrame creation and concatenation during conversion
- Build a single list of row dicts, then create the DataFrame once with pd.json_normalize (see the sketch after this list)
- Reduced memory usage by avoiding multiple DataFrame copies
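
A minimal sketch of the new shape (hypothetical names and proto attribute layout; the real function lives in cloud_export_to_parquet/data_trans_activities.py and may differ in details):

```python
import pandas as pd
from google.protobuf.json_format import MessageToDict


def convert_proto_to_parquet_flatten(workflows) -> pd.DataFrame:
    """Flatten workflow history events into one DataFrame in a single pass."""
    rows = []
    for workflow in workflows:
        # Assumes each workflow exposes .history.events; adjust to the
        # actual proto shape.
        for event in workflow.history.events:
            # MessageToDict goes proto -> dict directly, skipping the
            # JSON-string round-trip that MessageToJson would add.
            rows.append(MessageToDict(event))
    # One allocation at the end instead of per-workflow frames + pd.concat.
    return pd.json_normalize(rows, sep=".")
```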
Test coverage (10 new tests):
- Unit tests for convert_proto_to_parquet_flatten using duck-typed fakes to simulate Temporal protos (a sketch of the pattern follows this list)
- Covers: basic conversion, empty executions, schema validation, column filtering
- Edge cases: workflows with no events, missing attributes (documents 2 existing bugs)
- Parametrized tests for multiple workflow scenarios
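
An illustrative sketch of the duck-typed-fake pattern (hypothetical helper names and module path; the real tests in tests/cloud_export_to_parquet/test_data_trans_activities.py may stub the proto boundary differently):

```python
from types import SimpleNamespace

from cloud_export_to_parquet.data_trans_activities import (
    convert_proto_to_parquet_flatten,
)


def fake_workflow(*event_dicts):
    """Quacks like a workflow proto: exposes only .history.events."""
    events = [SimpleNamespace(as_dict=d) for d in event_dicts]
    return SimpleNamespace(history=SimpleNamespace(events=events))


def test_basic_conversion(monkeypatch):
    # Stand in for MessageToDict so fakes need no real proto descriptors.
    monkeypatch.setattr(
        "cloud_export_to_parquet.data_trans_activities.MessageToDict",
        lambda event: event.as_dict,
    )
    wf = fake_workflow({"eventId": "1", "eventType": "WorkflowExecutionStarted"})
    df = convert_proto_to_parquet_flatten([wf])
    assert list(df["eventType"]) == ["WorkflowExecutionStarted"]
```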
Why?
The original implementation created multiple intermediate DataFrames and performed an expensive concat operation for each workflow, causing high memory usage and slow activity execution on large exports. The refactor accumulates plain row dicts and materializes a single DataFrame at the end, while producing identical output.
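
For contrast, a schematic reconstruction of the old pattern (not verbatim repository code):

```python
import json

import pandas as pd
from google.protobuf.json_format import MessageToJson


def convert_proto_to_parquet_flatten_old(workflows) -> pd.DataFrame:
    """Old shape: a JSON round-trip plus a DataFrame concat per event."""
    df = pd.DataFrame()
    for workflow in workflows:
        for event in workflow.history.events:
            # Proto -> JSON string -> dict is an avoidable serialization step.
            row = json.loads(MessageToJson(event))
            # Concatenating inside the loop recopies all prior rows every
            # iteration, so total work grows quadratically with row count.
            df = pd.concat([df, pd.json_normalize([row])], ignore_index=True)
    return df
```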
Checklist
- How was this tested: All 10 tests pass (uv run poe test tests/cloud_export_to_parquet/test_data_trans_activities.py -v); see test coverage above.
- Any docs updates needed? Don't think this applies.