refactor: make convert_proto_to_parquet_flatten more memory-efficient
What was changed
Refactored convert_proto_to_parquet_flatten to reduce memory usage and speed up execution, and added test coverage:
Implementation changes:
- Replaced MessageToJson → MessageToDict to avoid JSON serialization overhead
- Eliminated intermediate DataFrame creation and concatenation during conversion
- Build a single list of row dicts, then create the DataFrame once with pd.json_normalize (see the sketch after this list)
- Reduced memory usage by avoiding multiple DataFrame copies
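
A minimal sketch of the new shape (hypothetical names and proto attribute layout; the real function lives in cloud_export_to_parquet/data_trans_activities.py and may differ in details):

```python
import pandas as pd
from google.protobuf.json_format import MessageToDict


def convert_proto_to_parquet_flatten(workflows) -> pd.DataFrame:
    """Flatten workflow history events into one DataFrame in a single pass."""
    rows = []
    for workflow in workflows:
        # Assumes each workflow exposes .history.events; adjust to the
        # actual proto shape.
        for event in workflow.history.events:
            # MessageToDict goes proto -> dict directly, skipping the
            # JSON-string round-trip that MessageToJson would add.
            rows.append(MessageToDict(event))
    # One allocation at the end instead of per-workflow frames + pd.concat.
    return pd.json_normalize(rows, sep=".")
```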
Test coverage (10 new tests):
- Unit tests for convert_proto_to_parquet_flatten using duck-typed fakes to simulate Temporal protos (a sketch of the pattern follows this list)
- Covers: basic conversion, empty executions, schema validation, column filtering
- Edge cases: workflows with no events, missing attributes (documents 2 existing bugs)
- Parametrized tests for multiple workflow scenarios
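
An illustrative sketch of the duck-typed-fake pattern (hypothetical helper names and module path; the real tests in tests/cloud_export_to_parquet/test_data_trans_activities.py may stub the proto boundary differently):

```python
from types import SimpleNamespace

from cloud_export_to_parquet.data_trans_activities import (
    convert_proto_to_parquet_flatten,
)


def fake_workflow(*event_dicts):
    """Quacks like a workflow proto: exposes only .history.events."""
    events = [SimpleNamespace(as_dict=d) for d in event_dicts]
    return SimpleNamespace(history=SimpleNamespace(events=events))


def test_basic_conversion(monkeypatch):
    # Stand in for MessageToDict so fakes need no real proto descriptors.
    monkeypatch.setattr(
        "cloud_export_to_parquet.data_trans_activities.MessageToDict",
        lambda event: event.as_dict,
    )
    wf = fake_workflow({"eventId": "1", "eventType": "WorkflowExecutionStarted"})
    df = convert_proto_to_parquet_flatten([wf])
    assert list(df["eventType"]) == ["WorkflowExecutionStarted"]
```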
Why?
The original implementation created multiple intermediate DataFrames and performed an expensive concat operation for each workflow, causing high memory usage and slow activity execution on large exports. The refactor accumulates plain row dicts and materializes a single DataFrame at the end, while producing identical output.
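
For contrast, a schematic reconstruction of the old pattern (not verbatim repository code):

```python
import json

import pandas as pd
from google.protobuf.json_format import MessageToJson


def convert_proto_to_parquet_flatten_old(workflows) -> pd.DataFrame:
    """Old shape: a JSON round-trip plus a DataFrame concat per event."""
    df = pd.DataFrame()
    for workflow in workflows:
        for event in workflow.history.events:
            # Proto -> JSON string -> dict is an avoidable serialization step.
            row = json.loads(MessageToJson(event))
            # Concatenating inside the loop recopies all prior rows every
            # iteration, so total work grows quadratically with row count.
            df = pd.concat([df, pd.json_normalize([row])], ignore_index=True)
    return df
```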
Checklist
- How was this tested: All 10 tests pass (uv run poe test tests/cloud_export_to_parquet/test_data_trans_activities.py -v); see test coverage above.
- Any docs updates needed? Don't think this applies.