fix(sync-server): thread-safe shutdown and error reporting in SyncMapServicer
This PR fixes a race condition in the sync server's error handling: when multiple handler threads crashed concurrently, shutdown and error reporting raced, and errors other than the first were silently lost. With this change, the first error deterministically triggers shutdown and error reporting, and any errors raised after shutdown has started are deliberately ignored. Temporary test files used for debugging have been removed. Closes #198 (Graceful shutdown of gRPC servers when there are exceptions in the User Code). I'm still exploring gRPC, so I may be wrong; open to any feedback or suggestions!
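For context, here is a minimal sketch of the first-error-wins pattern described above. This is not the actual `_sync_servicer.py` code; the `FirstErrorReporter` name and its methods are illustrative only:

```python
import threading


class FirstErrorReporter:
    """Illustrative first-error-wins guard (hypothetical, not pynumaflow's actual class)."""

    def __init__(self, shutdown_callback):
        self._lock = threading.Lock()
        self._shutdown_started = False
        self._shutdown_callback = shutdown_callback
        self.first_error = None

    def record_error(self, exc: BaseException) -> None:
        # Only the first thread to report an error wins the race: it records
        # the error and initiates shutdown. Later errors are ignored because
        # shutdown is already in progress.
        with self._lock:
            if self._shutdown_started:
                return
            self._shutdown_started = True
            self.first_error = exc
        self._shutdown_callback(exc)
```

Each handler thread would wrap its user-code call in try/except and pass the exception to `record_error`, so shutdown and error reporting happen exactly once no matter how many threads crash concurrently.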
Codecov Report
Attention: Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
Project coverage is 94.09%. Comparing base (42f9fbd) to head (4ed0cf5). Report is 1 commit behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| pynumaflow/mapper/_servicer/_sync_servicer.py | 73.68% | 3 Missing and 2 partials :warning: |
Additional details and impacted files
Coverage Diff

| | main | #236 | +/- |
|---|---|---|---|
| Coverage | 94.26% | 94.09% | -0.17% |
| Files | 60 | 60 | |
| Lines | 2441 | 2457 | +16 |
| Branches | 124 | 128 | +4 |
| Hits | 2301 | 2312 | +11 |
| Misses | 101 | 104 | +3 |
| Partials | 39 | 41 | +2 |
@kohlisid @vigith After applying the changes, I ran the race condition tests. The results show that only the first error triggers shutdown and error reporting, and all other errors are ignored after shutdown starts, just as intended. I may be mistaken, though, so I'm looking forward to your feedback!
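For anyone who wants to reproduce this locally, a hedged sketch of that kind of race test, written against the hypothetical `FirstErrorReporter` from the sketch in the PR description (not the real pynumaflow servicer):

```python
import threading

# FirstErrorReporter is the hypothetical helper sketched in the PR description above.


def test_only_first_error_triggers_shutdown():
    reported = []
    reporter = FirstErrorReporter(shutdown_callback=reported.append)

    def crash(i: int) -> None:
        reporter.record_error(RuntimeError(f"boom {i}"))

    threads = [threading.Thread(target=crash, args=(i,)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Exactly one error should have triggered shutdown/reporting;
    # the other nine are ignored once shutdown has started.
    assert len(reported) == 1
    assert reporter.first_error is reported[0]
```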
@sapkota-aayush Would you want to test a few scenarios with FMEA?
- Scale-down events where pods are killed
- Panic in the user code (random)
- Panic in the user code (consistent)

We also want to note down the behaviour of the events post shutdown/restart:
- Are the pods coming back up seamlessly, or are there issues in server startup?
- Are the events that were left midway through processing getting reprocessed?

The ideal end goal for a clean shutdown: when we get a shutdown signal, we would like to close the server to any new incoming events, let the current events process/drain out, and then shut down the orchestrator and server.
@kohlisid does Python gRPC support a drain/shutdown mode?
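On the drain question: as far as I know, Python gRPC's `grpc.Server.stop(grace)` already gives drain-like behaviour, since it rejects new RPCs immediately and lets in-flight RPCs finish within the grace window before cancelling them. A minimal sketch of wiring that to SIGTERM (servicer registration and port binding are omitted, and the 30-second grace period is an arbitrary choice):

```python
import signal
from concurrent import futures

import grpc


def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # ... register servicers and bind ports here (omitted) ...
    server.start()

    def handle_sigterm(signum, frame):
        # stop(grace) rejects new RPCs immediately and gives in-flight RPCs
        # up to `grace` seconds to drain before they are cancelled.
        server.stop(grace=30)

    signal.signal(signal.SIGTERM, handle_sigterm)
    # Returns once the server has fully terminated (i.e. after the drain).
    server.wait_for_termination()
```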
Hi @kohlisid,
Sorry for getting back to this late!
Thanks for the detailed testing scenarios.
I haven’t written tests for scaledown/pod-kill scenarios before.
Do you want me to:
- Perform these tests manually and share the results, or
- Write automated test cases for them as part of this PR?
This will help me approach it the right way.
> Perform these tests manually and share the results, or
@sapkota-aayush Let's do a few of these first as FMEA.
@sapkota-aayush any update on this?
> @sapkota-aayush any update on this?
Looking at it.