fix(sync-server): thread-safe shutdown and error reporting in SyncMapServicer
This PR fixes a race condition in the sync server's error handling: when multiple handler threads crashed concurrently, shutdown and error reporting raced, and errors other than the first were silently lost. With this change, the first error deterministically triggers shutdown and error reporting, and any errors raised after shutdown has started are deliberately ignored. Temporary test files used for debugging have been removed. Closes #198 (Graceful shutdown of gRPC servers when there are exceptions in the User Code). I'm still exploring gRPC, so I may be wrong; open to any feedback or suggestions!
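For context, here is a minimal sketch of the first-error-wins pattern described above. This is not the actual `_sync_servicer.py` code; the `FirstErrorReporter` name and its methods are illustrative only:

```python
import threading


class FirstErrorReporter:
    """Illustrative first-error-wins guard (hypothetical, not pynumaflow's actual class)."""

    def __init__(self, shutdown_callback):
        self._lock = threading.Lock()
        self._shutdown_started = False
        self._shutdown_callback = shutdown_callback
        self.first_error = None

    def record_error(self, exc: BaseException) -> None:
        # Only the first thread to report an error wins the race: it records
        # the error and initiates shutdown. Later errors are ignored because
        # shutdown is already in progress.
        with self._lock:
            if self._shutdown_started:
                return
            self._shutdown_started = True
            self.first_error = exc
        self._shutdown_callback(exc)
```

Each handler thread would wrap its user-code call in try/except and pass the exception to `record_error`, so shutdown and error reporting happen exactly once no matter how many threads crash concurrently.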
Codecov Report
Attention: Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
Project coverage is 94.09%. Comparing base (42f9fbd) to head (4ed0cf5). Report is 1 commit behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| pynumaflow/mapper/_servicer/_sync_servicer.py | 73.68% | 3 Missing and 2 partials :warning: |
Additional details and impacted files
Coverage Diff

| | main | #236 | +/- |
|---|---|---|---|
| Coverage | 94.26% | 94.09% | -0.17% |
| Files | 60 | 60 | |
| Lines | 2441 | 2457 | +16 |
| Branches | 124 | 128 | +4 |
| Hits | 2301 | 2312 | +11 |
| Misses | 101 | 104 | +3 |
| Partials | 39 | 41 | +2 |
@kohlisid @vigith After applying the changes, I ran the race condition tests. The results show that only the first error triggers shutdown and error reporting, and all other errors are ignored after shutdown starts, just as intended. I may be mistaken, though, so I'm looking forward to your feedback!
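For anyone who wants to reproduce this locally, a hedged sketch of that kind of race test, written against the hypothetical `FirstErrorReporter` from the sketch in the PR description (not the real pynumaflow servicer):

```python
import threading

# FirstErrorReporter is the hypothetical helper sketched in the PR description above.


def test_only_first_error_triggers_shutdown():
    reported = []
    reporter = FirstErrorReporter(shutdown_callback=reported.append)

    def crash(i: int) -> None:
        reporter.record_error(RuntimeError(f"boom {i}"))

    threads = [threading.Thread(target=crash, args=(i,)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Exactly one error should have triggered shutdown/reporting;
    # the other nine are ignored once shutdown has started.
    assert len(reported) == 1
    assert reporter.first_error is reported[0]
```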
@sapkota-aayush Would you want to test a few scenarios with FMEA?
- Scale-down events where pods are killed
- Panic in the user code (random)
- Panic in the user code (consistent)

We also want to note down the behaviour of the events post shutdown/restart:
- Are the pods coming back up seamlessly, or are there issues in server startup?
- Are the events that were left midway through processing getting reprocessed?

The ideal end goal for a clean shutdown: when we get a shutdown signal, we would like to close the server to any new incoming events, let the current events process/drain out, and then shut down the orchestrator and server.
@kohlisid does Python gRPC support a drain/shutdown mode?
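On the drain question: as far as I know, Python gRPC's `grpc.Server.stop(grace)` already gives drain-like behaviour, since it rejects new RPCs immediately and lets in-flight RPCs finish within the grace window before cancelling them. A minimal sketch of wiring that to SIGTERM (servicer registration and port binding are omitted, and the 30-second grace period is an arbitrary choice):

```python
import signal
from concurrent import futures

import grpc


def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # ... register servicers and bind ports here (omitted) ...
    server.start()

    def handle_sigterm(signum, frame):
        # stop(grace) rejects new RPCs immediately and gives in-flight RPCs
        # up to `grace` seconds to drain before they are cancelled.
        server.stop(grace=30)

    signal.signal(signal.SIGTERM, handle_sigterm)
    # Returns once the server has fully terminated (i.e. after the drain).
    server.wait_for_termination()
```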
Hi @kohlisid,
Sorry for getting back to this late!
Thanks for the detailed testing scenarios.
I haven’t written tests for scaledown/pod-kill scenarios before.
Do you want me to:
- Perform these tests manually and share the results, or
- Write automated test cases for them as part of this PR?
This will help me approach it the right way.
> Perform these tests manually and share the results, or
@sapkota-aayush Let's do a few of these first as FMEA.
@sapkota-aayush any update on this?
> @sapkota-aayush any update on this?
Looking at it.