FedScale
[FedScale Core] Error handling when network package dropped
What happened + What you expected to happen
I've noticed some common problems when network packages are dropped in real deployment, and I have some proposals for handling them. I've discussed this with @fanlai0990, and I would like to hear from more contributors to figure out the best plan. @mosharaf @AmberLJC @ewenw @IKACE
- problem: the server->client UPDATE_MODEL package is dropped, so the following server->client MODEL_TEST runs in error (stale model or no model)
  solution: ignore UPDATE_MODEL and send the model in the MODEL_TEST package
- problem: the server->client CLIENT_TRAIN package is dropped, and the server then replies with DUMMY_EVENT to the client forever
  solution: keep the event inside the queue until the client confirms that the event is completed
  pitfalls:
  - a multi-threaded executor may ping the same event more than once
  - UPDATE_MODEL has no confirmation, so there is no way to tell whether UPDATE_MODEL finished
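To make the second proposal concrete, here is a minimal sketch of a per-client queue that only pops an event after the client confirms it. Everything in it is an assumption for illustration: `AckedEventQueue`, the `event_id` field, and the string values of the event names are hypothetical and not taken from the FedScale code base.

```python
from collections import deque

# Assumed placeholder event name; the real constant may differ.
DUMMY_EVENT = "dummy_event"


class AckedEventQueue:
    """Per-client event queue that removes an event only after confirmation.

    Hypothetical helper for illustration; not part of the FedScale API.
    """

    def __init__(self):
        self._events = deque()  # pending (event_id, event_name) pairs, oldest first
        self._next_id = 0

    def push(self, event_name):
        """Enqueue an event and return the id the client must confirm."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, event_name))
        return event_id

    def ping(self):
        """Return the head event without removing it.

        If the reply is dropped, the next ping returns the same event again.
        """
        if not self._events:
            return None, DUMMY_EVENT
        return self._events[0]

    def confirm(self, event_id):
        """Pop the head event if the id matches; ignore stale or duplicate confirmations."""
        if self._events and self._events[0][0] == event_id:
            self._events.popleft()
            return True
        return False


queue = AckedEventQueue()
train_id = queue.push("client_train")
print(queue.ping())             # head event is returned
print(queue.ping())             # same event again, e.g. after a dropped reply
print(queue.confirm(train_id))  # client confirms, event is popped
print(queue.ping())             # nothing pending, so the dummy event is returned
```

Because `ping` can return the same event several times, the client would have to remember the last `event_id` it started and skip repeats, which is how the multi-thread pitfall could be handled. The same confirmation step would also cover UPDATE_MODEL if it were kept.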
Versions / Dependencies
- fedscale-0.5
- server: ubuntu 16
- client: android 23
Reproduction script
Issue Severity
High: It blocks me from completing my task.
In the future, we need to collect the piggyback information of each call before popping the event queue.
Hi Chengsong, Good observation on the package drop!
- Sending the model in MODEL_TEST makes sense to me.
- Regarding the pitfall: if each executor is one client in real deployment, then keeping the event in the queue should be okay? As for UPDATE_MODEL confirmation, isn't this line doing the confirmation?
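For the first point, a rough sketch of what sending the model inside MODEL_TEST could look like. The message layout, the function names `build_model_test_event` and `handle_model_test`, and the use of `pickle` for serialization are all assumptions made for illustration, not the existing FedScale message format.

```python
import pickle

# Assumed event name; the real constant may differ.
MODEL_TEST = "model_test"


def build_model_test_event(model_weights, model_version):
    """Server side (hypothetical): bundle the weights with the test request."""
    return {
        "event": MODEL_TEST,
        "model_version": model_version,
        "payload": pickle.dumps(model_weights),
    }


def handle_model_test(event, evaluate):
    """Client side (hypothetical): load weights from the event itself, then evaluate.

    The client never depends on an earlier UPDATE_MODEL having arrived.
    """
    weights = pickle.loads(event["payload"])
    return {
        "model_version": event["model_version"],  # echoed back as confirmation
        "result": evaluate(weights),
    }


event = build_model_test_event({"w": [1.0, 2.0]}, model_version=3)
report = handle_model_test(event, evaluate=lambda weights: sum(weights["w"]))
print(report["model_version"], report["result"])
```

Echoing `model_version` in the result lets the server tell which model the client actually tested, so a stale model would be visible rather than silent.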