
[FedScale Core] Error handling when network package dropped

Open continue-revolution opened this issue 2 years ago • 2 comments

What happened + What you expected to happen

I've noticed some common problems that occur when network packets are dropped in real deployment, and I have some proposals for addressing them. I've discussed them with @fanlai0990, and I would like to hear from more contributors to figure out the best plan. @mosharaf @AmberLJC @ewenw @IKACE

  1. Problem: the server->client UPDATE_MODEL message is dropped, so the following server->client MODEL_TEST runs in error (stale model or no model).
     Solution: ignore UPDATE_MODEL and send the model in the MODEL_TEST message.
  2. Problem: the server->client CLIENT_TRAIN message is dropped, so the server keeps answering the client with DUMMY_EVENT forever.
     Solution: keep the event inside the queue until the client confirms that the event has completed.
     Pitfalls:
    • A multi-threaded executor may ping the same event more than once.
    • UPDATE_MODEL has no confirmation, so there is no way to tell whether UPDATE_MODEL finished.
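
To make proposal 2 concrete, here is a minimal sketch of a per-client queue that returns the head event on every ping and only removes it once the client confirms that exact event. This is an illustration, not FedScale's actual code: `AckedEventQueue`, `push`, `ping`, `confirm`, the integer event ids, and the string values of the event names are all assumptions made for this example. Tagging each event with an id is one possible way to handle the first pitfall, since repeated pings from several threads see the same id and a duplicate confirmation is rejected.

```python
import threading
from collections import deque

# Placeholder value for illustration; the real constant is defined by FedScale.
DUMMY_EVENT = "dummy_event"


class AckedEventQueue:
    """Hypothetical per-client queue that drops an event only after confirmation."""

    def __init__(self):
        self._events = deque()          # (event_id, event_name), FIFO order
        self._next_id = 0
        self._lock = threading.Lock()   # pings may arrive from several threads

    def push(self, event_name):
        """Enqueue an event and return the id assigned to it."""
        with self._lock:
            event_id = self._next_id
            self._next_id += 1
            self._events.append((event_id, event_name))
            return event_id

    def ping(self):
        """Return the head event WITHOUT removing it.

        If the response is dropped on the network, the next ping
        returns the same (event_id, event_name) pair again.
        """
        with self._lock:
            if not self._events:
                return None, DUMMY_EVENT
            return self._events[0]

    def confirm(self, event_id):
        """Remove the head event only if the client confirms that exact id."""
        with self._lock:
            if self._events and self._events[0][0] == event_id:
                self._events.popleft()
                return True
            return False  # stale or duplicate confirmation: nothing removed


queue = AckedEventQueue()
train_id = queue.push("client_train")
first = queue.ping()             # suppose this response is lost in transit
second = queue.ping()            # the client pings again and gets the same event
confirmed = queue.confirm(train_id)
duplicate = queue.confirm(train_id)
after = queue.ping()             # queue is empty now, so DUMMY_EVENT is correct
```

With this shape, a dropped CLIENT_TRAIN response costs one extra ping instead of leaving the client on DUMMY_EVENT forever. The client has to tolerate receiving the same event id twice, for example by remembering the last id it started working on.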

Versions / Dependencies

fedscale-0.5
server: ubuntu 16
client: android 23

Reproduction script

Issue Severity

High: It blocks me from completing my task.

continue-revolution avatar Feb 17 '23 22:02 continue-revolution

In the future, we need to collect the piggyback information of each call before popping the event queue.

fanlai0990 avatar Feb 21 '23 06:02 fanlai0990

Hi Chengsong, good observation on the packet drop!

  1. Sending the model in MODEL_TEST makes sense to me.
  2. Regarding the pitfall: if each executor is one client in real deployment, then keeping the event in the queue should be okay? As for UPDATE_MODEL confirmation, isn't this line doing the confirmation?
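
For point 1, a rough sketch of what "send the model in MODEL_TEST" could look like is given below. Everything here is hypothetical and only meant to illustrate the idea: the `Message` dataclass, `make_test_message`, the `Client` class, the version counter, and the `"model_test"` string are assumptions, not FedScale's real message format. The test request carries the weights and a version number, so the client never depends on an earlier UPDATE_MODEL having arrived.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    """Hypothetical server->client message; not FedScale's wire format."""
    event: str
    model_version: int
    weights: Optional[List[float]] = None


def make_test_message(model_version, weights):
    """Build a MODEL_TEST request that carries the model itself."""
    return Message(event="model_test",
                   model_version=model_version,
                   weights=list(weights))


class Client:
    """Hypothetical client that loads whatever model the test request carries."""

    def __init__(self):
        self.model_version = -1   # -1 means no model has been received yet
        self.weights = None

    def handle(self, msg):
        if msg.event == "model_test":
            if msg.weights is not None:
                # Overwrite any stale or missing local model before testing.
                self.weights = list(msg.weights)
                self.model_version = msg.model_version
            # Report which version was tested so the server can check it.
            return self.model_version
        return None


client = Client()                                   # never received UPDATE_MODEL
tested_version = client.handle(make_test_message(3, [0.1, 0.2]))
```

Returning the tested version lets the server discard results that were computed on an unexpected model, at the cost of a larger MODEL_TEST message.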

IKACE avatar Feb 21 '23 22:02 IKACE