Yi Cheng

Results 77 comments of Yi Cheng

> if we have fault tolerance semantics documented somewhere.

@rkooo567 I actually put some of the workflow [here](https://github.com/ray-project/ray/blob/972caacc365d7bf17f7e3916cbbc41b7196b0654/src/ray/common/ray_syncer/ray_syncer-inl.h#L91-L117). It at least gives the developer an overview of how the state changes.

All tests passed with the flag on: https://buildkite.com/ray-project/oss-ci-build-pr/builds/10260. The ray syncer test failed in ASan. I'm going to take another look, but I feel it's close.

@edoakes I think it depends on the server side. As long as the master in the cluster is alive, it should work. So let's say: 1. Ray is running and...

> > 1. Ray is running and the master failed => GCS won't be back. (This can be improved in the future with more work).
> > 2. Start a...

@edoakes I think your point about the db becoming a single point of failure makes sense. The challenging part is that if the master is down, we need to redirect all the requests...

@edoakes This is correct. I think one step further is to make it work in the case where the master is down. We have two ways: 1. improve the current code...

@edoakes [here](https://github.com/ray-project/ray/blob/master/src/ray/gcs/store_client/store_client.h#L59) you can see that the API returns a Status and the callback doesn't handle failure or success. redis++ needs the callback to take care of the error code....
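To make the mismatch concrete, here is a minimal sketch of the two callback styles. All names (`Status`, `GetStyleA`, `GetStyleB`) are hypothetical stand-ins, not Ray's or redis++'s actual types or signatures; the sketch only illustrates the shape of the problem described above.

```cpp
#include <cassert>      // for the assertions in the usage note
#include <functional>
#include <string>
#include <system_error>

// Hypothetical stand-in for a status type; NOT Ray's actual Status class.
struct Status {
  bool ok;
  std::string message;
};

// Style A (assumed shape): the call returns a Status, and the callback
// only ever receives the value. An error discovered after the call has
// returned has no channel to reach the callback.
Status GetStyleA(const std::string &key,
                 const std::function<void(const std::string &)> &on_done) {
  if (key.empty()) {
    return Status{false, "empty key"};  // only up-front failures are visible
  }
  on_done("value-for-" + key);
  return Status{true, ""};
}

// Style B (assumed shape): the callback itself receives an error code
// alongside the result, so every outcome is reported in one place.
void GetStyleB(
    const std::string &key,
    const std::function<void(std::error_code, const std::string &)> &on_done) {
  if (key.empty()) {
    on_done(std::make_error_code(std::errc::invalid_argument), "");
    return;
  }
  on_done(std::error_code{}, "value-for-" + key);
}
```

With style A, a caller checks the returned `Status` and the callback separately; with style B, a caller such as `GetStyleB("", cb)` sees the failure inside `cb` and can `assert` on the error code there. Adapting one style to the other means deciding where late errors should go.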

Coming back to this work after the on-call work. Interesting, `scan` and `hscan` actually return different types of values...
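One possible reading of that difference, since the comment is truncated: at the Redis protocol level, `SCAN` returns a cursor plus a list of keys, while `HSCAN` returns a cursor plus a flat list that alternates field and value. The sketch below assumes the element list has already been read into a `std::vector<std::string>`; the helper names are hypothetical and do not come from Ray or redis++.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// SCAN: every element of the reply's second part is a key.
std::vector<std::string> ParseScanElements(
    const std::vector<std::string> &elements) {
  return elements;
}

// HSCAN: elements alternate field, value, field, value, ...
// so they have to be paired up before use.
std::vector<std::pair<std::string, std::string>> ParseHScanElements(
    const std::vector<std::string> &elements) {
  if (elements.size() % 2 != 0) {
    throw std::invalid_argument("HSCAN reply must have an even element count");
  }
  std::vector<std::pair<std::string, std::string>> out;
  out.reserve(elements.size() / 2);
  for (std::size_t i = 0; i < elements.size(); i += 2) {
    out.emplace_back(elements[i], elements[i + 1]);
  }
  return out;
}
```

Code that treats both replies the same way would silently mix fields and values in the `HSCAN` case, which is why the two need separate handling.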

It seems only the cpp test failure is related. Almost there.