zetaclientd support for multiple RPCs per chain w/ failover
Each zetaclientd node needs a reliable RPC endpoint for each external chain, and if that endpoint fails it needs to connect to a new one. If zetaclientd can support multiple RPCs per chain with automatic failover, operators don't need to build their own highly available, centralized solutions.
Since external chain connectivity is critical for ongoing operations, it would be good to offer failover capabilities in an easy and decentralized way.
To support multiple RPCs per chain for failover, zetaclient needs to be able to measure the health status of each endpoint and switch to a healthy one when the current one is unhealthy.
Why is health status measurement needed?
- The current RPC endpoint can be unhealthy (e.g., down, ran out of entitled requests, etc.).
- `zetaclient` needs a numeric health index threshold to determine whether RPC rotation is necessary. RPC rotation should not be triggered frequently by a few transient RPC failures.
- When rotation is needed, `zetaclient` needs to decide which of the other RPC endpoints is good to switch to. RPC rotation should not be done blindly.
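The threshold idea above can be sketched as a sliding window of recent health-check results per endpoint. This is an illustrative sketch only; `healthHistory` and its fields are hypothetical names, not actual zetaclient code:

```go
package main

import "fmt"

// healthHistory is a fixed-size sliding window of health-check results
// for one RPC endpoint (true = check passed). Illustrative only.
type healthHistory struct {
	results []bool
	max     int // window size, e.g. 20 tickers = 5 minutes at 15s
}

// record appends a result and evicts the oldest one when the window is full.
func (h *healthHistory) record(ok bool) {
	h.results = append(h.results, ok)
	if len(h.results) > h.max {
		h.results = h.results[1:]
	}
}

// failRate returns the fraction of failed checks in the window.
func (h *healthHistory) failRate() float64 {
	if len(h.results) == 0 {
		return 0
	}
	fails := 0
	for _, ok := range h.results {
		if !ok {
			fails++
		}
	}
	return float64(fails) / float64(len(h.results))
}

func main() {
	h := &healthHistory{max: 20}
	for i := 0; i < 20; i++ {
		h.record(i%2 == 0) // alternate pass/fail: 50% failure rate
	}
	const rotationFailRate = 0.40 // hypothetical RPCRotationHealthCheckFailRate
	fmt.Println(h.failRate() >= rotationFailRate) // true: rotation would trigger
}
```

Keeping a window rather than a single latest result is what prevents rotation from firing on one-off failures.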
How to measure the health of each RPC endpoint?
- For each external chain, zetaclient creates a dedicated `RPCHealthCheck` goroutine.
- The `RPCHealthCheck` goroutine runs basic health checks (get block number, block header, gas price, unspent UTXOs, etc.) against each RPC endpoint candidate on a ticker. The ticker interval is configurable, e.g., every 15 seconds.
- The `RPCHealthCheck` goroutine maintains a history of health check results for each RPC endpoint candidate. The history length is configurable, e.g., the last 5 minutes (20 tickers).
- The `RPCHealthCheck` goroutine defines an `RPCRotationHealthCheckFailRate` indicating the minimum percentage of failed health checks in the history needed to trigger an RPC rotation. The `RPCRotationHealthCheckFailRate` is configurable, e.g., 40%.
- When `RPCRotationHealthCheckFailRate` is reached, `zetaclient` finds the healthiest RPC endpoint candidate `R` (other than the current one) and switches to endpoint `R` only if `R` is healthier than the current one.
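The rotation decision in the last bullet can be sketched as follows, assuming each endpoint has a health index (fraction of successful checks in its history). `pickRotationTarget` is a hypothetical helper, not actual zetaclient code:

```go
package main

import "fmt"

// pickRotationTarget returns the index of the healthiest candidate other
// than the current endpoint, or -1 if no candidate is strictly healthier
// than the current one (in which case no rotation should happen).
func pickRotationTarget(healthIndex []float64, current int) int {
	best := -1
	for i, h := range healthIndex {
		if i == current {
			continue
		}
		if best == -1 || h > healthIndex[best] {
			best = i
		}
	}
	if best == -1 || healthIndex[best] <= healthIndex[current] {
		return -1 // stay on the current endpoint
	}
	return best
}

func main() {
	// Current endpoint 0 at 55% healthy; candidates at 80% and 40%.
	fmt.Println(pickRotationTarget([]float64{0.55, 0.80, 0.40}, 0)) // 1
	// Current endpoint is already the healthiest: no rotation.
	fmt.Println(pickRotationTarget([]float64{0.90, 0.80, 0.40}, 0)) // -1
}
```

The "only if healthier" comparison avoids blind rotation, and it also covers the all-endpoints-down case below: if nothing beats the current endpoint, zetaclient keeps it.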
How to feed zetaclient with multiple RPC endpoints?
The zetaclientd operator fills in the list of RPC endpoints in the config file and then restarts zetaclientd.
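A config fragment for this might look like the following. The key names here are purely illustrative; the actual zetaclientd config schema may differ:

```json
{
  "EVMChainConfigs": {
    "1": {
      "Endpoints": [
        "https://rpc-primary.example.com",
        "https://rpc-backup.example.com"
      ]
    }
  }
}
```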
Which endpoint should be used when zetaclientd starts?
zetaclient simply starts with the first endpoint in the list. If that endpoint happens to be unhealthy, an RPC rotation will be triggered within 5 minutes by the health check mechanism described above.
What happens when all endpoints are unhealthy (e.g. down)?
zetaclient will still work with the healthiest endpoint in the list according to the above mechanism, even though it is unhealthy (or not working at all).
What metrics can be exposed?
- The index (0-based) or label of the current RPC endpoint being used by `zetaclientd`, per chain.
- The health index (percentage of successful health checks) of each RPC endpoint candidate, per chain.
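In Prometheus text exposition format these two gauges might render as below. The metric names are hypothetical, not existing zetaclientd metrics; a real implementation would register gauges with zetaclientd's existing Prometheus setup instead:

```go
package main

import "fmt"

// renderMetrics formats the two proposed gauges for one chain in
// Prometheus text exposition format. Metric names are illustrative.
func renderMetrics(chain string, current int, healthIndex []float64) []string {
	lines := []string{
		fmt.Sprintf("zetaclient_rpc_current_endpoint{chain=%q} %d", chain, current),
	}
	for i, h := range healthIndex {
		lines = append(lines, fmt.Sprintf(
			"zetaclient_rpc_health_index{chain=%q,endpoint=\"%d\"} %.2f", chain, i, h))
	}
	return lines
}

func main() {
	for _, l := range renderMetrics("eth_mainnet", 1, []float64{0.55, 0.95}) {
		fmt.Println(l)
	}
}
```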
We'll need to make sure that the `RPCHealthCheck` goroutine has some tolerance for some RPC providers being slower than others with regard to block height.
It would likely also be good to handle a 429 response a little differently than other kinds of RPC unhealthiness. I could see a situation where a zetaclient using a rate-limited endpoint manages to pass a health check, only to have subsequent requests quickly rate limited again before the next health check runs.
This situation, though, might be better handled by having the existing zetaclient routines trigger a rotation upon receipt of a 429. Or is that overkill, and should zetaclient just eat the 429s until it runs a health check again?
I've come full circle on this issue. Originally I thought we should make it as easy as possible for a zetaclient operator to run a node successfully, and that includes the ability to fail over to a new RPC endpoint if needed.
I still think that would be helpful, but it's a nice-to-have feature, not a requirement. We've been encountering far fewer RPC issues as we've improved the quality of our RPC providers and made other improvements.
I think there are some alternative approaches we could take instead of implementing this in zetaclient directly. For example, providing documentation and/or Docker images to run the dshackle EVM RPC load balancer we've begun testing in other areas.