zetaclientd support for multiple RPCs per chain w/ failover

Open CharlieMc0 opened this issue 2 years ago • 3 comments

Each zetaclientd node needs a reliable RPC endpoint for each external chain, and if that endpoint fails, the node needs to connect to a new one. If zetaclient supports multiple RPCs per chain with automatic failover, operators don't need to build their own highly available centralized solutions.

Since external chain connectivity is critical for ongoing operations, it would be good to offer failover capabilities in an easy and decentralized way.

CharlieMc0, Nov 20 '23 23:11

To support multiple RPCs per chain for failover, zetaclient needs to be able to measure the health status of each endpoint and switch to a healthy one when the current one is unhealthy.

Why is health status measurement needed?

  • The current RPC endpoint can be unhealthy (e.g., down or out of entitled requests).
  • zetaclient needs a health index threshold to determine whether RPC rotation is necessary. A few isolated RPC failures should not trigger frequent rotations.
  • When rotation is needed, zetaclient needs to decide which of the other RPC endpoints to switch to. RPC rotation should not be done blindly.

How to measure the health of each RPC endpoint?

  1. For each external chain, zetaclient creates a dedicated RPCHealthCheck goroutine.
  2. The RPCHealthCheck goroutine runs basic health checks (get block number, block header, gas price, unspent UTXOs, etc.) against each RPC endpoint candidate on a ticker. The ticker interval is configurable, e.g., every 15 seconds.
  3. The RPCHealthCheck goroutine maintains a history of health check results for each RPC endpoint candidate. The history length is configurable, e.g., the last 5 minutes (20 ticks).
  4. The RPCHealthCheck goroutine defines an RPCRotationHealthCheckFailRate: the minimum percentage of failed health checks in the history that triggers an RPC rotation. The RPCRotationHealthCheckFailRate is configurable, e.g., 40%.
  5. When RPCRotationHealthCheckFailRate is reached, zetaclient finds the healthiest RPC endpoint candidate R (other than the current one) and switches to endpoint R only if R is healthier than the current one.
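
A rough sketch of steps 3 to 5, to make the windowed fail rate and the rotation rule concrete. The type and function names (healthHistory, pickEndpoint) are hypothetical and not taken from the zetaclient codebase; the window size and threshold are passed in by the caller, e.g., 20 and 0.4.

```go
package main

// healthHistory is a fixed-size ring buffer holding the most recent
// health check results for one RPC endpoint (true = check passed).
// Hypothetical type, for illustration only.
type healthHistory struct {
	results []bool
	next    int // slot that the next result will overwrite
	filled  int // number of slots holding real results
}

func newHealthHistory(size int) *healthHistory {
	return &healthHistory{results: make([]bool, size)}
}

// record stores the latest result, overwriting the oldest once the window is full.
func (h *healthHistory) record(ok bool) {
	h.results[h.next] = ok
	h.next = (h.next + 1) % len(h.results)
	if h.filled < len(h.results) {
		h.filled++
	}
}

// failRate returns the fraction of failed checks in the window (0 when empty).
func (h *healthHistory) failRate() float64 {
	if h.filled == 0 {
		return 0
	}
	failed := 0
	for i := 0; i < h.filled; i++ {
		if !h.results[i] {
			failed++
		}
	}
	return float64(failed) / float64(h.filled)
}

// pickEndpoint returns the index of the endpoint to use next. It keeps the
// current endpoint while its fail rate is below rotationFailRate; otherwise
// it moves to the candidate with the lowest fail rate, and only if that
// candidate is strictly healthier than the current one.
func pickEndpoint(current int, histories []*healthHistory, rotationFailRate float64) int {
	if histories[current].failRate() < rotationFailRate {
		return current
	}
	best := current
	for i, h := range histories {
		if h.failRate() < histories[best].failRate() {
			best = i
		}
	}
	return best
}
```

With this rule, if every endpoint is equally unhealthy, pickEndpoint returns the current index, so no rotation happens.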

How to feed zetaclient with multiple RPC endpoints? The zetaclient operator fills in the list of RPC endpoints in the config file and then restarts zetaclientd.

Which endpoint should be used when zetaclientd starts? zetaclient simply picks the first endpoint in the list to start with. If the first endpoint happens to be unhealthy, RPC rotation will be triggered within 5 minutes by the health check mechanism above.
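
As an illustration of the config and startup answers above, here is a minimal sketch that parses a per-chain endpoint list and starts on the first entry. The JSON field names (chain_id, endpoints) and the Go names are assumptions made for this example; they are not the actual zetaclient config schema.

```go
package main

import (
	"encoding/json"
	"errors"
)

// chainRPCConfig is a hypothetical per-chain config entry holding an
// ordered list of RPC endpoints. The field names are illustrative only.
type chainRPCConfig struct {
	ChainID   int64    `json:"chain_id"`
	Endpoints []string `json:"endpoints"`
}

// parseChainRPCConfig decodes one config entry and rejects an empty endpoint list.
func parseChainRPCConfig(raw []byte) (chainRPCConfig, error) {
	var cfg chainRPCConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, err
	}
	if len(cfg.Endpoints) == 0 {
		return cfg, errors.New("at least one RPC endpoint is required")
	}
	return cfg, nil
}

// initialEndpoint returns the endpoint to use at startup: the first one in
// the list. It must only be called on a config returned without error by
// parseChainRPCConfig.
func initialEndpoint(cfg chainRPCConfig) string {
	return cfg.Endpoints[0]
}
```

Because the first entry is always tried first, the operator can express a preference simply by ordering the list.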

What happens when all endpoints are unhealthy (e.g., down)? zetaclient will still choose to work on the healthiest endpoint in the list based on the mechanism above, even though it is unhealthy (or not working at all).

What metrics can be exposed?

  1. The index (0-based) or label of the current RPC endpoint being used by zetaclientd per chain.
  2. The health index (percentage of successful health checks) of each RPC endpoint candidate per chain.

ws4charlie, May 06 '24 16:05

We'll need to make sure that the RPCHealthCheck goroutine has some tolerance for some RPC providers being slower than others with regard to block height.
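
One way to express that tolerance, as a sketch only: compare each endpoint's reported height against the highest height seen among the candidates, and treat an endpoint as lagging only when it falls more than a configurable number of blocks behind. The function name and the maxLag parameter are hypothetical.

```go
package main

// laggingEndpoints reports, for each endpoint, whether its block height is
// more than maxLag blocks behind the highest height among all candidates.
// heights[i] is the latest block height returned by endpoint i.
func laggingEndpoints(heights []uint64, maxLag uint64) []bool {
	var highest uint64
	for _, h := range heights {
		if h > highest {
			highest = h
		}
	}
	lagging := make([]bool, len(heights))
	for i, h := range heights {
		// highest >= h always holds, so the subtraction cannot underflow.
		lagging[i] = highest-h > maxLag
	}
	return lagging
}
```

A suitable maxLag would likely differ per chain, since block times vary widely between chains.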

It would likely also be good to handle a 429 response a little differently from other kinds of RPC unhealthiness. I could see a situation where a zetaclient using a rate-limited endpoint manages to pass a health check, only to have subsequent requests quickly rate limited again before the next health check runs.

This situation, though, might be better handled by having the existing zetaclient routine trigger a rotation when it receives a 429. Or is that overkill, and should zetaclient just eat the 429s until the next health check runs?

CryptoFewka, May 06 '24 17:05

I've come full circle on this issue. Originally I thought we should make it as easy as possible for a zetaclient operator to run a node successfully and that includes the ability to failover to a new RPC endpoint if needed.

I still think that would be helpful, but it is a nice-to-have feature, not a requirement. We've been encountering far fewer RPC issues as we've improved the quality of RPC providers and made other improvements.

I think there are some alternative approaches we could take instead of implementing it in zetaclient directly. For example, we could provide documentation and/or Docker images for running the dshackle EVM RPC load balancer we've begun testing in other areas.

CharlieMc0, May 06 '24 18:05