[Feature Request] Add Telemetry consumption library for YARP health check
Feature Request
Support a YARP telemetry library so that we can get the health state of YARP clusters/destinations and report useful telemetry to monitor their health.
Why it is useful?
It would help us monitor and check the health status of target clusters/destinations. Since YARP already supports health checks and tracks the health state, an easy way to report that state would be a valuable addition.
- It would help the host service owner monitor and debug issues based on the health state telemetry of the target cluster.
- It would help the clients (owners of the target clusters) monitor their service health.
For example, we are a Microsoft team building a gateway with YARP, and many client teams have onboarded their services. With this telemetry, we could build useful alerts and dashboards to check each client team's health.
Current gap
Currently YARP only exposes health states through IDestinationHealthUpdater. We would have to resort to a hacky workaround: wrap IDestinationHealthUpdater and inject our own code to emit telemetry in the SetActive method.
@davidni
From the description I take it you are mostly interested in having some notification when the health status changes? There's a bit of discussion about it here: #1515
As a potentially less painful alternative to IDestinationHealthUpdater, you can look at the IProxyStateLookup interface we added in the 1.1.0 rc. You can use this to poll the state of all clusters and their destinations.
@MihaZupan oh this is nice, thx! We are still consuming 1.0 and didn't realize the new feature was there.
For scenarios where all you need is access to the health state of each destination, you can use the new IProxyStateLookup interface that shipped with YARP 1.1 to query the current proxy state.
An issue that was identified in the email discussion is that YARP only reacts to health changes and updates destinations once all health probes for a given cluster have completed. If one probe request takes a long time to complete, the other destinations may not be updated in a timely fashion. With the default timeout being 10 seconds, this may not be that big of an issue.
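As a rough illustration of the polling approach, here is a minimal sketch of a hosted service that walks `IProxyStateLookup.GetClusters()` and logs a line whenever a destination's health changes. The class name `DestinationHealthPoller`, the 5-second interval, and the log message are assumptions made for this example, not part of YARP; it also assumes .NET 6 or later (for `PeriodicTimer`) and YARP 1.1 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Yarp.ReverseProxy;
using Yarp.ReverseProxy.Model;

// Hypothetical hosted service: polls YARP's proxy state and logs health transitions.
public sealed class DestinationHealthPoller : BackgroundService
{
    // Assumed polling interval; tune it to your alerting needs.
    private static readonly TimeSpan PollInterval = TimeSpan.FromSeconds(5);

    private readonly IProxyStateLookup _lookup;
    private readonly ILogger<DestinationHealthPoller> _logger;

    // Last observed (active, passive) health, keyed by "clusterId/destinationId".
    private readonly Dictionary<string, (DestinationHealth Active, DestinationHealth Passive)> _last = new();

    public DestinationHealthPoller(IProxyStateLookup lookup, ILogger<DestinationHealthPoller> logger)
    {
        _lookup = lookup;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(PollInterval);
        do
        {
            foreach (var cluster in _lookup.GetClusters())
            {
                foreach (var (destinationId, destination) in cluster.Destinations)
                {
                    var key = $"{cluster.ClusterId}/{destinationId}";
                    var current = (destination.Health.Active, destination.Health.Passive);

                    // Skip destinations whose health has not changed since the last poll.
                    if (_last.TryGetValue(key, out var previous) && previous == current)
                    {
                        continue;
                    }

                    _last[key] = current;

                    // Replace this log line with your own metric or event emission.
                    _logger.LogInformation(
                        "Destination {Destination} in cluster {Cluster}: active={Active}, passive={Passive}",
                        destinationId, cluster.ClusterId, current.Item1, current.Item2);
                }
            }
        }
        while (await timer.WaitForNextTickAsync(stoppingToken));
    }
}
```

Registering it with `builder.Services.AddHostedService<DestinationHealthPoller>();` after `AddReverseProxy()` should be enough, since `IProxyStateLookup` is resolved from DI. Because this polls, a change that flips and reverts between two polls is missed, and entries for removed destinations are never evicted from the dictionary in this sketch.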
Ideally it should also be possible to correlate probe requests with telemetry events (IHttpTelemetryConsumer).
Triage: Makes sense for 2.0