Feature Request

Support a YARP telemetry library so that we can get the health state of YARP clusters/destinations and report useful telemetry to monitor their health.

Why it is useful?

It would help us monitor and check the health status of target clusters/destinations. Since YARP already supports health checks and tracks the health state, an easy way to report that state would be a valuable upgrade.

It would help the host service owner monitor and easily debug issues based on the health state telemetry of the target cluster.
It would also help the clients (target clusters) monitor the health of their own services.

For example, we are a Microsoft team building a gateway with YARP, and many client teams' services are onboarded to it. With this telemetry, we could build useful alerts and dashboards to check the health of each client team's service.

Current gap

Currently YARP only exposes health states through IDestinationHealthUpdater. To emit telemetry, we would have to resort to a hacky workaround: wrap IDestinationHealthUpdater and inject our own code into the SetActive method.

Mar 29 '22 00:03 nlyu

@davidni

Mar 29 '22 00:03 nlyu

From the description I take it you are mostly interested in having some notification when the health status changes? There's a bit of discussion about it here: #1515

As a potentially less painful alternative to IDestinationHealthUpdater, you can look at the IProxyStateLookup interface we added in the 1.1.0 rc. You can use this to poll the state of all clusters and their destinations.
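A minimal sketch of that polling approach, under some assumptions: the `HealthStatePoller` class name, the 15-second interval, and the log message are all hypothetical, and the sketch assumes IProxyStateLookup can be resolved from DI once AddReverseProxy() has been called. It only logs the state; a real telemetry sink would go where the log call is.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Yarp.ReverseProxy;

// Hypothetical hosted service: periodically reads destination health
// from IProxyStateLookup and writes it to the log.
public sealed class HealthStatePoller : BackgroundService
{
    // Assumed polling interval; tune it to your alerting needs.
    private static readonly TimeSpan PollInterval = TimeSpan.FromSeconds(15);

    private readonly IProxyStateLookup _lookup;
    private readonly ILogger<HealthStatePoller> _logger;

    public HealthStatePoller(IProxyStateLookup lookup, ILogger<HealthStatePoller> logger)
    {
        _lookup = lookup;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            foreach (var cluster in _lookup.GetClusters())
            {
                // Destinations is keyed by destination id.
                foreach (var pair in cluster.Destinations)
                {
                    var health = pair.Value.Health;
                    _logger.LogInformation(
                        "Cluster {ClusterId} destination {DestinationId}: active={Active}, passive={Passive}",
                        cluster.ClusterId, pair.Key, health.Active, health.Passive);
                }
            }

            // Throws OperationCanceledException on shutdown, which ends the loop.
            await Task.Delay(PollInterval, stoppingToken);
        }
    }
}
```

Registering it with `services.AddHostedService<HealthStatePoller>()` next to `AddReverseProxy()` should be enough to start the loop. Note that polling only samples the state at each interval, so short-lived health transitions between two polls would be missed.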

Mar 29 '22 09:03 MihaZupan

@MihaZupan Oh, this is nice, thanks! We are still consuming 1.0 and didn't realize this new feature was available.

Mar 29 '22 21:03 nlyu

For scenarios where all you need is access to the health state of each destination, you can use the new IProxyStateLookup interface that shipped with Yarp 1.1 to query the current proxy state.

An issue that was identified in the email discussion is that Yarp will only react to health changes and update destinations once all health probes for a given cluster complete. If one probe request takes a long time to complete, the other destinations may not be updated in a timely fashion. With the default timeout being 10 seconds, this may not be that big of an issue.

Ideally it should also be possible to correlate probe requests with telemetry events (IHttpTelemetryConsumer).

May 17 '22 17:05 MihaZupan

Triage: Makes sense for 2.0

May 26 '22 16:05 karelz

[Feature Request] Add Telemetry consumption library for YARP health check