reverse-proxy icon indicating copy to clipboard operation
reverse-proxy copied to clipboard

Request to have metrics emitted on health check failures

Open timmydo opened this issue 2 years ago • 4 comments

What should we add or change to make your life better?

Requesting metrics in some of the default classes such as: https://github.com/microsoft/reverse-proxy/blob/04b1fe8d8cd0f25b814805c2fb91c68fb26c7f29/src/ReverseProxy/Health/DestinationHealthUpdater.cs#L43

https://github.com/microsoft/reverse-proxy/blob/04b1fe8d8cd0f25b814805c2fb91c68fb26c7f29/src/ReverseProxy/Health/ActiveHealthCheckMonitor.cs#L197C31-L197C31

https://learn.microsoft.com/en-us/dotnet/core/diagnostics/metrics-collection

Specifically, today, I'm looking for a way to have alerts go off when our health checks fail without reimplementing YARP's default implementations. More generally it would be nice if YARP had metrics to complement the Log statements.

Why is this important to you?

We are able to set up alerts on metrics but not on individual log messages. Metrics can alert the on-call engineers to look at the logs for further information.

timmydo avatar Sep 15 '23 16:09 timmydo

here is a quick list I put together by searching for Log. and removing some that didn't look relevant:

src\ReverseProxy\Configuration\ConfigProvider\ConfigurationConfigProvider.cs:
   70:             Log.LoadData(_logger);

   88:                 Log.ConfigurationDataConversionFailed(_logger, ex);

  110:                 Log.ErrorSignalingChange(_logger, ex);


src\ReverseProxy\Forwarder\ForwarderHttpClientFactory.cs:
  40:             Log.ClientReused(_logger, context.ClusterId);

  60:         Log.ClientCreated(_logger, context.ClusterId);

src\ReverseProxy\Forwarder\ForwarderMiddleware.cs:
  50:             Log.NoAvailableDestinations(_logger, cluster.ClusterId);

  62:             Log.MultipleDestinationsAvailable(_logger, cluster.ClusterId);

  79:             ForwarderTelemetry.Log.ForwarderInvoke(cluster.ClusterId, route.Config.RouteId, destination.DestinationId);

src\ReverseProxy\Forwarder\HttpForwarder.cs:
  150:                     Log.NotProxying(_logger, context.Response.StatusCode);

  154:                 Log.Proxying(_logger, destinationRequest, isStreamingRequest);

  171:                         Log.RetryingWebSocketDowngradeNoConnect(_logger);

  179:                         Log.RetryingWebSocketDowngradeNoHttp2(_logger);

  214:             Log.ResponseReceived(_logger, destinationResponse);

  490:                 Log.InvalidSecWebSocketKeyHeader(_logger, key);

  922:         Log.ErrorProxying(_logger, error, ex);

src\ReverseProxy\Health\ActiveHealthCheckMonitor.cs:
   61:                 Log.ExplicitActiveCheckOfAllClustersHealthFailed(_logger, ex);

  116:         Log.StartingActiveHealthProbingOnCluster(_logger, cluster.ClusterId);

  142:             Log.ActiveHealthProbingFailedOnCluster(_logger, cluster.ClusterId, ex);

  156:                 Log.ErrorOccuredDuringActiveHealthProbingShutdownOnCluster(_logger, cluster.ClusterId, ex);
  158  
  159:             Log.StoppedActiveHealthProbingOnCluster(_logger, cluster.ClusterId);

  176:             Log.ActiveHealthProbeConstructionFailedOnCluster(_logger, destination.DestinationId, cluster.ClusterId, ex);
  177  

  187:             Log.SendingHealthProbeToEndpointOfDestination(_logger, request.RequestUri, destination.DestinationId, cluster.ClusterId);
  189:             Log.DestinationProbingCompleted(_logger, destination.DestinationId, cluster.ClusterId, (int)response.StatusCode);

  197:             Log.DestinationProbingFailed(_logger, destination.DestinationId, cluster.ClusterId, ex);

src\ReverseProxy\Health\DestinationHealthUpdater.cs:
   43:                     Log.ActiveDestinationHealthStateIsSetToUnhealthy(_logger, destination.DestinationId, cluster.ClusterId);

   47:                     Log.ActiveDestinationHealthStateIsSet(_logger, destination.DestinationId, cluster.ClusterId, newHealth);

   85:             Log.UnhealthyDestinationIsScheduledForReactivation(_logger, destination.DestinationId, reactivationPeriod);

  100:             Log.PassiveDestinationHealthResetToUnkownState(_logger, destination.DestinationId);


src\ReverseProxy\LoadBalancing\LoadBalancingMiddleware.cs:
  59:             Log.NoAvailableDestinations(_logger, proxyFeature.Cluster.Config.ClusterId);


src\ReverseProxy\Model\ProxyPipelineInitializerMiddleware.cs:
  40:             Log.NoClusterFound(_logger, route.Config.RouteId);


src\ReverseProxy\WebSocketsTelemetry\WebSocketsTelemetryMiddleware.cs:
  56:                 WebSocketsTelemetry.Log.WebSocketClosed(

  80:                 WebSocketsTelemetry.Log.WebSocketClosed(

timmydo avatar Sep 15 '23 16:09 timmydo

@timmydo Thanks for reporting this. We believe there's value in making some of these metrics, but perhaps not all of them (e.g. error conditions that should really be addressed immediately).

If you could provide a list of the ones that are the highest priority to add metrics for, that would be great.

adityamandaleeka avatar Sep 19 '23 17:09 adityamandaleeka

Triage: perhaps another interesting metric might be a catch-all "this request couldn't be proxied" metric.

adityamandaleeka avatar Sep 19 '23 17:09 adityamandaleeka

Counters for these files above look to be higher priority, with higher priority being to logs that would be error or warning:

src\ReverseProxy\Forwarder\ForwarderMiddleware.cs
src\ReverseProxy\Health\DestinationHealthUpdater.cs
src\ReverseProxy\Health\ActiveHealthCheckMonitor.cs
src\ReverseProxy\LoadBalancing\LoadBalancingMiddleware.cs

having counters for successful requests is useful also to validate whether no errors is a data-missing scenario or no errors. Also, you can measure the reliability.

timmydo avatar Sep 21 '23 19:09 timmydo