envoy Missing metrics for resource monitor global_downstream_max

Title: Missing metrics for resource monitor global_downstream_max_connections

Description: When the Overload Manager is enabled with the envoy.resource_monitors.global_downstream_max_connections resource monitor, some metrics are missing. Only the failed_updates metric is available.

overload.envoy.resource_monitors.global_downstream_max_connections.failed_updates: 0

The pressure and skipped_updates metrics are missing, although the documentation lists them. Other resource monitors are not impacted.

overload.envoy.resource_monitors.fixed_heap.failed_updates: 0
overload.envoy.resource_monitors.fixed_heap.pressure: 6
overload.envoy.resource_monitors.fixed_heap.skipped_updates: 0

refresh_interval:
  seconds: 0
  nanos: 250000000
resource_monitors:
  - name: "envoy.resource_monitors.global_downstream_max_connections"
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.resource_monitors.downstream_connections.v3.DownstreamConnectionsConfig
      max_active_downstream_connections: 1000
actions:
  - name: "envoy.overload_actions.disable_http_keepalive"
    triggers:
      - name: "envoy.resource_monitors.global_downstream_max_connections"
        threshold:
          value: 0.95

Is it intended?

May 21 '24 20:05 aerialls

The downstream connections resource monitor has a different style and only updates the failed_updates metric.

@nezdolik, can you comment on whether that's intentional?

May 23 '24 17:05 zuercher

@aerialls @zuercher this is indeed a different type of monitor, and it does not work the same way as the other resource monitors. It is a proactive resource monitor, faster than the others: it checks resource usage inline and does not wait for the overload manager flush period to kick in and evaluate all resource usage. We did not want to reveal implementation details to users. @aerialls, could you link the doc where it says that these stats (pressure, skipped_updates) are reported by global_downstream_max_connections? We will update the docs.
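To make the distinction concrete, here is a toy Python sketch of the two styles described above. It is not Envoy's C++ implementation: the class and method names (PolledMonitor, ProactiveMonitor, on_flush, try_allocate, release) are hypothetical and exist only for illustration. It shows why a periodically flushed monitor naturally publishes a pressure value, while an inline check simply accepts or rejects each request.

```python
class PolledMonitor:
    """Periodic style (hypothetical): usage is sampled only when the flush timer fires."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.pressure = 0.0  # last published value; may be stale between flushes

    def on_flush(self, current_usage: int) -> None:
        # Called once per refresh interval; publishes usage as a fraction of the limit.
        self.pressure = current_usage / self.limit


class ProactiveMonitor:
    """Inline style (hypothetical): each allocation is checked against the limit immediately."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.in_use = 0

    def try_allocate(self, n: int = 1) -> bool:
        # Reject right away if the request would exceed the limit.
        if self.in_use + n > self.limit:
            return False
        self.in_use += n
        return True

    def release(self, n: int = 1) -> None:
        # Return resources; never drop below zero.
        self.in_use = max(0, self.in_use - n)


polled = PolledMonitor(limit=1000)
polled.on_flush(current_usage=60)

proactive = ProactiveMonitor(limit=2)
results = [proactive.try_allocate() for _ in range(3)]
```

In this sketch the third try_allocate call is rejected without waiting for any flush, which mirrors the "checks resource usage inline" behaviour described in the comment.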

May 24 '24 21:05 nezdolik

Thank you for the explanations! The documentation page is this one, at the end of it.

Each configured resource monitor has a statistics tree rooted at overload.<name>. with the following statistics:

The documentation does not have any notion of proactive or standard monitors. I expected to see the metrics for the global_downstream_max_connections monitor as it is listed with other monitors.

Is it possible to have a dedicated metric that reflects the global_downstream_max_connections value configured in the overload manager? My use case is to monitor the current usage for each monitor, and since the pressure metric is missing for this one, I must find another way. One option is to use the server.total_connections metric for the current usage, but the configured max limit is not exposed as a metric, and it is needed to compute the current pressure manually. I configure different max connections depending on the resources available on the machine, so being able to retrieve the max value dynamically would be a big help.
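As a rough illustration of the workaround mentioned above, the following Python sketch derives a pressure value from server.total_connections. It assumes the stats are available as "name: value" text lines like those quoted earlier in this issue, and that the limit is supplied by hand, since no metric exposes it. The helper names parse_stats and connection_pressure and the sample values are hypothetical, and server.total_connections may not count exactly what the monitor counts.

```python
def parse_stats(text: str) -> dict:
    """Parse 'name: value' lines into a dict, keeping only integer values."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # blank line or no separator
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip values that are not plain integers
    return stats


def connection_pressure(stats: dict, max_connections: int) -> float:
    """Approximate pressure as current connections divided by the configured limit."""
    return stats["server.total_connections"] / max_connections


# Sample input with made-up values, in the same format as the stats quoted above.
sample = """\
server.total_connections: 950
overload.envoy.resource_monitors.global_downstream_max_connections.failed_updates: 0
"""

stats = parse_stats(sample)
# max_connections must mirror max_active_downstream_connections from the config.
pressure = connection_pressure(stats, max_connections=1000)
```

The limit still has to be kept in sync with each machine's configuration by hand, which is exactly the gap a dedicated metric would close.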

I tried to do it myself, but the ThreadLocalOverloadState class does not expose the max limit, so updating the InstanceBase::updateServerStats method is not as simple as I imagined.

Also, will the actions associated with the global_downstream_max_connections monitor be executed correctly? I'm not sure, as the documentation always shows configurations for this monitor without actions, meaning that the monitor only blocks new connections when the limit is reached but we can't do anything else.

Thanks for your time!

May 25 '24 16:05 aerialls

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Jun 24 '24 20:06 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

Jul 02 '24 00:07 github-actions[bot]

Also, will the actions associated with the global_downstream_max_connections monitor be executed correctly? I'm not sure, as the documentation always shows configurations for this monitor without actions, meaning that the monitor only blocks new connections when the limit is reached but we can't do anything else.

I'm also curious to understand this ☝🏻 🤔

Dec 13 '24 14:12 eduardobaitello