
CPU throttling on PromProxy node with 1300+ agents

Open mhussain584 opened this issue 9 months ago • 97 comments

Hi Paul, hope you are doing well.

Thanks again for your help with the 1.23.1 release last time, which fixed my issue.

Now I've encountered another issue: while sending metrics from 1300+ agents (including about 500 dummy agents), the prometheus-proxy service is experiencing CPU throttling.

Current node resources:

8 cores
32 GB memory

I've upgraded both the agent and proxy to 2.1.0, but the throttling is still evident.

PromProxy logs

16:33:30.299 INFO  [CallLogging.kt:48] - 200 OK: GET - /blmxx04_hssidsmac_virtual_09_10_147_233_111_hs_metrics - ip-100-65-70-119.ec2.internal [DefaultDispatcher-worker-31]
16:33:30.302 INFO  [CallLogging.kt:48] - 200 OK: GET - /blmxx06_hssidsmac_virtual_09_10_147_233_117_hs_metrics - ip-100-65-70-119.ec2.internal [DefaultDispatcher-worker-36]
16:33:30.305 INFO  [CallLogging.kt:48] - 503 Service Unavailable: GET - /blmxx03_hssidsmac_virtual_15_10_147_233_105_hs_metrics - ip-100-65-70-119.ec2.internal [DefaultDispatcher-worker-62]
16:33:30.305 INFO  [ProxyHttpConfig.kt:103] -  Throwable caught: ClosedByteChannelException [DefaultDispatcher-worker-7]
io.ktor.utils.io.ClosedByteChannelException: Broken pipe
  at io.ktor.utils.io.CloseToken$wrapCause$1.invoke(CloseToken.kt:16)
  at io.ktor.utils.io.CloseToken$wrapCause$1.invoke(CloseToken.kt:16)
  at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
  at io.ktor.utils.io.CloseToken.wrapCause$default(CloseToken.kt:16)
  at io.ktor.utils.io.ByteChannel.getClosedCause(ByteChannel.kt:61)
  at io.ktor.utils.io.ByteReadChannelOperationsKt.rethrowCloseCauseIfNeeded(ByteReadChannelOperations.kt:543)
  at io.ktor.utils.io.ByteChannel.flush(ByteChannel.kt:94)
  at io.ktor.utils.io.CloseHookByteWriteChannel.flush(CloseHookByteWriteChannel.kt)
  at io.ktor.utils.io.ByteReadChannelOperationsKt.copyTo(ByteReadChannelOperations.kt:206)
  at io.ktor.utils.io.ByteReadChannelOperationsKt$copyTo$2.invokeSuspend(ByteReadChannelOperations.kt)
  at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
  at kotlinx.coroutines.DispatchedTaskKt.resume(DispatchedTask.kt:233)
  at kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined(DispatchedTask.kt:175)
  at kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:147)
  at kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:470)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:504)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeWith(CancellableContinuationImpl.kt:359)
  at io.ktor.utils.io.ByteChannel$Slot$Task$DefaultImpls.resume(ByteChannel.kt:239)
  at io.ktor.utils.io.ByteChannel$Slot$Read.resume(ByteChannel.kt:242)
  at io.ktor.utils.io.ByteChannel.closeSlot(ByteChannel.kt:177)
  at io.ktor.utils.io.ByteChannel.flushAndClose(ByteChannel.kt:133)
  at io.ktor.utils.io.ByteWriteChannelOperationsKt$writer$job$1.invokeSuspend(ByteWriteChannelOperations.kt:184)
  at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
  at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
  at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:586)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:829)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:717)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:704)
Caused by: io.ktor.utils.io.ClosedWriteChannelException: Broken pipe
  at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
  at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
  at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
  at io.ktor.utils.io.CloseToken.throwOrNull(CloseToken.kt:26)
  at io.ktor.utils.io.ByteChannel.getWriteBuffer(ByteChannel.kt:54)
  at io.ktor.utils.io.ByteWriteChannelOperationsKt.writeByte(ByteWriteChannelOperations.kt:19)
  at io.ktor.http.cio.internals.CharsKt.writeIntHex(Chars.kt:110)
  at io.ktor.http.cio.ChunkedTransferEncodingKt.writeChunk(ChunkedTransferEncoding.kt:167)
  at io.ktor.http.cio.ChunkedTransferEncodingKt.access$writeChunk(ChunkedTransferEncoding.kt:1)
  at io.ktor.http.cio.ChunkedTransferEncodingKt.encodeChunked(ChunkedTransferEncoding.kt:138)
  at io.ktor.http.cio.ChunkedTransferEncodingKt$encodeChunked$2.invokeSuspend(ChunkedTransferEncoding.kt)
  at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
  at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
  at kotlinx.coroutines.EventLoop.processUnconfinedEvent(EventLoop.common.kt:65)
  at kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined(DispatchedTask.kt:243)
  at kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:147)
  at kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:470)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:504)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
  at kotlinx.coroutines.CancellableContinuationImpl.resumeWith(CancellableContinuationImpl.kt:359)
  at io.ktor.utils.io.ByteChannel$Slot$Task$DefaultImpls.resume(ByteChannel.kt:236)
  at io.ktor.utils.io.ByteChannel$Slot$Read.resume(ByteChannel.kt:242)
  at io.ktor.utils.io.ByteChannel.flushWriteBuffer(ByteChannel.kt:389)
  at io.ktor.utils.io.ByteChannel.flush(ByteChannel.kt:96)
  at io.ktor.utils.io.ByteChannel.flushAndClose(ByteChannel.kt:128)
  ... 7 common frames omitted
Caused by: io.ktor.utils.io.ClosedWriteChannelException: Broken pipe
  at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
  at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
  at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
  at io.ktor.utils.io.CloseToken.throwOrNull(CloseToken.kt:26)
  at io.ktor.utils.io.ByteChannel.getWriteBuffer(ByteChannel.kt:54)
  at io.ktor.utils.io.ByteReadChannelOperationsKt.copyTo(ByteReadChannelOperations.kt:175)
  at io.ktor.utils.io.ByteReadChannelOperationsKt$copyTo$1.invokeSuspend(ByteReadChannelOperations.kt)
  at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
  at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
  at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:113)
  at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:89)
  at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:586)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:820)
  ... 2 common frames omitted
Caused by: java.io.IOException: Broken pipe
  at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
  at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:62)
  at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:132)
  at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:97)
  at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:53)
  at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:532)
  at io.ktor.network.sockets.CIOWriterKt$attachForWritingDirectImpl$1.invokeSuspend$lambda$1(CIOWriter.kt:42)
  at io.ktor.utils.io.core.ByteReadPacketExtensions_jvmKt.read(ByteReadPacketExtensions.jvm.kt:30)
  at io.ktor.network.sockets.CIOWriterKt$attachForWritingDirectImpl$1.invokeSuspend(CIOWriter.kt:78)
  ... 8 common frames omitted
16:33:30.306 INFO  [CallLogging.kt:48] - 200 OK: GET - /blmxx03_hssidsmac_virtual_04_10_147_234_71_hs_metrics - ip-100-65-70-119.ec2.internal [DefaultDispatcher-worker-7]

Container resource usage (screenshot)

Prom-agent.conf (Just a snippet)

proxy {
  admin.debugEnabled = true

  admin.enabled: true
  metrics.enabled: true

  http.requestLoggingEnabled: true
}

agent {

  proxy.hostname: "promproxy-us-east-1.aws-prod.example.com:50051"
  admin.enabled: true
  metrics.enabled: true

  pathConfigs: [
    {
      name: "Agent System metrics"
      path: blmxx_agent_sys_metrics
      url: "http://10.147.92.199:9100/metrics"
    },
    {
      name: "Agent metrics"
      path: blmxx_agent_metrics
      labels: "{\"cloud\": \"blmxx\"}"
      url: "http://localhost:8083/metrics"
    },
    {
      name: "blmxx01.iosserver-5xxxxxxxx38.app_metrics"
      path: "blmxx01_iosserver_app_metrics_10_147_234_98_9091"
      labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\"}"
      url: "http://10.147.234.98:9091/metrics"
    },
    {
      name: "blmxx01.iosserver-5xxxxxxxx38.sys_metrics"
      path: "blmxx01_iosserver_sys_metrics_10_147_234_98_9100"
      labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\"}"
      url: "http://10.147.234.98:9100/metrics"
    },
    {
      name: "blmxx01.iosserver-5xxxxxxxx38VIRTUAL_01"
      path: "blmxx01_iosserver_virtual_01_10_147_234_98_hs_metrics"
      labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"01\"}"
      url: "http://10.147.234.98/VIRTUAL/01/hs/metrics"
    },
    {
      name: "blmxx01.iosserver-5xxxxxxxx38.VIRTUAL_02"
      path: "blmxx01_iosserver_virtual_02_10_147_234_98_hs_metrics"
      labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"02\"}"
      url: "http://10.147.234.98/VIRTUAL/02/hs/metrics"
    },
    {
      name: "blmcm01.iosserver-5xxxxxxxx38.VIRTUAL_03"
      path: "blmxx01_iosserver_virtual_03_10_147_234_98_hs_metrics"
      labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"03\"}"
      url: "http://10.147.234.98/VIRTUAL/03/hs/metrics"
    },
    {
      name: "blmxx01.iosserver-5xxxxxxxx38.VIRTUAL_04"
      path: "blmxx01_iosserver_virtual_04_10_147_234_98_hs_metrics"
      labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"04\"}"
      url: "http://10.147.234.98/VIRTUAL/04/hs/metrics"
    },
    {
      name: "blmxx01.iosserver-5xxxxxxxx38.VIRTUAL_05"
      path: "blmxx01_iosserver_virtual_05_10_147_234_98_hs_metrics"
      labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"05\"}"
      url: "http://10.147.234.98/VIRTUAL/05/hs/metrics"
    },
    {
      name: "blmxx0.iosserver-5xxxxxxxx38.VIRTUAL_06"
      path: "blmxx01_iosserver_virtual_06_10_147_234_98_hs_metrics"
      labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"06\"}"
      url: "http://10.147.234.98/VIRTUAL/06/hs/metrics"
    },
    {

Your help on this will be appreciated.

Thanks

mhussain584 avatar Jul 01 '25 16:07 mhussain584

Weird. So with fewer agents you are not seeing problems?

pambrose avatar Jul 01 '25 17:07 pambrose

Can you re-open the issue? I accidentally closed it.

mhussain584 avatar Jul 01 '25 17:07 mhussain584

Weird. So with fewer agents you are not seeing problems?

Yes, with fewer agents it works fine.

Could it be due to an unlimited number of coroutines being created for the large number of paths?

Does the proxy fan out to all registered paths with unbounded coroutines?

The dispatcher is probably getting overwhelmed by these thousands of coroutines. I've even tried upscaling the node to 16 cores, but capacity doesn't seem to be the issue.

Also, I've tried running a benchmark (ab) against a single path and it works fine, so it looks like the hundreds of paths/endpoints (coroutines) for the agents are saturating it.

➜  ~ ab -n 1500 -c 100 http://promproxy-us-east-1.aws-prod.example.com:8080/metrics
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking promproxy-us-east-1.aws-prod.example.com (be patient)
Completed 150 requests
Completed 300 requests
Completed 450 requests
Completed 600 requests
Completed 750 requests
Completed 900 requests
Completed 1050 requests
Completed 1200 requests
Completed 1350 requests
Completed 1500 requests
Finished 1500 requests


Server Software:        Ktor/3.1.1
Server Hostname:        promproxy-us-east-1.aws-prod.example.com
Server Port:            8080

Document Path:          /metrics
Document Length:        13 bytes

Concurrency Level:      100
Time taken for tests:   11.734 seconds
Complete requests:      1500
Failed requests:        0
Non-2xx responses:      1500
Total transferred:      382500 bytes
HTML transferred:       19500 bytes
Requests per second:    127.84 [#/sec] (mean)
Time per request:       782.251 [ms] (mean)
Time per request:       7.823 [ms] (mean, across all concurrent requests)
Transfer rate:          31.83 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      356  361   4.5    359     384
Processing:   357  368  11.6    363     420
Waiting:      357  367  10.7    363     419
Total:        714  729  13.4    723     777

Percentage of the requests served within a certain time (ms)
  50%    723
  66%    726
  75%    735
  80%    740
  90%    750
  95%    758
  98%    765
  99%    772
 100%    777 (longest request)
➜  ~

mhussain584 avatar Jul 01 '25 17:07 mhussain584

Are you running the proxy in a Docker container? I am wondering if we need to adjust the JVM flags.

pambrose avatar Jul 01 '25 18:07 pambrose

Yes, I am running it in a Docker container.

mhussain584 avatar Jul 01 '25 18:07 mhussain584

What JVM configuration would be appropriate?

  -e JAVA_OPTS="-Xms4g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=100"

Or maybe:

docker run --rm \
  -p 8082:8082 -p 8092:8092 -p 50051:50051 -p 8080:8080 \
  --env ADMIN_ENABLED=true \
  --env METRICS_ENABLED=true \
  --env JAVA_TOOL_OPTIONS="-Xms4g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=100" \
  pambrose/prometheus-proxy:2.1.0

mhussain584 avatar Jul 01 '25 18:07 mhussain584

Coroutines are lightweight, so I am not worried about that.

My guess is the issue is either with the JVM settings or the gRPC settings. I see the failure in the stacktrace above, but do we have any other clues about what the failure is?

pambrose avatar Jul 01 '25 18:07 pambrose

And when you say there is throttling, are you referring to the failure being seen at a certain level of clients, i.e., not something more specific than that?

pambrose avatar Jul 01 '25 18:07 pambrose

I see those error traces when the CPU goes over 100% and stays there for a while; other than that, I cannot find any other exception. I could use your help validating the JVM settings, in case there is anything else I am missing.

/app $ java -XX:+PrintFlagsFinal -version | grep HeapSize
   size_t ErgoHeapSizeLimit                        = 0                                         {product} {default}
   size_t HeapSizePerGCThread                      = 43620760                                  {product} {default}
   size_t InitialHeapSize                          = 515899392                                 {product} {ergonomic}
   size_t LargePageHeapSizeThreshold               = 134217728                                 {product} {default}
   size_t MaxHeapSize                              = 8229224448                                {product} {ergonomic}
   size_t MinHeapSize                              = 8388608                                   {product} {ergonomic}
    uintx NonNMethodCodeHeapSize                   = 7602480                                {pd product} {ergonomic}
    uintx NonProfiledCodeHeapSize                  = 122027880                              {pd product} {ergonomic}
    uintx ProfiledCodeHeapSize                     = 122027880                              {pd product} {ergonomic}
   size_t SoftMaxHeapSize                          = 8229224448                             {manageable} {ergonomic}
openjdk version "17.0.14" 2025-01-21
OpenJDK Runtime Environment (build 17.0.14+7-alpine-r0)
OpenJDK 64-Bit Server VM (build 17.0.14+7-alpine-r0, mixed mode, sharing)
/app $

mhussain584 avatar Jul 01 '25 18:07 mhussain584

And when you say there is throttling, are you referring to the failure being seen at a certain level of clients, i.e., not something more specific than that?

The failure is not tied to particular clients. As I said earlier, there are 1300+ endpoints, some of which are dummy, and the failures are not limited to the dummy ones.

mhussain584 avatar Jul 01 '25 18:07 mhussain584

Hi @pambrose, I work with @mhussain584, so I know the context behind this issue. Just one correction here: we have only 1 prom-agent, but it has 1300+ different pathConfigs (not 1300 agents).

About your question, we haven't checked whether failures appear at a certain number of clients. It was working fine with around 75 pathConfigs; today we added 1250+ more pathConfigs and saw this issue. Prometheus reports the scrape error as:

Get "http://promproxy-us-east-1.aws-prod.example.com:8080/blmxxx_hssidsmac_virtual_11_10_147_232_104_hs_metrics": context deadline exceeded

pkazi avatar Jul 01 '25 18:07 pkazi

Hi @pkazi,

That error looks helpful. Do we know who is giving that message? When I google "context deadline exceeded" I see all the usual suspects. It doesn't sound JVM or gRPC related, though.

Does this look like a possibility? https://signoz.io/guides/context-deadline-exceeded-prometheus/

pambrose avatar Jul 01 '25 22:07 pambrose

Also, if we are unable to get to the bottom of this, would it be possible to run multiple prom-agents and distribute the 1300+ pathConfigs across them? Do not get me wrong, I want to identify the underlying issue.
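
For example, a rough sketch of what that split could look like, with each agent instance started from its own config file covering a subset of the paths (the file names and the exact split below are just placeholders):

# agent-1.conf (hypothetical): first subset of the pathConfigs
agent {
  proxy.hostname: "promproxy-us-east-1.aws-prod.example.com:50051"
  pathConfigs: [
    {
      name: "blmxx01.iosserver-5xxxxxxxx38.VIRTUAL_01"
      path: "blmxx01_iosserver_virtual_01_10_147_234_98_hs_metrics"
      url: "http://10.147.234.98/VIRTUAL/01/hs/metrics"
    }
    # ... roughly half of the 1300+ paths ...
  ]
}

# agent-2.conf (hypothetical): the remaining pathConfigs
agent {
  proxy.hostname: "promproxy-us-east-1.aws-prod.example.com:50051"
  pathConfigs: [
    # ... the other half of the paths ...
  ]
}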

pambrose avatar Jul 01 '25 22:07 pambrose

Thanks for the correction, @pkazi. @pambrose, my assumption is that the "context deadline exceeded" error above is a cascading effect of what is happening in the proxy, as my scrape job config consists of the following:

- job_name: hy_device_metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 30s
  metrics_path: /{__metrics_path__}
  scheme: http
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [__meta_sd_labels]
    separator: ;
    regex: .*cloud=(.+?),.*service=(.+?),.*ip=(.+?)
    target_label: cloud
    replacement: $1
    action: replace
  - source_labels: [__meta_sd_labels]
    separator: ;
    regex: .*cloud=(.+?),.*service=(.+?),.*ip=(.+?)
    target_label: service
    replacement: $2
    action: replace
  - source_labels: [__meta_sd_labels]
    separator: ;
    regex: .*cloud=(.+?),.*service=(.+?),.*ip=(.+?)
    target_label: ip
    replacement: $3
    action: replace
  http_sd_configs:
  - follow_redirects: true
    enable_http2: true
    refresh_interval: 30s
    url: http://promproxy-us-east-1.aws-prod.example.com:8080/discovery

So, when I explicitly curl http://promproxy-us-east-1.aws-prod.example.com:8080/discovery, I can see that Prometheus scrapes http://proxy:8080/{metrics_path} per target,

which causes Prometheus to make a call for every cradle, every IP, every service, etc., for example:

http://promproxy-us-east-1.aws-prod.example.com:8080/blmcm05_hssidsmac_virtual_09_10_147_92_196_hs_metrics
➜  ~ time curl -s http://promproxy-us-east-1.aws-prod.example.com:8080/discovery

[
    {
        "targets": [
            "promproxy-us-east-1.aws-prod.example.com:8080"
        ],
        "labels": {
            "__metrics_path__": "blmxx05_hssidsmac_virtual_09_10_147_92_196_hs_metrics",
            "agentName": "blmxx-siteadmin",
            "hostName": "promproxy-us-east-1.aws-prod.example.com",
            "cloud": "blmxx05",
            "service": "hssidsmac",
            "ip": "10.147.92.196",
            "cradle": "09"
        }
    },
.............
]curl -s http://promproxy-us-east-1.aws-prod.example.com:8080/discovery  0.02s user 0.09s system 0% cpu 30.162 total

This also means the promproxy itself is fanning out for each metrics path; each endpoint like /blmxx05_hssidsmac_virtual_09_10_147_92_196_hs_metrics results in a fan-out call to

http://10.147.92.196/VIRTUAL/09/hs/metrics

Could this mean the proxy is handling hundreds of simultaneous fan-outs, each triggered by Prometheus's scrapes, all within the 30s timeout window?

Resulting in:

  1. The 30+ second /discovery response time
  2. Eventually, the "context deadline exceeded" errors during scrapes

I am open to correction in my understanding of the issue. I can try increasing the scrape timeout from 30s to 60s, but will that help, given that the proxy itself is throttling?

mhussain584 avatar Jul 02 '25 04:07 mhussain584

I am not sure what you are suggesting with regard to the fan-out that is occurring. I am assuming you want the fan-out to occur, or are you saying that it should not be happening?

Yeah, let's try increasing the timeout and see what the effect is.

I am not sure what you mean by the proxy throttling. If it has to scrape a significant number of endpoints, it sounds like it is not completing in time. I am not aware of any explicit throttling in my code.

pambrose avatar Jul 02 '25 06:07 pambrose

I meant that the fan-out is happening, but too aggressively (maybe it's my lack of understanding).

However, I tried accessing a single endpoint from Prometheus to the promproxy, and it took around 15 seconds:

/prometheus $ time wget http://promproxy-us-east-1.aws-prod.example.com:8080/blmxx06_hssidsmac_virtual_05_10_147_232_109_hs_metrics
Connecting to promproxy-us-east-1.aws-prod.example.com:8080 (100.65.70.117:8080)
saving to 'blmxx06_hssidsmac_virtual_05_10_147_232_109_hs_metrics'
blmxx06_hssidsmac_vi 100% |***************************************************************************************************************************| 20982  0:00:00 ETA
'blmxx06_hssidsmac_virtual_05_10_147_232_109_hs_metrics' saved
real	0m 15.18s
user	0m 0.00s
sys	0m 0.00s
/prometheus $

I tried updating the refresh_interval from 30s to the default 60s, but it didn't help either.

Could you please guide me on the way forward? So far the prometheus-proxy seems to be the problematic component.

mhussain584 avatar Jul 02 '25 08:07 mhussain584

Updating the refresh_interval would increase the number of requests, which would not likely help. I was hoping you would increase the Prometheus scrape timeout value and see if that helps.

pambrose avatar Jul 02 '25 15:07 pambrose

My bad, I've reverted the refresh_interval and increased the scrape_timeout for the job from 30s to 60s. It looks a little better for a while (most of the targets stay healthy), but targets eventually error out when slow responses from the proxy hit the scrape_timeout limit of 1 min.

Error seen once the targets cannot be scraped within 1 min:

Get "http://promproxy-us-east-1.aws-prod.example.com:8080/blmxx06_hssidsmac_virtual_09_10_147_232_104_hs_metrics": context deadline exceeded

mhussain584 avatar Jul 02 '25 16:07 mhussain584

For the sake of experimentation, can you increase the scrape_timeout to 180 secs and see what happens?

The next question is why we are not completing in the allotted amount of time. Do you have a sense of how long each scrape endpoint is taking?

pambrose avatar Jul 02 '25 17:07 pambrose

I looked at the code and I see that the agent is scraping sequentially. It might be interesting to not wait for scrape completions before moving on to the next scrape and to do them concurrently. Let me play with it.

pambrose avatar Jul 02 '25 17:07 pambrose

I added support for specifying the maximum number of concurrent scrapes and pushed version 2.2.1-beta1 to Docker Hub. The args are mentioned here. You can use --max_concurrent_scrapes from the CLI, the env var MAX_CONCURRENT_SCRAPES, or the config variable agent.maxConcurrentScrapes.
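
For example, a minimal agent config sketch (the value 100 below is only an illustrative starting point; the rest of the agent block is assumed from the snippet earlier in this thread):

agent {
  proxy.hostname: "promproxy-us-east-1.aws-prod.example.com:50051"

  # Cap the number of path scrapes the agent runs at the same time
  maxConcurrentScrapes: 100

  pathConfigs: [
    # ... existing pathConfigs ...
  ]
}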

Try an initial value of 50 or 100 and let's see what happens.

pambrose avatar Jul 03 '25 03:07 pambrose

Sure , let me try it and come back

mhussain584 avatar Jul 03 '25 05:07 mhussain584

I've tried updating the tag above to 2.2.1-beta1, but the agent fails to connect to the promproxy.

07:14:28.619 WARN  [Agent.kt:223] - Cannot connect to proxy at promproxy-us-east-1.aws-prod.example.com:50051 StatusException UNAVAILABLE: io exception
Channel Pipeline: [HttpProxyHandler$HttpClientCodecWrapper#0, HttpProxyHandler#0, ProtocolNegotiators$ProxyProtocolNegotiationHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0] [Agent blmxx-siteadmin]
07:14:28.620 INFO  [Agent.kt:230] - Waited 0s to reconnect [Agent blmxx-siteadmin]
07:14:28.620 INFO  [AgentGrpcService.kt:134] - Creating gRPC stubs [Agent blmxx-siteadmin]
07:14:28.627 INFO  [GrpcDsl.kt:75] - Creating connection for gRPC server at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext [Agent blmxx-siteadmin]
07:14:28.628 INFO  [Agent.kt:146] - Resetting agentId [Agent blmxx-siteadmin]
07:14:28.628 INFO  [AgentGrpcService.kt:174] - Connecting to proxy at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext... [Agent blmxx-siteadmin]
07:14:28.766 INFO  [AgentGrpcService.kt:185] - Cannot connect to proxy at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext - StatusException: UNAVAILABLE: io exception

I am using the following variable to connect to the promproxy (50051) via my squid proxy (8080):

GRPC_PROXY_EXP=192.x.x.x:8080

It's working with version 2.1.0, so for now I've reverted back to that version.

mhussain584 avatar Jul 03 '25 07:07 mhussain584

Nothing related to making a connection was changed. Can you have another look at it and try it again?

pambrose avatar Jul 03 '25 14:07 pambrose

I've validated it again: upgrading to 2.2.1-beta1 causes the agent to break the connection with the promproxy, while 2.1.0 works for me. Were any connection-related changes made in 2.2.0?

Error trace

15:00:28.655 ERROR [AgentGrpcService.kt:185] - Cannot connect to proxy at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext - StatusException: UNAVAILABLE: Unable to resolve host promproxy-us-east-1.aws-prod.example.com [Agent blmxx-siteadmin]
Jul 03, 2025 3:00:29 PM io.grpc.internal.ManagedChannelImpl$NameResolverListener handleErrorInSyncContext
WARNING: [Channel<1>: (promproxy-us-east-1.aws-prod.example.com:50051)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host promproxy-us-east-1.aws-prod.example.com, cause=java.lang.RuntimeException: java.net.UnknownHostException: promproxy-us-east-1.aws-prod.example.com
	at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:222)
	at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:281)
	at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.net.UnknownHostException: promproxy-us-east-1.aws-prod.example.com
	at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
	at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:642)
	at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:218)
	... 5 more
}
15:00:31.655 INFO  [Agent.kt:237] - Waited 3s to reconnect [Agent blmxx-siteadmin]
15:00:31.656 INFO  [AgentGrpcService.kt:174] - Connecting to proxy at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext... [Agent blmxx-siteadmin]

mhussain584 avatar Jul 03 '25 15:07 mhussain584

No changes have been made in the connection logic in a long time. The stacktrace suggests a DNS or naming issue. Can you ping promproxy-us-east-1.aws-prod.example.com:50051?

pambrose avatar Jul 03 '25 15:07 pambrose

Yes, that is the issue; there is no change whatsoever from my end apart from the beta image.

curl -v -x http://192.x.x.x:8080 http://promproxy-us-east-1.aws-prod.example.com:50051 --output -
*   Trying 192.x.x.x:8080...
* Connected to 192.x.x.x (192.x.x.x) port 8080
> GET http://promproxy-us-east-1.aws-prod.example.com:50051/ HTTP/1.1
> Host: promproxy-us-east-1.aws-prod.example.com:50051
> User-Agent: curl/8.7.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
* Request completely sent off
< HTTP/1.1 200 OK
< Server: squid/3.5.12
< Mime-Version: 1.0
< Date: Thu, 03 Jul 2025 16:01:24 GMT
< X-Transformed-From: HTTP/0.9
< X-Cache: MISS from 36ffa2edc61a
< X-Cache-Lookup: MISS from 36ffa2edc61a:3128
< Transfer-Encoding: chunked
< Via: 1.1 36ffa2edc61a (squid/3.5.12)
< Connection: keep-alive
<
* Connection #0 to host 192.x.x.x left intact
���+���Unexpected HTTP/1.x request: GET /

mhussain584 avatar Jul 03 '25 16:07 mhussain584

Hmmm. I am not sure how to proceed. Hostname lookups are not something I have any control over. What do you think of trying to connect to the server using the uberjar, outside of Docker? For example: java -jar prometheus-agent.jar --config conffile.config

If you are able to connect, then that suggests the problem is in the Docker DNS.

pambrose avatar Jul 03 '25 16:07 pambrose

The DNS issue is fixed now. It looks like the experimental GRPC_PROXY_EXP has been deprecated and removed in grpc-java's latest version: https://github.com/grpc/grpc-java/pull/11988/files

https://github.com/grpc/grpc-java/releases/tag/v1.73.0

Instead, I've used JAVA_TOOL_OPTIONS:

JAVA_TOOL_OPTIONS: "-Dhttps.proxyHost=192.x.x.x -Dhttps.proxyPort=8080"

With this change, DNS resolution worked.

Furthermore, I applied the new tag 2.2.1-beta1 and set the following env variable at the agent level:

MAX_CONCURRENT_SCRAPES: 100

I see the targets are being scraped, but the CPU on the promproxy is still hitting over 100%, with the following exception in the log:

17:28:41.167 INFO  [ProxyHttpConfig.kt:103] -  Throwable caught: ClosedByteChannelException [DefaultDispatcher-worker-22]
io.ktor.utils.io.ClosedByteChannelException: Broken pipe
	at io.ktor.utils.io.CloseToken$wrapCause$1.invoke(CloseToken.kt:16)
	at io.ktor.utils.io.CloseToken$wrapCause$1.invoke(CloseToken.kt:16)
	at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
	at io.ktor.utils.io.CloseToken.wrapCause$default(CloseToken.kt:16)
	at io.ktor.utils.io.ByteChannel.getClosedCause(ByteChannel.kt:61)
	at io.ktor.utils.io.ByteReadChannelOperationsKt.rethrowCloseCauseIfNeeded(ByteReadChannelOperations.kt:543)
	at io.ktor.utils.io.ByteChannel.flush(ByteChannel.kt:94)
	at io.ktor.utils.io.CloseHookByteWriteChannel.flush(CloseHookByteWriteChannel.kt)
	at io.ktor.utils.io.ByteReadChannelOperationsKt.copyTo(ByteReadChannelOperations.kt:206)
	at io.ktor.utils.io.ByteReadChannelOperationsKt$copyTo$2.invokeSuspend(ByteReadChannelOperations.kt)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTaskKt.resume(DispatchedTask.kt:233)
	at kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined(DispatchedTask.kt:175)
	at kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:147)
	at kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:470)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:504)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeWith(CancellableContinuationImpl.kt:359)
	at io.ktor.utils.io.ByteChannel$Slot$Task$DefaultImpls.resume(ByteChannel.kt:239)
	at io.ktor.utils.io.ByteChannel$Slot$Read.resume(ByteChannel.kt:242)
	at io.ktor.utils.io.ByteChannel.closeSlot(ByteChannel.kt:177)
	at io.ktor.utils.io.ByteChannel.flushAndClose(ByteChannel.kt:133)
	at io.ktor.utils.io.ByteWriteChannelOperationsKt$writer$job$1.invokeSuspend(ByteWriteChannelOperations.kt:184)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
	at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:586)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:829)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:717)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:704)
Caused by: io.ktor.utils.io.ClosedWriteChannelException: Broken pipe
	at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
	at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
	at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
	at io.ktor.utils.io.CloseToken.throwOrNull(CloseToken.kt:26)
	at io.ktor.utils.io.ByteChannel.getWriteBuffer(ByteChannel.kt:54)
	at io.ktor.utils.io.ByteWriteChannelOperationsKt.writeByte(ByteWriteChannelOperations.kt:19)
	at io.ktor.http.cio.internals.CharsKt.writeIntHex(Chars.kt:110)
	at io.ktor.http.cio.ChunkedTransferEncodingKt.writeChunk(ChunkedTransferEncoding.kt:167)
	at io.ktor.http.cio.ChunkedTransferEncodingKt.access$writeChunk(ChunkedTransferEncoding.kt:1)
	at io.ktor.http.cio.ChunkedTransferEncodingKt.encodeChunked(ChunkedTransferEncoding.kt:138)
	at io.ktor.http.cio.ChunkedTransferEncodingKt$encodeChunked$2.invokeSuspend(ChunkedTransferEncoding.kt)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
	at kotlinx.coroutines.EventLoop.processUnconfinedEvent(EventLoop.common.kt:65)
	at kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined(DispatchedTask.kt:243)
	at kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:147)
	at kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:470)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:504)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeWith(CancellableContinuationImpl.kt:359)
	at io.ktor.utils.io.ByteChannel$Slot$Task$DefaultImpls.resume(ByteChannel.kt:236)
	at io.ktor.utils.io.ByteChannel$Slot$Read.resume(ByteChannel.kt:242)
	at io.ktor.utils.io.ByteChannel.flushWriteBuffer(ByteChannel.kt:389)
	at io.ktor.utils.io.ByteChannel.flush(ByteChannel.kt:96)
	at io.ktor.utils.io.ByteChannel.flushAndClose(ByteChannel.kt:128)
	... 7 common frames omitted
Caused by: io.ktor.utils.io.ClosedWriteChannelException: Broken pipe
	at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
	at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
	at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
	at io.ktor.utils.io.CloseToken.throwOrNull(CloseToken.kt:26)
	at io.ktor.utils.io.ByteChannel.getWriteBuffer(ByteChannel.kt:54)
	at io.ktor.utils.io.ByteReadChannelOperationsKt.copyTo(ByteReadChannelOperations.kt:175)
	at io.ktor.utils.io.ByteReadChannelOperationsKt$copyTo$1.invokeSuspend(ByteReadChannelOperations.kt)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
	at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:113)
	at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:89)
	at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:586)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:820)
	... 2 common frames omitted
Caused by: java.io.IOException: Broken pipe
	at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:62)
	at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:132)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:97)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:53)
	at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:532)
	at io.ktor.network.sockets.CIOWriterKt$attachForWritingDirectImpl$1.invokeSuspend$lambda$1(CIOWriter.kt:42)
	at io.ktor.utils.io.core.ByteReadPacketExtensions_jvmKt.read(ByteReadPacketExtensions.jvm.kt:30)
	at io.ktor.network.sockets.CIOWriterKt$attachForWritingDirectImpl$1.invokeSuspend(CIOWriter.kt:78)
	... 8 common frames omitted

mhussain584 avatar Jul 03 '25 17:07 mhussain584

Great!

What happens if you reduce the number of endpoints?

pambrose avatar Jul 03 '25 17:07 pambrose