CPU throttling on PromProxy node with 1300+ agents
Hi Paul, hope you are doing well.
Thanks again for your help with the 1.23.1 release last time, which fixed my issue.
Now I've encountered another issue: I am sending metrics from 1300+ agents (including 500+ dummy agents), and the prometheus-proxy service is experiencing CPU throttling.
Current node resources:
8 cores
32 GB memory
I've upgraded both the agent and the proxy to 2.1.0, but the throttling is still evident.
PromProxy logs:
16:33:30.299 INFO [CallLogging.kt:48] - 200 OK: GET - /blmxx04_hssidsmac_virtual_09_10_147_233_111_hs_metrics - ip-100-65-70-119.ec2.internal [DefaultDispatcher-worker-31]
16:33:30.302 INFO [CallLogging.kt:48] - 200 OK: GET - /blmxx06_hssidsmac_virtual_09_10_147_233_117_hs_metrics - ip-100-65-70-119.ec2.internal [DefaultDispatcher-worker-36]
16:33:30.305 INFO [CallLogging.kt:48] - 503 Service Unavailable: GET - /blmxx03_hssidsmac_virtual_15_10_147_233_105_hs_metrics - ip-100-65-70-119.ec2.internal [DefaultDispatcher-worker-62]
16:33:30.305 INFO [ProxyHttpConfig.kt:103] - Throwable caught: ClosedByteChannelException [DefaultDispatcher-worker-7]
io.ktor.utils.io.ClosedByteChannelException: Broken pipe
at io.ktor.utils.io.CloseToken$wrapCause$1.invoke(CloseToken.kt:16)
at io.ktor.utils.io.CloseToken$wrapCause$1.invoke(CloseToken.kt:16)
at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
at io.ktor.utils.io.CloseToken.wrapCause$default(CloseToken.kt:16)
at io.ktor.utils.io.ByteChannel.getClosedCause(ByteChannel.kt:61)
at io.ktor.utils.io.ByteReadChannelOperationsKt.rethrowCloseCauseIfNeeded(ByteReadChannelOperations.kt:543)
at io.ktor.utils.io.ByteChannel.flush(ByteChannel.kt:94)
at io.ktor.utils.io.CloseHookByteWriteChannel.flush(CloseHookByteWriteChannel.kt)
at io.ktor.utils.io.ByteReadChannelOperationsKt.copyTo(ByteReadChannelOperations.kt:206)
at io.ktor.utils.io.ByteReadChannelOperationsKt$copyTo$2.invokeSuspend(ByteReadChannelOperations.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTaskKt.resume(DispatchedTask.kt:233)
at kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined(DispatchedTask.kt:175)
at kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:147)
at kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:470)
at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:504)
at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
at kotlinx.coroutines.CancellableContinuationImpl.resumeWith(CancellableContinuationImpl.kt:359)
at io.ktor.utils.io.ByteChannel$Slot$Task$DefaultImpls.resume(ByteChannel.kt:239)
at io.ktor.utils.io.ByteChannel$Slot$Read.resume(ByteChannel.kt:242)
at io.ktor.utils.io.ByteChannel.closeSlot(ByteChannel.kt:177)
at io.ktor.utils.io.ByteChannel.flushAndClose(ByteChannel.kt:133)
at io.ktor.utils.io.ByteWriteChannelOperationsKt$writer$job$1.invokeSuspend(ByteWriteChannelOperations.kt:184)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:586)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:829)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:717)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:704)
Caused by: io.ktor.utils.io.ClosedWriteChannelException: Broken pipe
at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
at io.ktor.utils.io.CloseToken.throwOrNull(CloseToken.kt:26)
at io.ktor.utils.io.ByteChannel.getWriteBuffer(ByteChannel.kt:54)
at io.ktor.utils.io.ByteWriteChannelOperationsKt.writeByte(ByteWriteChannelOperations.kt:19)
at io.ktor.http.cio.internals.CharsKt.writeIntHex(Chars.kt:110)
at io.ktor.http.cio.ChunkedTransferEncodingKt.writeChunk(ChunkedTransferEncoding.kt:167)
at io.ktor.http.cio.ChunkedTransferEncodingKt.access$writeChunk(ChunkedTransferEncoding.kt:1)
at io.ktor.http.cio.ChunkedTransferEncodingKt.encodeChunked(ChunkedTransferEncoding.kt:138)
at io.ktor.http.cio.ChunkedTransferEncodingKt$encodeChunked$2.invokeSuspend(ChunkedTransferEncoding.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
at kotlinx.coroutines.EventLoop.processUnconfinedEvent(EventLoop.common.kt:65)
at kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined(DispatchedTask.kt:243)
at kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:147)
at kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:470)
at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:504)
at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
at kotlinx.coroutines.CancellableContinuationImpl.resumeWith(CancellableContinuationImpl.kt:359)
at io.ktor.utils.io.ByteChannel$Slot$Task$DefaultImpls.resume(ByteChannel.kt:236)
at io.ktor.utils.io.ByteChannel$Slot$Read.resume(ByteChannel.kt:242)
at io.ktor.utils.io.ByteChannel.flushWriteBuffer(ByteChannel.kt:389)
at io.ktor.utils.io.ByteChannel.flush(ByteChannel.kt:96)
at io.ktor.utils.io.ByteChannel.flushAndClose(ByteChannel.kt:128)
... 7 common frames omitted
Caused by: io.ktor.utils.io.ClosedWriteChannelException: Broken pipe
at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
at io.ktor.utils.io.ByteChannel$writeBuffer$1.invoke(ByteChannel.kt:54)
at io.ktor.utils.io.CloseToken.wrapCause(CloseToken.kt:21)
at io.ktor.utils.io.CloseToken.throwOrNull(CloseToken.kt:26)
at io.ktor.utils.io.ByteChannel.getWriteBuffer(ByteChannel.kt:54)
at io.ktor.utils.io.ByteReadChannelOperationsKt.copyTo(ByteReadChannelOperations.kt:175)
at io.ktor.utils.io.ByteReadChannelOperationsKt$copyTo$1.invokeSuspend(ByteReadChannelOperations.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100)
at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:113)
at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:89)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:586)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:820)
... 2 common frames omitted
Caused by: java.io.IOException: Broken pipe
at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:62)
at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:132)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:97)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:53)
at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:532)
at io.ktor.network.sockets.CIOWriterKt$attachForWritingDirectImpl$1.invokeSuspend$lambda$1(CIOWriter.kt:42)
at io.ktor.utils.io.core.ByteReadPacketExtensions_jvmKt.read(ByteReadPacketExtensions.jvm.kt:30)
at io.ktor.network.sockets.CIOWriterKt$attachForWritingDirectImpl$1.invokeSuspend(CIOWriter.kt:78)
... 8 common frames omitted
16:33:30.306 INFO [CallLogging.kt:48] - 200 OK: GET - /blmxx03_hssidsmac_virtual_04_10_147_234_71_hs_metrics - ip-100-65-70-119.ec2.internal [DefaultDispatcher-worker-7]
Container resource usage
prom-agent.conf (just a snippet):
proxy {
admin.debugEnabled = true
admin.enabled: true
metrics.enabled: true
http.requestLoggingEnabled: true
}
agent {
proxy.hostname: "promproxy-us-east-1.aws-prod.examplecom:50051"
admin.enabled: true
metrics.enabled: true
pathConfigs: [
{
name: "Agent System metrics"
path: blmxx_agent_sys_metrics
url: "http://10.147.92.199:9100/metrics"
},
{
name: "Agent metrics"
path: blmxx_agent_metrics
labels: "{\"cloud\": \"blmxx\"}"
url: "http://localhost:8083/metrics"
},
{
name: "blmxx01.iosserver-5xxxxxxxx38.app_metrics"
path: "blmxx01_iosserver_app_metrics_10_147_234_98_9091"
labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\"}"
url: "http://10.147.234.98:9091/metrics"
},
{
name: "blmxx01.iosserver-5xxxxxxxx38.sys_metrics"
path: "blmxx01_iosserver_sys_metrics_10_147_234_98_9100"
labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\"}"
url: "http://10.147.234.98:9100/metrics"
},
{
name: "blmxx01.iosserver-5xxxxxxxx38VIRTUAL_01"
path: "blmxx01_iosserver_virtual_01_10_147_234_98_hs_metrics"
labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"01\"}"
url: "http://10.147.234.98/VIRTUAL/01/hs/metrics"
},
{
name: "blmxx01.iosserver-5xxxxxxxx38.VIRTUAL_02"
path: "blmxx01_iosserver_virtual_02_10_147_234_98_hs_metrics"
labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"02\"}"
url: "http://10.147.234.98/VIRTUAL/02/hs/metrics"
},
{
name: "blmcm01.iosserver-5xxxxxxxx38.VIRTUAL_03"
path: "blmxx01_iosserver_virtual_03_10_147_234_98_hs_metrics"
labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"03\"}"
url: "http://10.147.234.98/VIRTUAL/03/hs/metrics"
},
{
name: "blmxx01.iosserver-5xxxxxxxx38.VIRTUAL_04"
path: "blmxx01_iosserver_virtual_04_10_147_234_98_hs_metrics"
labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"04\"}"
url: "http://10.147.234.98/VIRTUAL/04/hs/metrics"
},
{
name: "blmxx01.iosserver-5xxxxxxxx38.VIRTUAL_05"
path: "blmxx01_iosserver_virtual_05_10_147_234_98_hs_metrics"
labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"05\"}"
url: "http://10.147.234.98/VIRTUAL/05/hs/metrics"
},
{
name: "blmxx0.iosserver-5xxxxxxxx38.VIRTUAL_06"
path: "blmxx01_iosserver_virtual_06_10_147_234_98_hs_metrics"
labels: "{\"cloud\": \"blmxx01\", \"service\": \"iosserver\", \"ip\": \"10.147.234.98\", \"cradle\": \"06\"}"
url: "http://10.147.234.98/VIRTUAL/06/hs/metrics"
},
{
Your help on this will be appreciated.
Thanks
Weird. So with fewer agents you are not seeing problems?
Can you re-open the case? I've accidentally closed it.
Weird. So with fewer agents you are not seeing problems?
Yes, with fewer agents it works fine.
Could it be due to unlimited coroutines getting created for the large number of paths?
Does the proxy fan out to all registered paths with unbounded coroutines?
Probably the dispatcher is getting overwhelmed by these thousands of coroutines. I've even tried upscaling the node to 16 cores, but capacity doesn't seem to be the issue.
Also, I've tried running a benchmark (ab) with a single path and it works fine; it looks like the hundreds of paths/endpoints (coroutines) for the agents are saturating it.
➜ ~ ab -n 1500 -c 100 http://promproxy-us-east-1.aws-prod.example.com:8080/metrics
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking promproxy-us-east-1.aws-prod.example.com (be patient)
Completed 150 requests
Completed 300 requests
Completed 450 requests
Completed 600 requests
Completed 750 requests
Completed 900 requests
Completed 1050 requests
Completed 1200 requests
Completed 1350 requests
Completed 1500 requests
Finished 1500 requests
Server Software: Ktor/3.1.1
Server Hostname: promproxy-us-east-1.aws-prod.example.com
Server Port: 8080
Document Path: /metrics
Document Length: 13 bytes
Concurrency Level: 100
Time taken for tests: 11.734 seconds
Complete requests: 1500
Failed requests: 0
Non-2xx responses: 1500
Total transferred: 382500 bytes
HTML transferred: 19500 bytes
Requests per second: 127.84 [#/sec] (mean)
Time per request: 782.251 [ms] (mean)
Time per request: 7.823 [ms] (mean, across all concurrent requests)
Transfer rate: 31.83 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 356 361 4.5 359 384
Processing: 357 368 11.6 363 420
Waiting: 357 367 10.7 363 419
Total: 714 729 13.4 723 777
Percentage of the requests served within a certain time (ms)
50% 723
66% 726
75% 735
80% 740
90% 750
95% 758
98% 765
99% 772
100% 777 (longest request)
Are you running the proxy in a docker container? I am wondering if we need to adjust the jvm flags.
Yes, I am running it in a docker container.
What would be the appropriate JVM configuration?
-e JAVA_OPTS="-Xms4g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=100"
Or maybe:
docker run --rm \
-p 8082:8082 -p 8092:8092 -p 50051:50051 -p 8080:8080 \
--env ADMIN_ENABLED=true \
--env METRICS_ENABLED=true \
--env JAVA_TOOL_OPTIONS="-Xms4g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=100" \
pambrose/prometheus-proxy:2.1.0
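Before hardcoding -Xms/-Xmx, it may be worth checking what the containerized JVM actually detects. A minimal sketch (the class name is ours, and it must be run inside the container to see the cgroup limits):

```java
public class JvmSizing {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // On a container-aware JDK (10+), these reflect the container's
        // cgroup limits, not the host's.
        System.out.println("CPUs visible to JVM: " + rt.availableProcessors());
        System.out.println("max heap (MB): " + rt.maxMemory() / (1024 * 1024));
    }
}
```

If the reported heap is far from the container's memory limit, the standard `-XX:MaxRAMPercentage=75` flag scales the heap to the detected limit instead of pinning it to a fixed size.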
Coroutines are lightweight, so I am not worried about that.
My guess is the issue is either with the JVM settings or the gRPC settings. I see the failure in the stacktrace above, but do we have any other clues about what the failure is?
And when you say there is throttling, are you referring to the failure being seen at a certain level of clients, i.e., not something more specific than that?
I am able to see those error traces when the CPU goes over 100% and stays there for a while. Other than that, I cannot find any other exception. I could use your help validating the JVM settings, plus anything else I am missing.
/app $ java -XX:+PrintFlagsFinal -version | grep HeapSize
size_t ErgoHeapSizeLimit = 0 {product} {default}
size_t HeapSizePerGCThread = 43620760 {product} {default}
size_t InitialHeapSize = 515899392 {product} {ergonomic}
size_t LargePageHeapSizeThreshold = 134217728 {product} {default}
size_t MaxHeapSize = 8229224448 {product} {ergonomic}
size_t MinHeapSize = 8388608 {product} {ergonomic}
uintx NonNMethodCodeHeapSize = 7602480 {pd product} {ergonomic}
uintx NonProfiledCodeHeapSize = 122027880 {pd product} {ergonomic}
uintx ProfiledCodeHeapSize = 122027880 {pd product} {ergonomic}
size_t SoftMaxHeapSize = 8229224448 {manageable} {ergonomic}
openjdk version "17.0.14" 2025-01-21
OpenJDK Runtime Environment (build 17.0.14+7-alpine-r0)
OpenJDK 64-Bit Server VM (build 17.0.14+7-alpine-r0, mixed mode, sharing)
/app $
And when you say there is throttling, are you referring to the failure being seen at a certain level of clients, i.e., not something more specific than that?
The failure is not particular to certain clients. As I said earlier, there are 1300+ endpoints, some of which are dummy, and it's not that the failure occurs only on the dummy ones.
Hi @pambrose, I work with @mhussain584, so I know the context behind this issue. Just one correction here: we have only 1 prom-agent, but it has 1300+ different pathConfigs (not 1300 agents).
About your question, we haven't checked whether failures are seen at a certain number of clients. It was working fine with around 75 pathConfigs; today we added 1250+ more pathConfigs and saw this issue. Prometheus gives the scrape error as:
Get "http://promproxy-us-east-1.aws-prod.example.com:8080/blmxxx_hssidsmac_virtual_11_10_147_232_104_hs_metrics": context deadline exceeded
Hi @pkazi,
That error looks helpful. Do we know who is giving that message? When I google "context deadline exceeded" I see all the usual suspects. It doesn't sound JVM or gRPC related, though.
Does this look like a possibility? https://signoz.io/guides/context-deadline-exceeded-prometheus/
Also, if we are unable to get to the bottom of this, would it be possible to run multiple prom-agents and distribute the 1300+ pathConfigs across them? Do not get me wrong, I want to identify the underlying issue.
Thanks for the correction @pkazi. @pambrose, my assumption is that the "context deadline exceeded" error above is a cascading effect of what is happening in the proxy, as my scrape job config consists of the following:
- job_name: hy_device_metrics
honor_timestamps: true
scrape_interval: 1m
scrape_timeout: 30s
metrics_path: /{__metrics_path__}
scheme: http
follow_redirects: true
enable_http2: true
relabel_configs:
- source_labels: [__meta_sd_labels]
separator: ;
regex: .*cloud=(.+?),.*service=(.+?),.*ip=(.+?)
target_label: cloud
replacement: $1
action: replace
- source_labels: [__meta_sd_labels]
separator: ;
regex: .*cloud=(.+?),.*service=(.+?),.*ip=(.+?)
target_label: service
replacement: $2
action: replace
- source_labels: [__meta_sd_labels]
separator: ;
regex: .*cloud=(.+?),.*service=(.+?),.*ip=(.+?)
target_label: ip
replacement: $3
action: replace
http_sd_configs:
- follow_redirects: true
enable_http2: true
refresh_interval: 30s
url: http://promproxy-us-east-1.aws-prod.example.com:8080/discovery
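For what it's worth, the capture groups in the relabel_configs above can be checked in isolation. This sketch applies the same regex, fully anchored the way Prometheus anchors relabel regexes, to a hypothetical flattened label string; note that `,.*` is appended here so the ip group stops at the next comma instead of swallowing the rest of the string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelabelDemo {
    public static void main(String[] args) {
        // Hypothetical __meta_sd_labels value; the real format may differ.
        String meta = "agentName=blmxx-siteadmin,cloud=blmxx05,service=hssidsmac,ip=10.147.92.196,cradle=09";
        // Same capture groups as the scrape config, with ",.*" appended so
        // group 3 does not capture the labels that follow the ip.
        Pattern p = Pattern.compile(".*cloud=(.+?),.*service=(.+?),.*ip=(.+?),.*");
        Matcher m = p.matcher(meta);
        if (m.matches()) {
            System.out.println("cloud=" + m.group(1));   // blmxx05
            System.out.println("service=" + m.group(2)); // hssidsmac
            System.out.println("ip=" + m.group(3));      // 10.147.92.196
        }
    }
}
```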
So, when I explicitly curl http://promproxy-us-east-1.aws-prod.example.com:8080/discovery, I can see that Prometheus scrapes http://proxy:8080/{metrics_path} per target, which causes Prometheus to make a call for every cradle, every IP, every service, etc., e.g.:
http://promproxy-us-east-1.aws-prod.example.com:8080/blmcm05_hssidsmac_virtual_09_10_147_92_196_hs_metrics
➜ ~ time curl -s http://promproxy-us-east-1.aws-prod.example.com:8080/discovery
[
{
"targets": [
"promproxy-us-east-1.aws-prod.example.com:8080"
],
"labels": {
"__metrics_path__": "blmxx05_hssidsmac_virtual_09_10_147_92_196_hs_metrics",
"agentName": "blmxx-siteadmin",
"hostName": "promproxy-us-east-1.aws-prod.example.com",
"cloud": "blmxx05",
"service": "hssidsmac",
"ip": "10.147.92.196",
"cradle": "09"
}
},
.............
]curl -s http://promproxy-us-east-1.aws-prod.example.com:8080/discovery 0.02s user 0.09s system 0% cpu 30.162 total
This also means the promproxy itself is fanning out for each metrics path; each endpoint like /blmxx05_hssidsmac_virtual_09_10_147_92_196_hs_metrics does a fan-out call to
http://10.147.92.196/VIRTUAL/09/hs/metrics
Could this mean the proxy is handling hundreds of simultaneous fan-outs, each triggered by Prometheus's scrapes, all within the 30s timeout window?
Resulting in:
- The 30+ second /discovery response time
- Eventually the "context deadline exceeded" errors during scrape
I am open to correction in my understanding of the issue. I can try increasing the scrape timeout from 30s to 60s, but will that help if the proxy itself is throttling?
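A rough back-of-envelope check of that hypothesis, using the numbers from this thread (1300 targets, 1m scrape interval, and ~15s per proxied request as measured with wget below):

```java
public class ScrapeLoad {
    public static void main(String[] args) {
        int targets = 1300;        // pathConfigs served via /discovery
        double intervalSec = 60;   // scrape_interval: 1m
        double perScrapeSec = 15;  // observed latency of one proxied endpoint

        double arrivalRate = targets / intervalSec;   // requests/sec at the proxy
        double inFlight = arrivalRate * perScrapeSec; // Little's law: L = lambda * W

        System.out.printf("arrival rate: %.1f req/s%n", arrivalRate);  // ~21.7
        System.out.printf("avg in-flight requests: %.0f%n", inFlight); // 325
    }
}
```

With roughly 325 requests in flight on average, each holding a proxy-to-agent round trip open, a 30s scrape_timeout leaves very little headroom once anything slows down.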
I am not sure what you are suggesting with respect to the fan-out that is occurring. I am assuming you want the fan-out to occur, or are you saying that it should not be happening?
Yeah, let's try increasing the timeout time and see what the effect is.
I am not sure what you mean that the proxy is throttling. If it has to scrape a significant number of endpoints, it sounds like it is not completing in time. I am not aware of any explicit throttling in my code.
I meant that the fan-out is happening, but too aggressively (maybe it's my lack of understanding).
However, I've tried accessing a single endpoint from Prometheus via the promproxy and it took around 15 seconds:
/prometheus $ time wget http://promproxy-us-east-1.aws-prod.example.com:8080/blmxx06_hssidsmac_virtual_05_10_147_232_109_hs_metrics
Connecting to promproxy-us-east-1.aws-prod.example.com:8080 (100.65.70.117:8080)
saving to 'blmxx06_hssidsmac_virtual_05_10_147_232_109_hs_metrics'
blmxx06_hssidsmac_vi 100% |***************************************************************************************************************************| 20982 0:00:00 ETA
'blmxx06_hssidsmac_virtual_05_10_147_232_109_hs_metrics' saved
real 0m 15.18s
user 0m 0.00s
sys 0m 0.00s
/prometheus $
I tried updating the refresh_interval from 30s to the default 60s; it didn't help either.
Could you please guide me on the way forward? So far the prometheus-proxy seems to be the problematic piece.
Updating the refresh_interval would increase the number of requests, which would not likely help. I was hoping you would increase the Prometheus scrape timeout value and see if that helps.
My bad, I've reverted the refresh_interval and increased the scrape_timeout for the job from 30s to 60s. It looks a little better for a while (most of the targets stay healthy), but targets eventually error out when the proxy's slow responses hit the scrape_timeout limit of 1 min.
Exception once the targets are unable to scrape within 1 min
Get "http://promproxy-us-east-1.aws-prod.example.com:8080/blmxx06_hssidsmac_virtual_09_10_147_232_104_hs_metrics": context deadline exceeded
For the sake of experimentation, can you increase the scrape_timeout to 180 secs and see what happens?
The next question is why are we not completing in the allotted amount of time. Do you have a sense for how long each scrape endpoint is taking?
I looked at the code and I see that the agent is scraping sequentially. It might be interesting to not wait for each scrape to complete before moving on to the next, and instead do them concurrently. Let me play with it.
I added support for specifying the maximum number of concurrent scrapes and pushed version 2.2.1-beta1 to Docker Hub. The args are mentioned here. You can use --max_concurrent_scrapes from the CLI, the env var MAX_CONCURRENT_SCRAPES, or the config variable agent.maxConcurrentScrapes.
Try an initial value of 50 or 100 and let's see what happens.
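In case it helps to visualize what a concurrency cap buys: this is a sketch, not the actual prometheus-proxy implementation, of bounding in-flight scrapes with a semaphore the way a setting like agent.maxConcurrentScrapes presumably does (the sleep stands in for the HTTP fetch):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedScrapes {
    public static void main(String[] args) throws Exception {
        final int totalPaths = 1300;   // pathConfigs to scrape
        final int maxConcurrent = 100; // cf. the suggested starting value

        Semaphore permits = new Semaphore(maxConcurrent);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(200);

        for (int i = 0; i < totalPaths; i++) {
            pool.submit(() -> {
                try {
                    permits.acquire(); // blocks while maxConcurrent scrapes are in flight
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(2);   // stand-in for the HTTP fetch
                    inFlight.decrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    permits.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(60, TimeUnit.SECONDS);
        // The semaphore guarantees this never exceeds maxConcurrent.
        System.out.println("peak concurrent scrapes: " + peak.get());
    }
}
```

The point of the cap is that slow endpoints queue behind the semaphore instead of piling up unboundedly and saturating the scheduler.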
Sure, let me try it and get back to you.
I've tried updating the above tag to 2.2.1-beta1, but the agent fails to connect to the promproxy:
07:14:28.619 WARN [Agent.kt:223] - Cannot connect to proxy at promproxy-us-east-1.aws-prod.exmaple.com:50051 StatusException UNAVAILABLE: io exception
Channel Pipeline: [HttpProxyHandler$HttpClientCodecWrapper#0, HttpProxyHandler#0, ProtocolNegotiators$ProxyProtocolNegotiationHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0] [Agent blmxx-siteadmin]
07:14:28.620 INFO [Agent.kt:230] - Waited 0s to reconnect [Agent blmxx-siteadmin]
07:14:28.620 INFO [AgentGrpcService.kt:134] - Creating gRPC stubs [Agent blmxx-siteadmin]
07:14:28.627 INFO [GrpcDsl.kt:75] - Creating connection for gRPC server at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext [Agent blmxx-siteadmin]
07:14:28.628 INFO [Agent.kt:146] - Resetting agentId [Agent blmxx-siteadmin]
07:14:28.628 INFO [AgentGrpcService.kt:174] - Connecting to proxy at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext... [Agent blmxx-siteadmin]
07:14:28.766 INFO [AgentGrpcService.kt:185] - Cannot connect to proxy at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext - StatusException: UNAVAILABLE: io exception
I am using the following variable to connect to the promproxy (50051) via my squid (8080):
GRPC_PROXY_EXP=192.x.x.x:8080
It's working with version 2.1.0, so for now I've reverted back to it.
Nothing related to making a connection was changed. Can you have another look at it and try it again?
I've validated it again: upgrading to 2.2.1-beta1 causes the agent to break the connection with the promproxy, while 2.1.0 works for me. Were any connection-related changes made in 2.2.0?
Error trace:
15:00:28.655 ERROR [AgentGrpcService.kt:185] - Cannot connect to proxy at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext - StatusException: UNAVAILABLE: Unable to resolve host promproxy-us-east-1.aws-prod.example.com [Agent blmxx-siteadmin]
Jul 03, 2025 3:00:29 PM io.grpc.internal.ManagedChannelImpl$NameResolverListener handleErrorInSyncContext
WARNING: [Channel<1>: (promproxy-us-east-1.aws-prod.example.com:50051)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host promproxy-us-east-1.aws-prod.example.com, cause=java.lang.RuntimeException: java.net.UnknownHostException: promproxy-us-east-1.aws-prod.example.com
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:222)
at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:281)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.net.UnknownHostException: promproxy-us-east-1.aws-prod.example.com
at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:642)
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:218)
... 5 more
}
15:00:31.655 INFO [Agent.kt:237] - Waited 3s to reconnect [Agent blmxx-siteadmin]
15:00:31.656 INFO [AgentGrpcService.kt:174] - Connecting to proxy at promproxy-us-east-1.aws-prod.example.com:50051 using plaintext... [Agent blmxx-siteadmin]
No changes have been made to the connection logic in a long time. The stacktrace suggests a DNS or naming issue. Can you ping promproxy-us-east-1.aws-prod.example.com:50051?
Yes, that is the issue. There is no change whatsoever from my end apart from the beta image.
curl -v -x http://192.x.x.x:8080 http://promproxy-us-east-1.aws-prod.example.com:50051 --output -
* Trying 192.x.x.x:8080...
* Connected to 192.x.x.x (192.x.x.x) port 8080
> GET http://promproxy-us-east-1.aws-prod.example.com:50051/ HTTP/1.1
> Host: promproxy-us-east-1.aws-prod.example.com:50051
> User-Agent: curl/8.7.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
* Request completely sent off
< HTTP/1.1 200 OK
< Server: squid/3.5.12
< Mime-Version: 1.0
< Date: Thu, 03 Jul 2025 16:01:24 GMT
< X-Transformed-From: HTTP/0.9
< X-Cache: MISS from 36ffa2edc61a
< X-Cache-Lookup: MISS from 36ffa2edc61a:3128
< Transfer-Encoding: chunked
< Via: 1.1 36ffa2edc61a (squid/3.5.12)
< Connection: keep-alive
<
* Connection #0 to host 192.x.x.x left intact
���+���Unexpected HTTP/1.x request: GET /
Hmmm. I am not sure how to proceed. Hostname lookups are not something that I have any control over. What do you think of trying to connect to the server using the uberjar, outside of Docker? For example: java -jar prometheus-agent.jar --config conffile.config
If you are able to connect, then that suggests the problem is in the Docker DNS.
The DNS issue is fixed now. It looks like the experimental GRPC_PROXY_EXP has been deprecated and removed from grpc-java's latest version: https://github.com/grpc/grpc-java/pull/11988/files
https://github.com/grpc/grpc-java/releases/tag/v1.73.0
Instead, I've used JAVA_TOOL_OPTIONS:
JAVA_TOOL_OPTIONS: "-Dhttps.proxyHost=192.x.x.x -Dhttps.proxyPort=8080"
With this change, DNS worked.
Furthermore, I applied the new tag 2.2.1-beta1 and set the env variable at the agent level:
MAX_CONCURRENT_SCRAPES: 100
I see the targets are being scraped, but the CPU on the promproxy is still hitting over 100%, with the following exception in the log:
17:28:41.167 INFO [ProxyHttpConfig.kt:103] - Throwable caught: ClosedByteChannelException [DefaultDispatcher-worker-22]
io.ktor.utils.io.ClosedByteChannelException: Broken pipe
(... identical stack trace to the one above, again ultimately caused by java.io.IOException: Broken pipe ...)
Great!
What happens if you reduce the number of endpoints?