
I think the Prometheus output has a memory leak that crashes the server it runs on.

Sparc0 opened this issue 2 years ago • 38 comments

We have noticed strange behavior when we use any of the prometheus outputs: the RAM usage of gnmic just keeps going up and up until the server starts to swap and becomes unresponsive.

It works fine with around 20-30 devices, but when we try 400+ devices we start getting problems, and that's just half of the devices we have. I tried strings-as-labels both false and true. For the scrape-based prometheus output I tried lowering the cache timer, which did not work. For remote write I played with buffer-size, interval and max-time-series-per-write; nothing helps. I do not have any event-processors.

If I use the Kafka or file outputs instead, I do not see a continuous increase in RAM usage.

CPU usage is ~25% (5 min load average). Below are graphs of RAM usage and disk usage, where you can see that RAM stops increasing once we changed to the file output. In the picture you can see RAM usage went up a little at ~17:20; that was when we added the Kafka output.

insecure: true
encoding: PROTO
log: true
gzip: false
timeout: 60s
debug: false

api-server:
  address: :7890
  skip-verify: true
  enable-metrics: true


loader:
  type: http
  timeout: 60s
  enable-metrics: true
  tls:
    skip-verify: true
  url: "https://gnmic-inventory/gnmic/"


subscriptions:
  juniper:
    stream-subscriptions:
      - paths:
        - "/junos/system/linecard/interface"
        - "/interfaces/interface/ethernet/state/counters"
        stream-mode: sample
        sample-interval: 30s
      - paths:
        - "/interfaces/interface/state"
        - "/interfaces/interface/ethernet/state"
        stream-mode: sample
        sample-interval: 720s
      - paths:
        - "/interfaces/interface/state/oper-status"
        - "/interfaces/interface/state/admin-status"
        stream-mode: on-change
      - paths:
        - "/system/alarms"
        stream-mode: on-change
      - paths:
        - "/components/component"
        stream-mode: sample
        sample-interval: 60s
    encoding: proto
    mode: stream

outputs:
  prom-remote:
    type: prometheus_write
    url: https://collector-cortex/api/prom/push
    authorization:
      type: Bearer
      credentials: SECRET
    metric-prefix: gnmic
    buffer-size: 10000
    max-time-series-per-write: 1500
    interval: 10s
    max-retries: 0
    strings-as-labels: true
    enable-metrics: false
    tls:
      skip-verify: true
    debug: false

  file:
    type: file
    filename: /tmp/metrics.json
    format: event

  kafka-output:
    type: kafka
    name: gnmic
    address: REMOVED
    topic: metrics.juniper
    sasl:
      user: gnmic
      password: REMOVED
      mechanism: SCRAM-SHA-512
    tls:
      skip-verify: true
    format: event
    override-timestamps: false
    num-workers: 3
    debug: false


Sparc0 · Oct 02 '23

For the scrape-based prometheus output, what was the scrape period on the Prometheus side? For the remote-write one, which values of interval and buffer did you try out? Also, can you set debug: true under the prometheus output section and share the logs?

karimra · Oct 02 '23

The scrape-based output was on 15s intervals, scraping around 140-170k metrics each time.

For remote write I tried a lot: intervals of 0.5, 1, 5 and 10s; buffer-size of 1000, 10000, 30000, 100000 and 400000; max-time-series-per-write of 500, 1500, 150000 and 200000. The debug log is full of these types of messages:

2023/10/03 16:40:44.478213 [prometheus_write_output:prom-remote] buffer size reached, triggering write
2023/10/03 16:40:44.478220 [prometheus_write_output:prom-remote] saving metrics metadata
2023/10/03 16:40:44.478222 [prometheus_write_output:prom-remote] writing TimeSeries to buffer
2023/10/03 16:40:44.478230 [prometheus_write_output:prom-remote] buffer full, writing to remote
2023/10/03 16:40:44.478239 [prometheus_write_output:prom-remote] write triggered, buffer size: 1000
2023/10/03 16:40:44.481562 [prometheus_write_output:prom-remote] writing a 500 time series chunk
2023/10/03 16:40:44.485274 [prometheus_write_output:prom-remote] got event to buffer: {"name":"juniper","timestamp":1696344042153000000,"tags":{"...........}
2023/10/03 16:41:02.644834 [prometheus_write_output:prom-remote] write response failed, code=400, body=user=Networks: err: out of order sample. timestamp=2023-10-03T14:40:45.289Z, series={__name__="gnmic_interfaces_interface_init_time",}

With a higher buffer-size and max-time-series-per-write, still the same problem with increasing RAM.

2023/10/03 17:21:23.135720 [prometheus_write_output:prom-remote] write interval reached, writing to remote
2023/10/03 17:21:23.135783 [prometheus_write_output:prom-remote] write triggered, buffer size: 67980
2023/10/03 17:21:23.135736 [prometheus_write_output:prom-remote] saving metrics metadata
2023/10/03 17:21:23.135875 [prometheus_write_output:prom-remote] writing TimeSeries to buffer
2023/10/03 17:21:23.135954 [prometheus_write_output:prom-remote] got event to buffer: {"name":"juniper",

Sparc0 · Oct 03 '23

I was able to reproduce your issue and, I believe, fix it in #235. The root cause is the volume of gNMI Subscribe responses that get passed to the outputs to handle. The kafka output (and other outputs) can take advantage of multiple workers to handle the responses (the num-workers: 3 in your example), while the prometheus and prometheus_write outputs handle messages in sequence. The messages end up being queued and take up all the memory.

I added the option to add workers to both outputs as well as a timeout for queued messages.
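
For reference, a sketch of how those two options could slot into the prom-remote output from the original post; the values below are illustrative, not recommendations:

outputs:
  prom-remote:
    type: prometheus_write
    url: https://collector-cortex/api/prom/push
    interval: 10s
    buffer-size: 10000
    max-time-series-per-write: 1500
    num-workers: 10 # parallel handlers for incoming messages (the new option)
    timeout: 15s    # queued messages expire after this long if no worker picks them up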

I tested both the prometheus and prometheus_write outputs with num-workers: 10 while gNMIc subscribes to 400 targets, each sending 500 gNMI Subscribe Responses per second every 10s (totaling 200k metrics). I see a constant memory usage of ~600MB and ~450MB for prometheus and prometheus_write respectively.

Do you have an idea of the rate at which your 400 devices send gNMI notifications? The metric should be available under gNMIc's internal metrics.

karimra · Oct 04 '23

My grpc_client_msg_received_total metric shows between 10-12k msgs every 15s.

I compiled the fixed version and it has now been running stable for 2h+.

A little bump in memory usage, but that's expected.

Sparc0 · Oct 04 '23

Thanks for testing it. If you don't mind sharing: how many workers did you set? Did you test both prometheus and prometheus_write? How much was the memory usage bump?

karimra · Oct 04 '23

We hit another issue, as you can see in the picture above: our interface is only a 1Gb interface. Will come back with our final configuration when we solve that.

Our log is full of the following, but I believe it's because of the interface.

2023/10/04 14:34:12.474788 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.474813 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.474585 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.474970 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.474980 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.475001 [prometheus_write_output:prom-remote] writing expired after 15s
2023/10/04 14:34:12.475014 [prometheus_write_output:prom-remote] writing expired after 15s

Would it be possible to get some prometheus_write output metrics, like there are for the kafka output, to correlate with grpc_client_msg_received_total?

e.g. number_of_prometheus_write_msgs_sent_success_total and number_of_prometheus_write_msgs_sent_fail_total

Sparc0 · Oct 05 '23

Added a few metrics to both outputs:

gnmic_prometheus_output_number_of_prometheus_metrics_total 200000
gnmic_prometheus_write_output_metadata_msg_send_duration_ns 949876
gnmic_prometheus_write_output_msg_send_duration_ns 909873
gnmic_prometheus_write_output_number_of_prometheus_write_metadata_msgs_sent_fail_total{reason=""} 0
gnmic_prometheus_write_output_number_of_prometheus_write_metadata_msgs_sent_success_total 200
gnmic_prometheus_write_output_number_of_prometheus_write_msgs_sent_fail_total{reason=""} 0
gnmic_prometheus_write_output_number_of_prometheus_write_msgs_sent_success_total 1e+06

karimra · Oct 05 '23

I have only been trying prometheus_write. The memory bump seems to be around 4-6GB.

We are still seeing [prometheus_write_output:prom-remote] writing expired after 20s. Waiting for our monitoring team to help us troubleshoot their side and tweak the gnmic parameters even more.

Some config changes we have done. Until 15:54:

    buffer-size: 10000000
    max-time-series-per-write: 10000
    num-workers: 20
    interval: 5s
    timeout: 20s

15:54 - 16:07:

    buffer-size: 10000000
    max-time-series-per-write: 10000
    num-workers: 10
    interval: 1s
    timeout: 20s

16:07+ (at 16:25 I changed to jumbo frames only):

    buffer-size: 1000000
    max-time-series-per-write: 2000
    num-workers: 10
    interval: 1s
    timeout: 20s

The latest config gave us fewer timeouts, and we continuously have more total writes towards Prometheus than incoming gRPC messages.

Sparc0 · Oct 09 '23

I think having more writes towards prometheus than incoming gRPC messages makes sense. A single Subscribe Response can hold multiple updates and those updates could have JSON values that map to multiple prometheus metrics.

karimra · Oct 09 '23

Yeah, that's what I thought also. How do the workers work internally? E.g. are they working independently or working together to process and write messages?

Sparc0 · Oct 09 '23

Yes, workers are independent: they handle incoming messages by converting them to prometheus metrics and writing them to a buffer. Then there is a single goroutine that drains the buffer periodically, or when it reaches the max buffer size, and writes the metrics to the remote prometheus server. Timeouts happen when there is no worker available to handle messages, which implies that all workers are pending on writes to the buffer. That means the buffer is probably full because it's not being drained fast enough. This is, I think, confirmed by your config changes at 15:54: going from interval 5s to 1s allowed the number of messages written to prometheus to climb back above the number of received gRPC messages.
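
To make the description concrete, here is a minimal, self-contained Go sketch of that pattern: independent workers filling a shared buffer, and one goroutine draining it on an interval or when it fills up. The names and types are invented for illustration; this is not gNMIc's actual code.

package main

import (
	"fmt"
	"sync"
	"time"
)

type timeSeries struct {
	name  string
	value float64
}

// worker converts incoming messages into time series and pushes them onto the shared buffer.
func worker(in <-chan string, buf chan<- timeSeries, wg *sync.WaitGroup) {
	defer wg.Done()
	for msg := range in {
		buf <- timeSeries{name: msg, value: 1}
	}
}

// drainer is the single goroutine that flushes the buffer every interval,
// or as soon as maxBatch series are queued, and "writes" them to the remote end.
func drainer(buf <-chan timeSeries, interval time.Duration, maxBatch int, done chan<- struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	batch := make([]timeSeries, 0, maxBatch)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		fmt.Printf("writing %d time series to remote\n", len(batch)) // stand-in for the remote-write call
		batch = batch[:0]
	}
	for {
		select {
		case ts, ok := <-buf:
			if !ok { // buffer closed: flush what is left and stop
				flush()
				close(done)
				return
			}
			batch = append(batch, ts)
			if len(batch) >= maxBatch { // buffer size reached, triggering write
				flush()
			}
		case <-ticker.C: // write interval reached, writing to remote
			flush()
		}
	}
}

func main() {
	in := make(chan string)            // incoming gNMI-like messages
	buf := make(chan timeSeries, 1000) // the shared buffer all workers write into
	done := make(chan struct{})

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // num-workers
		wg.Add(1)
		go worker(in, buf, &wg)
	}
	go drainer(buf, time.Second, 500, done)

	for i := 0; i < 2000; i++ {
		in <- fmt.Sprintf("msg-%d", i)
	}
	close(in)
	wg.Wait()
	close(buf)
	<-done
}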

karimra · Oct 09 '23

Can that single goroutine that drains the buffer be a bottleneck? Just setting a low interval, e.g. 1s or even sub-second, does not feel like enough to drain the buffer fast enough. And a high buffer-size and max-time-series-per-write will cause the write duration to become higher than the interval.

It feels like I am not able to get more than 13-16k written messages to Prometheus per second. Having 10 workers or 200 workers also makes no noticeable difference in the output rate.

The Kafka workers don't use a shared buffer like prometheus, I guess? Since there is a metric created for each producer_id, I guess they take care of both the processing and the writing.

Sparc0 · Oct 09 '23

Can that single goroutine that drains the buffer be a bottleneck? Just setting a low interval, e.g. 1s or even sub-second, does not feel like enough to drain the buffer fast enough. And a high buffer-size and max-time-series-per-write will cause the write duration to become higher than the interval.

The reason there is a single goroutine writing to prometheus is the in-order requirement on the prometheus side (see here). If we use multiple workers, there is no way to guarantee that metrics belonging to the same series are sent in order.

It feels like I am not able to get more than 13-16k written messages to Prometheus per second. Having 10 workers or 200 workers also makes no noticeable difference in the output rate.

The workers are used on the input side; their number will not have an impact on the output performance. I would try a low interval and a max-time-series-per-write value close to the buffer size.
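
As a sketch, that suggestion applied to the prom-remote output from earlier in the thread could look like this (values illustrative):

outputs:
  prom-remote:
    type: prometheus_write
    url: https://collector-cortex/api/prom/push
    num-workers: 10                  # input side only: converts messages to metrics
    interval: 1s                     # drain the buffer often
    buffer-size: 10000
    max-time-series-per-write: 10000 # close to buffer-size so one drain can empty it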

The Kafka workers don't use a shared buffer like prometheus, I guess? Since there is a metric created for each producer_id, I guess they take care of both the processing and the writing.

The Kafka output workers use the same buffer, but there is no in-order requirement when writing to a Kafka topic. So we can use as many workers as needed.

karimra · Oct 09 '23

Can that single goroutine that drains the buffer be a bottleneck? Just setting a low interval, e.g. 1s or even sub-second, does not feel like enough to drain the buffer fast enough. And a high buffer-size and max-time-series-per-write will cause the write duration to become higher than the interval.

The reason there is a single goroutine writing to prometheus is the in-order requirement on the prometheus side (see here). If we use multiple workers, there is no way to guarantee that metrics belonging to the same series are sent in order.

There is no longer an in-order requirement.

Prometheus has experimental support for accepting out-of-order samples; it was added in 2.39.0 (https://github.com/prometheus/prometheus/discussions/11415). It still seems to be an experimental feature: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb

We are using Cortex at our company, and they introduced support for out-of-order samples in 1.15.0. Waiting for our prod env to be updated; that will probably solve some of my issues. https://github.com/cortexproject/cortex/releases/tag/v1.15.0
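
For plain Prometheus, enabling that experimental window should, if I read the linked docs correctly, look roughly like this in prometheus.yml (the 30m window is only an example value); Cortex has its own, separately configured setting for the same thing:

storage:
  tsdb:
    out_of_order_time_window: 30m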


Can the metric gnmic_prometheus_write_output_number_of_prometheus_write_msgs_sent_fail_total be improved/removed? Today there is one big label called reason that contains too much detailed info, and even a timestamp, which will cause high cardinality for that metric series if run over a long period of time.

In my case I have a lot of out-of-order samples, and the metrics to be scraped over the gnmic API just keep going up and up.

It does not have to be that detailed; just having the reason, the code and why it failed in the metric would be enough, e.g. out of order, duplicate sample for timestamp, timestamp too old.

The log message created should contain the more detailed info.

Like it's done for the metric grpc_client_handled_total. The log contains:

2023/10/10 17:11:37.854744 [gnmic] target "switch1": subscription interface_state rcv error: rpc error: code = OutOfRange desc = Maximum number of Subscribe requests reached
2023/10/10 17:11:37.855398 [gnmic] target "switch2": subscription interface_state rcv error: rpc error: code = Unavailable desc = JGrpcServer: Unknown RPC /gnmi.gNMI/Subscribe received

While the metric just counts up:

grpc_client_handled_total{grpc_code="InvalidArgument",grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 1227
grpc_client_handled_total{grpc_code="OutOfRange",grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 11365
grpc_client_handled_total{grpc_code="Unavailable",grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 1.308698e+06

Sparc0 · Oct 10 '23

Can that single goroutine that drains the buffer be a bottleneck? Just setting a low interval, e.g. 1s or even sub-second, does not feel like enough to drain the buffer fast enough. And a high buffer-size and max-time-series-per-write will cause the write duration to become higher than the interval.

The reason there is a single goroutine writing to prometheus is the in-order requirement on the prometheus side (see here). If we use multiple workers, there is no way to guarantee that metrics belonging to the same series are sent in order.

There is no longer an in-order requirement.

Prometheus has experimental support for accepting out-of-order samples; it was added in 2.39.0 (prometheus/prometheus#11415). It still seems to be an experimental feature: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb

We are using Cortex at our company, and they introduced support for out-of-order samples in 1.15.0. Waiting for our prod env to be updated; that will probably solve some of my issues. https://github.com/cortexproject/cortex/releases/tag/v1.15.0

OK, I wasn't aware that the requirement was removed, but considering that remote_write is used by other systems (cortex in your case) it makes sense to ignore it and allow for multiple writers. I will add a configurable number of workers on the write side.

Can the metric gnmic_prometheus_write_output_number_of_prometheus_write_msgs_sent_fail_total be improved/removed? Today there is one big label called reason that contains too much detailed info, and even a timestamp, which will cause high cardinality for that metric series if run over a long period of time.

In my case I have a lot of out-of-order samples, and the metrics to be scraped over the gnmic API just keep going up and up.

It does not have to be that detailed; just having the reason, the code and why it failed in the metric would be enough, e.g. out of order, duplicate sample for timestamp, timestamp too old.

The log message created should contain the more detailed info.

Like it's done for the metric grpc_client_handled_total. The log contains:

2023/10/10 17:11:37.854744 [gnmic] target "switch1": subscription interface_state rcv error: rpc error: code = OutOfRange desc = Maximum number of Subscribe requests reached
2023/10/10 17:11:37.855398 [gnmic] target "switch2": subscription interface_state rcv error: rpc error: code = Unavailable desc = JGrpcServer: Unknown RPC /gnmi.gNMI/Subscribe received

While the metric just counts up:

grpc_client_handled_total{grpc_code="InvalidArgument",grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 1227
grpc_client_handled_total{grpc_code="OutOfRange",grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 11365
grpc_client_handled_total{grpc_code="Unavailable",grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 1.308698e+06

Yep, will fix that. I just blindly put the error as a label without considering the different values.

karimra · Oct 10 '23

@Sparc0 if you build from #247, you can configure the number of writers with num-writers: 10. All the writers will be triggered periodically every interval. I could also add a randomization of the interval (±10%) to avoid all the writers writing to prometheus at the same time. When the buffer size reaches buffer-size, one of the writers will be triggered to drain the buffer and write to prometheus. There is still a single metadata writer that gets triggered every interval.
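
A sketch of how that could combine with the earlier settings (values illustrative, option names as described above):

outputs:
  prom-remote:
    type: prometheus_write
    url: https://collector-cortex/api/prom/push
    interval: 1s
    buffer-size: 1000000
    max-time-series-per-write: 10000
    num-workers: 10 # input-side message handlers
    num-writers: 10 # writers draining the buffer in parallel (from #247)
    timeout: 20s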

I also changed the gnmic_prometheus_write_output_number_of_prometheus_write_msgs_sent_fail_total metric label to take only a few values:

  • client_failure: Incremented if the client fails to connect
  • status_code=$status_code: Incremented if the response code is higher than 300
  • marshal_error: Incremented if the client fails to marshal the write request

karimra · Oct 10 '23

I compiled it to test the metric part only, while I am waiting to get access to our staging Cortex to test num-writers.

The metric I get is:

gnmic_prometheus_write_output_number_of_prometheus_write_msgs_sent_fail_total{reason="status_code=400"} 1234

Sparc0 · Oct 11 '23

I believe that's the out-of-order delivery error; it can happen with multiple workers as well. Maybe I should create a buffer per series and reorder before writing to the remote endpoint.
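
Not gNMIc code, just a rough Go sketch of what "a buffer per series, reordered before writing" could mean; the types and names are invented for illustration.

package main

import (
	"fmt"
	"sort"
)

type sample struct {
	ts    int64 // timestamp in unix nanoseconds
	value float64
}

// seriesBuffer groups samples by series key so each series can be
// sorted by timestamp before being handed to the remote writer.
type seriesBuffer struct {
	samples map[string][]sample
}

func newSeriesBuffer() *seriesBuffer {
	return &seriesBuffer{samples: make(map[string][]sample)}
}

func (b *seriesBuffer) add(seriesKey string, s sample) {
	b.samples[seriesKey] = append(b.samples[seriesKey], s)
}

// drain returns every buffered series with its samples in timestamp order
// and resets the buffer.
func (b *seriesBuffer) drain() map[string][]sample {
	out := b.samples
	b.samples = make(map[string][]sample)
	for _, ss := range out {
		sort.Slice(ss, func(i, j int) bool { return ss[i].ts < ss[j].ts })
	}
	return out
}

func main() {
	buf := newSeriesBuffer()
	buf.add(`up{instance="a"}`, sample{ts: 200, value: 1})
	buf.add(`up{instance="a"}`, sample{ts: 100, value: 1}) // arrives out of order
	for key, ss := range buf.drain() {
		fmt.Println(key, ss) // samples are now sorted per series
	}
}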

karimra · Oct 11 '23

I have been trying this pull request with multiple remote writers now, towards our staging env. I still see out-of-order samples, but not as many as before.

Will change this Cortex setting to 6h and see if that improves things.
(screenshot of the Cortex setting)

Sparc0 · Oct 16 '23

If it's a sliding window I think it should be enough to set it slightly higher than twice the prometheus_write output interval value.

karimra · Oct 16 '23

Hi there, first poster here. I have started testing a gNMIc cluster in Kubernetes. I also see out-of-order samples, and this is with very small amounts of metrics coming in.

lazyb0nes · Nov 11 '23

@lazyb0nes can you share which version of gnmic you are running and a snippet of the output config?

karimra · Nov 11 '23

@karimra hello, certainly. Currently running version 0.31.1, so maybe updating to the latest version should be the first thing I do? I've also seen something really odd in Grafana: the metric 'gnmic_interfaces_interface_state_counters_out_octets' has only 1 entry if I look in Mimir, but I still see dual entries in the legend in Grafana. Can this be related to the out-of-order samples?

outputs:
  output1:
    type: prometheus_write
    url: https://mimir.example.com/api/v1/push
    headers:
      tenantid: 'gnmictest'
    tls:
      skip-verify: true
    interval: 10s
    buffer-size: 1000
    max-time-series-per-write: 500
    max-retries: 0
    metadata:
      include: false
      interval: 60s
      max-entries-per-write: 500
    metric-prefix: "gnmic"
    append-subscription-name: false 
    strings-as-labels: true
    timeout: 10s
    debug: true 
    add-target: 
    target-template:
    event-processors: 

lazyb0nes · Nov 11 '23

@karimra Slight update: after updating to 0.34.2 I am, as far as I can see, not seeing any out-of-order samples anymore. What I am seeing now, however, is this:

failed to convert message to event: invalid character 'i' in numeric literal

What could be causing this?

lazyb0nes · Nov 14 '23

probably same as: https://github.com/openconfig/gnmic/issues/240

karimra · Nov 14 '23

probably same as: #240

So the interesting thing here is that we see that error, but we also see the correct values in Prometheus. So it's handled, at least to a certain point.

@karimra thank you for all the good work that's been done with gNMIc. So far I am pushing over 90,000 metrics/s into it and I haven't seen many issues, apart from Juniper not presenting the data in a very usable manner at times.

lazyb0nes · Nov 14 '23

It seems Juniper sends correct float values as well as -inf. So the float values show up properly in Prometheus.

90k/s is probably the biggest rate I have heard of someone pushing into gNMIc. Do you mind sharing a bit more detail about your setup? If not here, we can do that via email or a Zoom call. I'm interested in what people are doing with gNMIc to figure out where it can be further optimized.

karimra · Nov 14 '23

Sure, I'd be happy to take it over a call. This is also still very much a proof of concept; when I go into full production I will probably have around 2000-2500 devices in total, Juniper and Cisco mainly.

lazyb0nes · Nov 14 '23

I would also be interested to know, and to share our setup and learn how you might have done it. Our busiest collector today has 900 devices loaded and 17k metrics/s incoming, with only Juniper devices.

Sparc0 · Nov 16 '23

@Sparc0 I don't know what the best way to discuss this would be; let me know and we can have a talk and discuss it all. Ideally it would be nice if the community as a whole could benefit :)

lazyb0nes · Nov 20 '23