Negative values in pipeline rules throughput
Expected Behavior
Pipeline rule throughput values should always be non-negative and reflect the actual message rate.
Current Behavior
Throughput values for Pipeline rules toggle between positive and negative.
Steps to Reproduce
- You need a lot of traffic; we couldn't reproduce the issue ourselves, but the customer said he is available to help with reproducing it.
- The attached recording shows the whole problem: live recording.webm
- Create a Pipeline rule with a lot of traffic.
- Check the throughput values.
Context
Throughput values for Pipeline rules toggle from positive to negative - see the attached recording. This happens only on one pipeline, called 'Streamrouting', which was set up 7 years ago and has been running since. The customer has ~70 pipelines; they checked some of them but only see the negative values on 'Streamrouting'.
The customer is aware that the first click when navigating to the Pipeline rules page shows wrong values, but here the wrong values are displayed constantly. They toggle from positive to negative every few seconds without any other interaction with the page.
Values are correct on the Manage Pipelines >> Pipelines overview page - no negative throughput numbers appear there.
On one screenshot we see ~300 million msg/s under Throughput while only around ~11k messages are actually coming in.
Code of the rule with millions of messages:
rule "continue to next stage"
when
true
then
set_field("continue", true);
remove_field("continue");
end
Screenshot showing minus ~385 million msg/s under Throughput
Screenshot showing ~300 million msg/s under Throughput but only around ~11k coming in
The issue was discussed in Slack (https://graylog.slack.com/archives/C036LC4K744/p1717668058792819); from that thread it sounds like a known issue, but no one had submitted a bug report until now.
Customer Environment
Graylog Version: 5.2.7
OpenSearch Version: 7.10.2
MongoDB Version: 6.0.12
(created from Zendesk ticket #570)
gz#570
https://github.com/Graylog2/graylog2-server/issues/19696 may be related too?
My original theory was that a number was getting too large for its data type, e.g. a value greater than 2,147,483,647 overflowing a signed 32-bit integer. I'm not sure if that is relevant, though.
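For illustration only (a hedged sketch, not a claim about where such a narrowing would happen in Graylog's code), this is what that overflow looks like when a large counter is coerced to a signed 32-bit integer:

// Minimal sketch of the signed 32-bit overflow theory; illustrative only.
// Any path that narrows a large counter to an Int32 (a bitwise op in the
// frontend, or a 32-bit field on the backend) wraps values past 2,147,483,647.
const total = 2_147_483_648;   // one past the signed 32-bit maximum
const narrowed = total | 0;    // bitwise OR coerces to a signed Int32 in JS/TS
console.log(total);            // 2147483648
console.log(narrowed);         // -2147483648: the sign flips on overflow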
The other thing I am noticing, which I can reproduce somewhat consistently, is that Graylog shows a much larger value initially before showing the accurate metric. I suspect it is showing the "total" amount accumulated so far rather than the accurate rate.
For example, I can sometimes get my pipeline to display a number of about 7 million, which matches up with the metric's total value:
{
  "full_name": "org.graylog.plugins.pipelineprocessor.ast.Pipeline.62976aa4578cf42110255552.executed",
  "metric": {
    "rate": {
      "total": 7189178,
      "mean": 88.88558165055522,
      "five_minute": 111.75158774985267,
      "fifteen_minute": 114.95970546937819,
      "one_minute": 102.30150129886736
    },
    "rate_unit": "events/second"
  },
  "name": "executed",
  "type": "meter"
}
In the video below I am going back/forward in the browser to trigger this behavior:
https://videos.graylog.com/watch/VTdQfbzHSvgFAA6n2r4AM1?
It's not clear if this is the same issue or not. It also appears the metric rate calculation is being done on the front end?
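If the displayed rate is indeed derived client-side from successive samples of the meter's "total", here is a hedged sketch (hypothetical names, not Graylog's actual API) of how both symptoms could fall out of that:

// Hypothetical sketch only; not Graylog's actual implementation. Assumes a
// UI that derives msg/s from the delta between successive `rate.total` samples.
type MeterSample = { total: number; timestampMs: number };

function deriveRate(prev: MeterSample | undefined, curr: MeterSample): number {
  if (!prev) {
    // First sample after navigation: with nothing to diff against, showing the
    // lifetime total (or dividing it by a near-zero interval) yields an
    // absurdly large "rate" on an old, busy pipeline.
    return curr.total;
  }
  const deltaMessages = curr.total - prev.total;
  const deltaSeconds = (curr.timestampMs - prev.timestampMs) / 1000;
  // If samples arrive out of order, come from different nodes, or the counter
  // resets, deltaMessages goes negative, and so does the displayed throughput.
  return deltaMessages / deltaSeconds;
}

If something like this is happening it would explain both the huge initial value and the periodic sign flips without the backend counters ever being negative, but that is speculation until someone checks where the calculation actually lives.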
@waab76 Can we pls get an update on this issue?
@tellistone Any idea when the Core Team might have appetite for this?