
Improve handling of ClusterEvents for big clusters

Open fpetersen-gl opened this issue 3 months ago • 8 comments

For bigger Graylog clusters, the processing performance for ClusterEvents should be improved.

This stems from a support request where a customer runs 3 MongoDB nodes (7.0.18-11 on RHEL 8.10) for 173 Graylog instances (v6.1.4, also on RHEL 8.10). They are experiencing many slow queries in MongoDB, dominated by queries on the cluster_events collection. The MongoDB nodes have 16 CPUs and 64 GB RAM each.

Here are two examples for slow queries:

Slow log 1

```json
{ "t": { "$date": "2025-10-24T13:37:42.959+08:00" }, "s": "I", "c": "COMMAND", "id": 51803, "ctx": "conn5054", "msg": "Slow query", "attr": { "type": "command", "ns": "graylog.cluster_events", "command": { "find": "cluster_events", "filter": { "consumers": { "$nin": [ "15419580-f7db-44ed-8d5d-c87ca2e9bd6f" ] } }, "sort": { "timestamp": 1 }, "$db": "graylog", "$clusterTime": { "clusterTime": { "$timestamp": { "t": 1761284261, "i": 553 } }, "signature": { "hash": { "$binary": { "base64": "NSKrc4o7iShdqOERlz9qYiqQKLs=", "subType": "0" } }, "keyId": 7517607946839130698 } }, "lsid": { "id": { "$uuid": "d633d1f6-90e2-4564-ab96-e29fc5aea001" } }, "$readPreference": { "mode": "secondaryPreferred" } }, "planSummary": "IXSCAN { timestamp: 1, producer: 1, consumers: 1 }", "planningTimeMicros": 14826, "keysExamined": 165042, "docsExamined": 954, "nBatches": 1, "cursorExhausted": true, "numYields": 168, "nreturned": 0, "queryHash": "D54F9547", "planCacheKey": "B0C9ACF8", "queryFramework": "classic", "reslen": 236, "locks": { "FeatureCompatibilityVersion": { "acquireCount": { "r": 169 } }, "Global": { "acquireCount": { "r": 169 } } }, "readConcern": { "level": "local", "provenance": "implicitDefault" }, "storage": { "data": { "bytesRead": 7 } }, "cpuNanos": 116452705, "remote": "10.91.142.79:50776", "protocol": "op_msg", "durationMillis": 1396 } }
```
Slow log 2

```json
{ "t": { "$date": "2025-11-04T13:37:30.538+08:00" }, "s": "I", "c": "COMMAND", "id": 51803, "ctx": "conn6762", "msg": "Slow query", "attr": { "type": "command", "ns": "graylog.cluster_events", "command": { "find": "cluster_events", "filter": { "consumers": { "$nin": [ "3fbc1652-ad22-4b9f-a16e-9a13e9fe4c2c" ] } }, "sort": { "timestamp": 1 }, "$db": "graylog", "$clusterTime": { "clusterTime": { "$timestamp": { "t": 1762234650, "i": 419 } }, "signature": { "hash": { "$binary": { "base64": "cG1gTUPYCBroyuTMUeBAzI7zutg=", "subType": "0" } }, "keyId": 7517607946839130698 } }, "lsid": { "id": { "$uuid": "2eab1e96-fdc9-47b6-9a9a-180e1536aab1" } }, "$readPreference": { "mode": "secondaryPreferred" } }, "planSummary": "IXSCAN { timestamp: 1, producer: 1, consumers: 1 }", "planningTimeMicros": 98, "keysExamined": 119024, "docsExamined": 688, "nBatches": 1, "cursorExhausted": true, "numYields": 119, "nreturned": 0, "queryHash": "D54F9547", "planCacheKey": "B0C9ACF8", "queryFramework": "classic", "reslen": 236, "locks": { "FeatureCompatibilityVersion": { "acquireCount": { "r": 120 } }, "Global": { "acquireCount": { "r": 120 } } }, "readConcern": { "level": "local", "provenance": "implicitDefault" }, "storage": { "data": { "bytesRead": 1 } }, "cpuNanos": 102068077, "remote": "10.91.142.96:38818", "protocol": "op_msg", "durationMillis": 102 } }
```

Analysis

The most important parts of the slow logs are the following bits:

      "planSummary": "IXSCAN { timestamp: 1, producer: 1, consumers: 1 }",
      "keysExamined": 119024,
      "docsExamined": 688,
      "nreturned": 0,

Meaning: the query uses an index (which is generally good), but it scans some 119k index entries (keysExamined), then checks 688 documents (docsExamined) and returns no matching documents at all (nreturned). Ideally, these numbers would be very close to each other.

The responsible query, `collection.find(Filters.nin("consumers", nodeId.getNodeId())).sort(Sorts.ascending("timestamp"))`, which can be found here, looks for events that do not have the current node's ID in the consumers field. Unfortunately, the "not in" operator cannot use this index efficiently because it is not selective: the query ends up checking every single entry in the index, which effectively amounts to a full collection scan.

As events are processed quickly by all nodes and processed events are only removed once a day, the periodical, which runs every second, spends a lot of time scanning the consumers field of each event, which contains all consumers that have already processed it. The more nodes the cluster has, the worse the performance gets.

Possible Solution

It would be best to avoid the $nin operator and replace it with $in or some other selective operator that makes better use of the index. But this is not easily done.

Ideas:

  • One could invert the logic and, for each new event, add all known nodes as missing_consumers (see the sketch after this list). Drawback: Graylog is pretty aggressive about disconnecting nodes from the cluster. How would such nodes still receive and process the event?
  • Have multiple (maybe 10?) fields for consumers. This would allow the DB to check these in parallel
  • Introduce another collection with a document per node, where processed events are stored. That way the query doesn't have to go through an array of ids. This obviously has to be cleaned up with the processed events to not take up too much space in the DB.
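
A minimal sketch of the first idea (field names like missing_consumers are assumptions, not existing code). Unlike $nin, the equality match is selective and can be served by a multikey index such as { missing_consumers: 1, timestamp: 1 }:

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.List;

public class InvertedClusterEvents {

    // Publishing: record every currently known node as a missing consumer.
    static void publish(MongoCollection<Document> events, Document payload, List<String> activeNodeIds) {
        events.insertOne(new Document("timestamp", System.currentTimeMillis())
                .append("payload", payload)
                .append("missing_consumers", activeNodeIds));
    }

    // Consuming: the equality match only touches index entries for this node,
    // instead of walking the whole index like the current $nin query does.
    static void consume(MongoCollection<Document> events, String nodeId) {
        for (Document event : events.find(Filters.eq("missing_consumers", nodeId))
                                     .sort(Sorts.ascending("timestamp"))) {
            // ... dispatch the event locally ...
            // Afterwards remove this node from the list so the event is not seen again.
            events.updateOne(Filters.eq("_id", event.get("_id")),
                             Updates.pull("missing_consumers", nodeId));
        }
    }
}
```

Cleanup could then delete events whose missing_consumers array is empty. The open question from the first bullet (nodes that get disconnected after the event was written) still needs an answer, e.g. a TTL on the events.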

fpetersen-gl avatar Nov 05 '25 12:11 fpetersen-gl

What about using a capped collection for cluster_events and using a tailable cursor? This way we would not need indices on the collection at all and reading new events is rather cheap. If we do not want to use a capped collection (e.g. because we do not know how to size it properly), we could also just track the timestamp of the most recently seen event per node. Both ways would also reduce writes on the cluster_events collection.
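
For reference, a rough sketch of how this could look with the sync Java driver (collection name and size are assumptions; a real implementation would also have to re-create the cursor when it dies, e.g. while the collection is still empty):

```java
import com.mongodb.CursorType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.CreateCollectionOptions;
import org.bson.Document;

public class TailClusterEvents {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("graylog");

            // Capped collection: fixed size, preserves insertion order, supports tailable cursors.
            // (Assumes the collection does not exist yet.)
            db.createCollection("cluster_events_capped",
                    new CreateCollectionOptions().capped(true).sizeInBytes(64 * 1024 * 1024));
            MongoCollection<Document> events = db.getCollection("cluster_events_capped");

            // TailableAwait blocks on the server until new documents arrive, so no
            // polling query (and no index) is needed to pick up new events.
            try (MongoCursor<Document> cursor = events.find()
                    .cursorType(CursorType.TailableAwait)
                    .iterator()) {
                while (cursor.hasNext()) {
                    Document event = cursor.next();
                    // ... dispatch the event locally ...
                    System.out.println(event.toJson());
                }
            }
        }
    }
}
```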

dennisoelkers avatar Nov 05 '25 13:11 dennisoelkers

Thanks for joining in! The idea with capped collections and tailable cursors sounds cool. You already mentioned sizing of the capped collection as a possible pitfall; I would also add that the inserted timestamp comes from the nodes themselves, so their clocks need to be carefully synced to make the ordering reliable.

fpetersen-gl avatar Nov 05 '25 14:11 fpetersen-gl

I would also add that the inserted timestamp comes from the nodes themselves, so their clocks need to be carefully synced to make the ordering reliable.

Does it have to be that way, or could the time from MongoDB be used instead?

boosty avatar Nov 06 '25 09:11 boosty

We should first reproduce the problem locally so we can actually measure the effect of any changes. We probably need to write some custom Java code; we don't want to start 170 Graylog servers. Once we have that, I would first check if it's enough to make the cleanup interval configurable so that the customer can reduce the default setting (currently one day) for their setup.

We need to backport any change, so we don't want to start with big changes.

Tailable cursors (as Dennis wrote) are another alternative we could test. That requires us to convert the cluster_events collection to a capped collection. That might have other drawbacks.

We can't use Change Streams by default, because those require a replica set. But it might be feasible to use Change Streams in environments that have a replica set and fall back to the existing method for environments without a replica set.
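
For illustration, watching the collection with a Change Stream would look roughly like this (a sketch; it only works against a replica set, and the connection string is made up):

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;

import java.util.List;

public class WatchClusterEvents {
    public static void main(String[] args) {
        try (var client = MongoClients.create("mongodb://localhost:27017/?replicaSet=rs0")) {
            MongoCollection<Document> events =
                    client.getDatabase("graylog").getCollection("cluster_events");

            // React to inserts as they happen; no periodic $nin query needed.
            for (ChangeStreamDocument<Document> change :
                    events.watch(List.of(Aggregates.match(Filters.eq("operationType", "insert"))))) {
                Document event = change.getFullDocument();
                // ... dispatch the event locally ...
                System.out.println(event.toJson());
            }
        }
    }
}
```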

But first, we should be able to reproduce the issue and measure any improvements.
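
A possible starting point for such a reproduction (a sketch that assumes nothing about the existing codebase): seed cluster_events with documents that already list ~170 fake consumers, then run the periodical's $nin query with explain() so keysExamined vs. nreturned can be compared before and after any change:

```java
import com.mongodb.ExplainVerbosity;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class ReproduceSlowClusterEventsQuery {
    public static void main(String[] args) {
        try (var client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("graylog").getCollection("cluster_events");

            // Same index shape as in the customer's plan summary.
            events.createIndex(Indexes.ascending("timestamp", "producer", "consumers"));

            // Simulate ~170 nodes that have already consumed every event.
            List<String> consumers = new ArrayList<>();
            for (int i = 0; i < 170; i++) consumers.add(UUID.randomUUID().toString());

            List<Document> docs = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {
                docs.add(new Document("timestamp", System.currentTimeMillis() - i)
                        .append("producer", consumers.get(i % consumers.size()))
                        .append("consumers", consumers));
            }
            events.insertMany(docs);

            // The query used by the periodical: every index entry has to be examined.
            String myNodeId = UUID.randomUUID().toString();
            Document plan = events.find(Filters.nin("consumers", myNodeId))
                    .sort(Sorts.ascending("timestamp"))
                    .explain(ExplainVerbosity.EXECUTION_STATS);
            System.out.println(plan.toJson()); // compare keysExamined vs. nreturned
        }
    }
}
```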

bernd avatar Nov 06 '25 09:11 bernd

I would also add that the inserted timestamp comes from the nodes themselves, so their clocks need to be carefully synced to make the ordering reliable.

Does it have to be that way, or could the time from MongoDB be used instead?

We should not rely on timestamps. There is no stable sort order for events that happen at the same time.

bernd avatar Nov 06 '25 09:11 bernd

Original support request: Graylog2/support#358

fpetersen-gl avatar Nov 06 '25 10:11 fpetersen-gl

I fully support @bernd's suggestion to make this reproducible and to attempt conservative fixes first, but I just want to document two things for later, in case we need to apply a more radical approach:

Does it have to be that way, or could the time from MongoDB be used instead?

We implemented that for our node heartbeat quite a while ago and it works well. We solved the issue of drifting node clocks with that.
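
For later reference, the core of it is letting MongoDB set the timestamp via $currentDate instead of sending a node-generated one. A generic sketch, not the actual heartbeat code (collection and field names are made up):

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ServerSideTimestamp {
    public static void main(String[] args) {
        try (var client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> heartbeats =
                    client.getDatabase("graylog").getCollection("node_heartbeats_example");

            // $currentDate makes the MongoDB server set "last_seen", so drifting
            // clocks on the Graylog nodes cannot affect the stored timestamp.
            heartbeats.updateOne(Filters.eq("node_id", "example-node-id"),
                    Updates.currentDate("last_seen"),
                    new UpdateOptions().upsert(true));
        }
    }
}
```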

We should not rely on timestamps. There is no stable sort order for events that happen at the same time.

Ideally, we make handling of events idempotent. So, e.g. a new/deleted stream triggers refreshing the stream router, but for multiple events the outcome is the same, so ordering does not matter.
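
To illustrate with purely hypothetical names (this is not the existing handler code): an idempotent handler ignores the delta carried by the event and rebuilds the derived state from the database, so duplicates and reordering converge to the same result.

```java
// Hypothetical sketch of an idempotent event handler; the service and router
// interfaces are assumptions for illustration only.
public class StreamChangedEventHandler {

    interface StreamService { java.util.List<Object> loadAllStreams(); }
    interface StreamRouter { void replaceStreams(java.util.List<Object> streams); }

    private final StreamService streamService;
    private final StreamRouter streamRouter;

    public StreamChangedEventHandler(StreamService streamService, StreamRouter streamRouter) {
        this.streamService = streamService;
        this.streamRouter = streamRouter;
    }

    // Whatever the event says (created, updated, deleted), simply reload everything.
    // Handling the same event twice, or two events out of order, ends in the same state.
    public void handle(Object streamChangedEvent) {
        streamRouter.replaceStreams(streamService.loadAllStreams());
    }
}
```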

dennisoelkers avatar Nov 06 '25 12:11 dennisoelkers

Ideally, we make handling of events idempotent. So, e.g. a new/deleted stream triggers refreshing the stream router, but for multiple events the outcome is the same, so ordering does not matter.

Agreed! But if we wanted to use the timestamp as a cursor/offset, we would have the ordering issue.

bernd avatar Nov 06 '25 14:11 bernd