graylog2-server

Indexer failures should produce more information for root cause analysis

Open mikkolehtisalo opened this issue 1 year ago • 3 comments

What?

Indexer failure messages in the UI look something like this:

2 hours ago techlog_52 c5f5e982-287f-11ef-954a-00505687ab33 OpenSearchException[OpenSearch exception [type=mapper_parsing_exception, reason=failed to parse field [level] of type [long] in document with id 'c5f5e982-287f-11ef-954a-00505687ab33'. Preview of field's value: 'Information']]; nested: OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=For input string: "Information"]];

This is not really helpful for resolving the issue. If you have a large number of servers, systems, and components, the problem could lie in any of numerous log-producing components, each with a different responsible team. It is impossible to start diagnostics when you don't even know which team to start with.

It seems OpenSearch doesn't log the issue from the example message above at all. It would apparently appear only at debug logging level, which is simply not feasible when you receive a huge volume of logs. Graylog should be the component that produces the extra information.
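To make the gap concrete, here is a minimal sketch of what can be recovered from the failure string itself. It assumes the exact message format of the example above; the regular expression and the function name `parse_indexer_failure` are illustrative and not part of Graylog or OpenSearch.

```python
import re

# Pattern written against the single example message in this issue.
# Other OpenSearch failure types will simply not match.
FAILURE_RE = re.compile(
    r"type=(?P<error_type>\w+), "
    r"reason=failed to parse field \[(?P<field>[^\]]+)\] "
    r"of type \[(?P<expected_type>[^\]]+)\] "
    r"in document with id '(?P<doc_id>[^']+)'\. "
    r"Preview of field's value: '(?P<value>[^']*)'"
)


def parse_indexer_failure(message):
    """Return the details embedded in a failure string, or None if no match."""
    match = FAILURE_RE.search(message)
    return match.groupdict() if match else None


example = (
    "OpenSearchException[OpenSearch exception [type=mapper_parsing_exception, "
    "reason=failed to parse field [level] of type [long] in document with id "
    "'c5f5e982-287f-11ef-954a-00505687ab33'. Preview of field's value: "
    "'Information']]; nested: OpenSearchException[OpenSearch exception "
    "[type=illegal_argument_exception, reason=For input string: "
    "\"Information\"]];"
)

details = parse_indexer_failure(example)
print(details)
```

The result contains the field name, the expected type, the document id, and a preview of the value, but nothing that identifies the sender, the input, or the stream.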

Alternatives:

  • Add sender's IP to the indexer failure messages and UI (probably enough, somewhat easy to implement)
  • Add logging of the message to the server logs (probably easy, and also enough for system admins)
  • Revive the dead letter implementation (complex, most convenient for system admins)

See MessagesAdapterOS2 for clues. The offending message, at least, should be available in most cases.
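As a rough illustration of the first two alternatives, the sketch below builds a log line that combines the failure with sender details taken from the offending message. It is not Graylog code (Graylog is written in Java), and everything in it is an assumption: the helper names `describe_failure` and `log_failure` are hypothetical, the message is assumed to be available as a plain dict of fields at the point of failure, and the fields `source`, `gl2_remote_ip`, and `gl2_source_input` are assumed to be present on it.

```python
import logging

log = logging.getLogger("indexer_failures")

# Fields assumed to identify the sender; adjust to whatever the message carries.
SENDER_FIELDS = ("source", "gl2_remote_ip", "gl2_source_input")


def describe_failure(index, message_id, error, message_fields):
    """Build one log line that ties an indexing failure to its sender."""
    sender = " ".join(
        f"{name}={message_fields.get(name, 'unknown')}" for name in SENDER_FIELDS
    )
    return f"Indexing failed: index={index} id={message_id} {sender} error={error}"


def log_failure(index, message_id, error, message_fields):
    """Write the enriched failure line to the server log (hypothetical hook)."""
    log.warning(describe_failure(index, message_id, error, message_fields))


line = describe_failure(
    "techlog_52",
    "c5f5e982-287f-11ef-954a-00505687ab33",
    "mapper_parsing_exception",
    {"source": "web01", "gl2_remote_ip": "10.0.0.5"},  # made-up sample values
)
print(line)
```

Missing fields are reported as `unknown` rather than dropped, so an administrator can tell at a glance which identifying details were not available for a given failure.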

Why?

The current indexer failures view doesn't provide the basic information required to resolve the issues. In more complex environments, it is not possible to resolve indexer failures at all.

Your Environment

n/a

mikkolehtisalo avatar Jun 12 '24 08:06 mikkolehtisalo

Hi mikkolehtisalo

I think the info you seek is already available via the "Processing and Indexing Failures" index.

If I navigate to System > Overview, to the Indexing error section and hit "show errors"

image

And look at the failed messages - I can see the cause, the source, the associated stream (and thus index) etc. The only info missing is the associated input:

image

This is enabled via System > Configuration, here:

image

Does this provide the info you need?

tellistone avatar Jun 13 '24 10:06 tellistone

The failure processing plugin doesn't seem to exist on my system.

image

mikkolehtisalo avatar Jun 20 '24 07:06 mikkolehtisalo

May I ask which version of Graylog you are running, and whether it is Open or Enterprise?

tellistone avatar Jun 20 '24 17:06 tellistone