Processing endless loop in `java.util.regex.Pattern`
This issue comes from the community ( https://community.graylog.org/t/processing-stuck-with-all-processbufferprocessor-threads-used-up/13558/7 ).
Current Behavior
A GROK/REGEX pattern can send processing into an endless loop and stop Graylog from processing anything; it looks like all processing has stalled.
Steps to Reproduce (for bugs)
The problem can be reproduced as follows:
Use these patterns:
"name":"PROFTPDXFER_TIMESTAMP"
"pattern":"%{DAY} %{MONTH} %{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND} %{YEAR}"
"name":"PROFTPD_XFERLOG",
"pattern":"%{PROFTPDXFER_TIMESTAMP:proftpd_xfer_timestamp} %{DATA:proftpd_xfer_transfer-time:int} %{HOSTNAME:proftpd_xfer_remote-host} %{INT:proftpd_xfer_file-size} %{UNIXPATH:proftpd_xfer_filename} %{WORD:proftpd_xfer_transfer-type} %{WORD:proftpd_xfer_action-flag} %{WORD:proftpd_xfer_direction} %{WORD:proftpd_xfer_access-mode} %{USERNAME:proftpd_xfer_username} %{WORD:proftpd_xfer_service-name} %{INT:proftpd_xfer_auth-method} %{DATA:proftpd_xfer_auth-userid} %{WORD:proftpd_xfer_completion-status}"
Test the pattern PROFTPDXFER_TIMESTAMP with this value:
Thu Jan 23 07:51:21 2020 0 1.2.3.4 1234 /ftproot/stuff/Tmp/filexxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx{xxxx}xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xml b _ i r abcdef sftp 0 * c
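For reference, this is roughly how the same match can be driven directly through the io.krakens java-grok API that Graylog uses. This is only a sketch based on my understanding of that API; I compile the composite PROFTPD_XFERLOG pattern here so that the whole sample line, including the filename with the curly braces, is run through the regex (my assumption about where the backtracking is triggered):

```java
import io.krakens.grok.api.Grok;
import io.krakens.grok.api.GrokCompiler;
import io.krakens.grok.api.Match;

public class ReproduceGrokHang {
    public static void main(String[] args) {
        GrokCompiler compiler = GrokCompiler.newInstance();
        compiler.registerDefaultPatterns();

        // The two custom patterns from the steps above.
        compiler.register("PROFTPDXFER_TIMESTAMP",
                "%{DAY} %{MONTH} %{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND} %{YEAR}");
        compiler.register("PROFTPD_XFERLOG",
                "%{PROFTPDXFER_TIMESTAMP:proftpd_xfer_timestamp} %{DATA:proftpd_xfer_transfer-time:int} "
                + "%{HOSTNAME:proftpd_xfer_remote-host} %{INT:proftpd_xfer_file-size} "
                + "%{UNIXPATH:proftpd_xfer_filename} %{WORD:proftpd_xfer_transfer-type} "
                + "%{WORD:proftpd_xfer_action-flag} %{WORD:proftpd_xfer_direction} "
                + "%{WORD:proftpd_xfer_access-mode} %{USERNAME:proftpd_xfer_username} "
                + "%{WORD:proftpd_xfer_service-name} %{INT:proftpd_xfer_auth-method} "
                + "%{DATA:proftpd_xfer_auth-userid} %{WORD:proftpd_xfer_completion-status}");

        // The sample line from the steps above (note the braces inside the filename).
        String line = "Thu Jan 23 07:51:21 2020 0 1.2.3.4 1234 "
                + "/ftproot/stuff/Tmp/filexxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx{xxxx}"
                + "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xml b _ i r abcdef sftp 0 * c";

        Grok grok = compiler.compile("%{PROFTPD_XFERLOG}");
        Match match = grok.match(line); // never returns; one CPU core stays at 100% (see below)
        System.out.println(match.capture());
    }
}
```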
This causes io.krakens.grok.api.Grok.match() to never return. The pattern matching goes into a loop and uses up a CPU core. The actual loop occurs inside java.util.regex.Pattern, so it is not really a Graylog-specific problem; I think it is some kind of bug in the JRE classes. A pattern should either match or not, and not cause the parser to loop forever.
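For illustration only (this is not the expanded GROK pattern, just the textbook nested-quantifier case), a few lines show how plain java.util.regex.Pattern behaves once backtracking explodes; the runtime grows roughly exponentially with the input length until the call effectively never returns:

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Nested quantifiers: before the engine can report "no match" it tries
        // exponentially many ways to split the run of 'a's between the inner
        // and the outer '+'.
        Pattern pathological = Pattern.compile("(a+)+b");
        for (int n = 16; n <= 26; n += 2) {
            StringBuilder input = new StringBuilder();
            for (int i = 0; i < n; i++) {
                input.append('a');
            }
            input.append('!'); // no trailing 'b', so the match must fail

            long start = System.nanoTime();
            pathological.matcher(input).matches();
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(n + " chars -> " + millis + " ms");
        }
        // A handful of additional characters is enough to make the call run for
        // hours, which from the outside looks exactly like a stuck
        // ProcessBufferProcessor thread.
    }
}
```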
If you have multiple messages in a pipeline that trigger this problem, it will completely halt the pipeline processing in Graylog.
Context
From time to time we see users report that their Graylog is no longer processing anything and that a restart solves the problem. Educated guessing can help you find the problem, but that may not be possible in a multi-user/multi-admin/multi-tier environment.
We need to find a way to warn/alert about such situations, or even better solutions that we are currently unable to think of.
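One possible direction, purely as a sketch and not something Graylog does today as far as I know: wrap the input in a CharSequence that enforces a deadline, so a runaway match aborts with an exception that could be logged or alerted on instead of pinning a processing thread forever (DeadlineCharSequence is a hypothetical helper name):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Graylog or java-grok: a CharSequence that
// throws once a deadline has passed. java.util.regex.Pattern reads the input
// through charAt() while backtracking, so the check aborts a runaway match.
final class DeadlineCharSequence implements CharSequence {

    private final CharSequence delegate;
    private final long deadlineNanos;

    static DeadlineCharSequence withTimeout(CharSequence delegate, long timeoutMillis) {
        return new DeadlineCharSequence(delegate, System.nanoTime() + timeoutMillis * 1_000_000L);
    }

    private DeadlineCharSequence(CharSequence delegate, long deadlineNanos) {
        this.delegate = delegate;
        this.deadlineNanos = deadlineNanos;
    }

    @Override
    public char charAt(int index) {
        if (System.nanoTime() > deadlineNanos) {
            // This could be logged and turned into an alert instead of silently eating a core.
            throw new IllegalStateException("regex match exceeded timeout");
        }
        return delegate.charAt(index);
    }

    @Override
    public int length() {
        return delegate.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(delegate.subSequence(start, end), deadlineNanos);
    }

    public static void main(String[] args) {
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < 60; i++) {
            input.append('a');
        }
        input.append('!'); // no trailing 'b', so the pattern below cannot match

        Matcher m = Pattern.compile("(a+)+b")
                .matcher(DeadlineCharSequence.withTimeout(input, 1_000));
        try {
            System.out.println(m.matches()); // would otherwise run effectively forever
        } catch (IllegalStateException e) {
            System.out.println("aborted runaway match: " + e.getMessage());
        }
    }
}
```

The same idea could wrap the message field before it is handed to the regex/GROK matcher, turning a stuck thread into a warning that names the offending rule and message.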
Your Environment
- Graylog Version: 3.1
Hi,
Any progress on this issue?
We still occasionally hit a message that makes processing get stuck, and only a restart fixes it. That is doable on a smaller system, but on a system that ingests 30-40 thousand logs per second it is hell to fix and requires a ton of restarts or additional processing threads.