
Some logs are missing

vascoosx opened this issue 6 years ago • 5 comments

Describe the bug

In my current setup, logs go to Papertrail via syslog+TLS and to a GCP instance via HTTPS, which then forwards them to Stackdriver. Some logs that are present in Papertrail cannot be found in Stackdriver.

To Reproduce

Logs are initially sent through Heroku's log drain. They first go to an nginx server acting as a proxy, then to Fluentd, which sends them to Stackdriver.

Expected behavior

Every log that appears in Papertrail should also appear in Stackdriver.

Your Configuration

client setting:

<source>
  @type http
  tag <tag name>
  <parse>
    @type regexp
    expression /^.*<\S+>\d (?<time>\S+) host app web.1 - (?<severity>.), (?<message>.*)$/
  </parse>
  port <port>
  bind 0.0.0.0
  add_remote_addr https://<url>
</source>

Your Error Log

# /var/log/google-fluentd/google-fluentd.log

2019-09-01 06:25:01 +0000 [info]: #0 flushing all buffer forcedly
2019-09-01 06:25:01 +0000 [info]: #0 detected rotation of /var/log/nginx/access.log; waiting 5 seconds
2019-09-01 06:25:01 +0000 [info]: #0 following tail of /var/log/nginx/access.log
2019-09-01 06:25:01 +0000 [info]: #0 detected rotation of /var/log/syslog; waiting 5 seconds
2019-09-01 06:25:01 +0000 [info]: #0 following tail of /var/log/syslog

(no errors were found in the nginx logs)

Additional context

Agent version: google-fluentd 1.4.2
OS: Ubuntu 18.04

The text below is a portion of the logs. Asterisks denote the logs that were missing from Stackdriver.

05:54:11.468349
05:54:11.474820
05:54:11.477478 *
05:54:11.481780 *
05:54:11.484050 *
05:54:11.485974 *
05:54:11.488010 *
05:54:11.491051 *
05:54:11.492902 *
05:54:11.495263 *
05:54:11.497550 *
05:54:11.498517 *
05:54:11.499052 *
05:54:12.163430
05:54:12.272951
05:54:12.298832 * 
05:54:12.304858 *
05:54:12.307521 *
05:54:12.309893 *
05:54:12.310037 *
05:54:12.311776 *
05:54:12.313578 *
05:54:12.315410 *
05:54:12.317899 *
05:54:12.319555 *
05:54:12.321456 *
05:54:12.323302 *
05:54:12.323988 *
05:54:12.324458 *
05:54:12.796234
05:54:12.916607

vascoosx avatar Sep 02 '19 04:09 vascoosx

Is this a fluentd core bug? Are the logs lost inside fluentd itself or in a third-party plugin?

repeatedly avatar Sep 02 '19 04:09 repeatedly

We can't set up GCP or other cloud services. Could you reproduce the issue in a simpler environment, e.g. on a single Linux server?
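
One way to narrow it down without a cloud environment is to duplicate the ingested events to a local file and diff that file against what arrives in Stackdriver: events missing from the local copy never made it through the http source/parser, while events present locally but absent from Stackdriver point at the output or buffering path. A minimal sketch, assuming a hypothetical tag pattern and file path, with google_cloud standing in for the existing Stackdriver output:

<match heroku.**>
  @type copy
  <store>
    # write every event that passed the http source and parser to disk
    @type file
    path /var/log/fluent/ingest-copy
  </store>
  <store>
    # the existing Stackdriver output would remain here (plugin name assumed)
    @type google_cloud
  </store>
</match>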

repeatedly avatar Sep 02 '19 04:09 repeatedly

Thank you. I'll try reproducing it in a simpler environment. Meanwhile, could you tell me whether there is any specification for the maximum throughput of the http source? Wherever the issue stems from, it seems to be a load-related issue.

vascoosx avatar Sep 02 '19 06:09 vascoosx

Meanwhile, could you tell me whether there is any specification for the maximum throughput of the http source?

I'm not sure, because it depends on machine spec, format, and more. The official documentation mentions one example: https://docs.fluentd.org/input/http#handle-large-data-with-batch-mode
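
For reference, the batch mode described there amounts to sending several events per request instead of one request per event. A rough illustration against a default in_http endpoint (port and tag are placeholders, and it assumes the default JSON handling rather than the regexp parser shown earlier):

curl -X POST -H 'Content-Type: application/json' \
  -d '[{"message":"first"},{"message":"second"},{"message":"third"}]' \
  http://localhost:9880/app.log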

repeatedly avatar Sep 05 '19 21:09 repeatedly

We have a similar problem when using the splunk_hec plugin to forward messages to an external Splunk installation via the Splunk heavy forwarder.

We have noticed that when the problem manifests, we see this error in the fluent log:

2019-08-20 14:05:44 +0000 [info]: Worker 0 finished unexpectedly with signal SIGKILL

If the worker is killed, I suspect all of the messages that were in the queue are lost. Is this a correct assumption? We're not currently configured to handle overflow conditions (for example, by backing the buffer with a file). We lost three days' worth of messages that had yet to be funneled over to Splunk when this happened.

Looking for clarification to help determine whether it's fluentd or the plugin that is at fault.

vguaglione avatar Sep 17 '19 14:09 vguaglione

Sorry for the delay.

@vguaglione

If the worker is killed, I suspect all of the messages that were in the queue are lost. Is this a correct assumption?

You can use a file buffer. Log loss due to the process being forcibly killed cannot be completely prevented, but it can be minimized.
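
A minimal sketch of such a file buffer on the Splunk output mentioned above (plugin name, tag pattern, and paths are assumptions; connection options are omitted; the buffer parameters are standard Fluentd v1 options):

<match your.tag.**>
  @type splunk_hec           # output plugin named earlier in this thread; connection options omitted
  <buffer>
    @type file               # persist queued chunks on disk instead of in memory
    path /var/log/fluent/buffer/splunk
    flush_interval 10s       # how often staged chunks are flushed to the output
    retry_forever true       # keep retrying failed chunks rather than discarding them
    overflow_action block    # apply back-pressure instead of erroring when the buffer fills
  </buffer>
</match>

With a file buffer, chunks already staged or queued on disk are retried after the worker restarts, so a forced kill typically loses at most the data being appended at that moment.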

daipom avatar Apr 28 '23 01:04 daipom

@vascoosx I will close this issue, as there has been no update for a while.

If you are still experiencing this problem and know anything about how to reproduce it, please re-open.

daipom avatar Apr 28 '23 01:04 daipom