
crashes cause bad chunks that fail to send

Open notslang opened this issue 8 years ago • 11 comments

I had a power failure and upon restarting the server, this happened:

2017-07-18 18:05:27 +0000 [info]: adding match pattern="**" type="null"
2017-07-18 18:05:27 +0000 [info]: adding source type="forward"
2017-07-18 18:05:27 +0000 [info]: #0 starting fluentd worker pid=20 ppid=16 worker=0
2017-07-18 18:05:31 +0000 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-07-18 18:05:31 +0000 [info]: #0 [input1] listening port port=24224 bind="0.0.0.0"
2017-07-18 18:05:31 +0000 [info]: #0 fluentd worker is now running worker=0
2017-07-18 22:25:08 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `dup'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `send_data_actual'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:631:in `send_data'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `block in try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:304:in `block in select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `times'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1034:in `try_flush'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1268:in `flush_thread_run'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:420:in `block (2 levels) in start'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-07-18 22:25:09 +0000 [warn]: #0 retry succeeded. chunk_id="5549b016b0815316776e7f21d3a793d8"
2017-07-18 22:25:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:10 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:10 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2017-07-18 22:25:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2017-07-18 22:25:13 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:13 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2017-07-18 22:25:17 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:17 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2017-07-18 22:25:25 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:25 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:41 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2017-07-18 22:25:41 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:41 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:26:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2017-07-18 22:26:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:26:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:27:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2017-07-18 22:27:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:27:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:29:15 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2017-07-18 22:29:15 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:29:15 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:33:24 +0000 [warn]: #0 failed to flush the buffer. retry_time=9 next_retry_seconds=2017-07-18 22:33:24 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:33:24 +0000 [warn]: #0 suppressed same stacktrace

Given that this hasn't happened in the last month, and that the chunk it references (buffer.q5541970ab3b8c59d8f012cd3222cc93c.log) was created right around the time the power failed, I think the chunk was corrupted by an incomplete write. This is a problem because the error message makes no sense, and a crash shouldn't cause corruption or data loss for data that was already written to disk.

It doesn't even start sending the other chunks that aren't corrupted; I have to shut down fluentd, remove the bad chunk, and restart it to get it to continue sending.

I can send you the bad chunk and meta file if you want.

notslang · Jul 20 '17

around the time the power failed

Does this mean the server running fluentd hit an unexpected machine shutdown? Hmm... in such cases, automatic recovery is hard, so fluentd would need to check the files, and corrupted chunks should be routed somewhere else, e.g. a local file. Is the corrupted file the metadata file or the actual chunk file?
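
(Editor's note: for getting failed data onto local disk today, a minimal sketch using the existing <secondary> section with secondary_file follows. This only kicks in once retries are exhausted and is not the automatic corrupted-chunk handling discussed here; the hosts and paths are examples, not from this thread.)

<match **>
  @type forward
  <server>
    host 192.168.1.1                      # example upstream host
    port 24224
  </server>
  <buffer>
    @type file
    path /var/log/fluent/forward-buffer   # example buffer path
  </buffer>
  # chunks that exhaust their retries are handed to the secondary output
  <secondary>
    @type secondary_file
    directory /var/log/fluent/failed      # example dump location
    basename forward-failed
  </secondary>
</match>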

repeatedly · Jul 21 '17

Does this mean the server running fluentd hit an unexpected machine shutdown?

Yes - the power cut off, so nothing had time to shut down.

I don't know if it's the metadata file or the actual chunk file. However, it shouldn't be too hard to just omit whatever incomplete data was written to the end of the file, right?

I don't expect fluentd to save data that was submitted between the last successful flush to the file buffer and the crash. However, that shouldn't cause the entire chunk to fail, and the error/warning printed when fluentd encounters bad data should be better... If I didn't know about the power failure, I would be really confused.

notslang · Jul 21 '17

I got this error again after running out of disk space on a node.

notslang · Sep 10 '17

I'm experiencing this issue as well. It's occurring on fluentd restart, not a machine crash. It doesn't happen consistently, but is happening frequently.

mchesler · Nov 27 '17

@mchesler Does this mean you hit a disk space issue and it caused a broken chunk?

BTW, I will add a backup feature to move bad chunks to another place.

repeatedly · Nov 28 '17

@repeatedly None of the machines where this has happened to me have come anywhere close to filling their disks; they've simply had the fluentd process restarted.

mchesler · Nov 28 '17

Experiencing the same; the fluentd processes were OOMKilled beforehand. @repeatedly did you have any chance to work on the "move bad chunks to another place" feature, or could you point me to the code handling this?

johanneswuerbach · Mar 09 '18

I am using the Elasticsearch plugin with a file buffer to send data to ES. When indexing a document into Elasticsearch raises an exception, the chunk is retained for retry, but the retry can never succeed. Because of the repeated retry attempts, the other chunks are not processed, which blocks the data flow. Is there a way to clear or move the bad chunk?

kumaravel29 · May 01 '18

+1, we experience the exact same issue as @kumaravel29. The problem is actually so pervasive that we're in the process of ripping td-agent out of our infrastructure.

naegelin · Jun 11 '18

@kumaravel29 try using tags: https://github.com/uken/fluent-plugin-elasticsearch#retry_tag
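
(Editor's note: a minimal sketch of this retry_tag suggestion, assuming the standard fluent-plugin-elasticsearch parameters; the host, tag names, and buffer path are illustrative, not from this thread.)

<match app.**>
  @type elasticsearch
  host es.example.com          # example host
  port 9200
  retry_tag retry.es           # failed chunks are re-emitted under this tag
  <buffer>
    @type file
    path /var/log/fluent/es-buffer   # example buffer path
  </buffer>
</match>

# a separate match for the retried records, so a bad chunk does not
# keep blocking the primary pipeline
<match retry.es>
  @type elasticsearch
  host es.example.com
  port 9200
</match>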

ku-s-h · Oct 24 '18

FYI: since v1.2.0 (https://github.com/fluent/fluentd/blob/master/CHANGELOG.md#release-v120---20180430), output plugins support backing up broken chunks. It may help you.
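
(Editor's note: a minimal sketch of enabling that backup location. When root_dir is set in the <system> section, fluentd moves chunks that hit unrecoverable errors into a backup directory beneath it; the path below is an example.)

<system>
  # broken/unrecoverable chunks are written under <root_dir>/backup/
  root_dir /var/log/fluent
</system>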

kenhys · Oct 29 '20