
crashes cause bad chunks that fail to send

Open notslang opened this issue 8 years ago • 11 comments

I had a power failure and upon restarting the server, this happened:

2017-07-18 18:05:27 +0000 [info]: adding match pattern="**" type="null"
2017-07-18 18:05:27 +0000 [info]: adding source type="forward"
2017-07-18 18:05:27 +0000 [info]: #0 starting fluentd worker pid=20 ppid=16 worker=0
2017-07-18 18:05:31 +0000 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2017-07-18 18:05:31 +0000 [info]: #0 [input1] listening port port=24224 bind="0.0.0.0"
2017-07-18 18:05:31 +0000 [info]: #0 fluentd worker is now running worker=0
2017-07-18 22:25:08 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `dup'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:614:in `send_data_actual'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:631:in `send_data'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `block in try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:304:in `block in select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `times'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:298:in `select_a_healthy_node'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:285:in `try_write'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1034:in `try_flush'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:1268:in `flush_thread_run'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/output.rb:420:in `block (2 levels) in start'
  2017-07-18 22:25:08 +0000 [warn]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-07-18 22:25:09 +0000 [warn]: #0 retry succeeded. chunk_id="5549b016b0815316776e7f21d3a793d8"
2017-07-18 22:25:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:10 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2017-07-18 22:25:10 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:10 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2017-07-18 22:25:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2017-07-18 22:25:13 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:13 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2017-07-18 22:25:17 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:17 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:25 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2017-07-18 22:25:25 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:25 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:25:41 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2017-07-18 22:25:41 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:25:41 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:26:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2017-07-18 22:26:11 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:26:11 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:27:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2017-07-18 22:27:09 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:27:09 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:29:15 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2017-07-18 22:29:15 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:29:15 +0000 [warn]: #0 suppressed same stacktrace
2017-07-18 22:33:24 +0000 [warn]: #0 failed to flush the buffer. retry_time=9 next_retry_seconds=2017-07-18 22:33:24 +0000 chunk="5541970ab3b8c59d8f012cd3222cc93c" error_class=TypeError error="can't dup NilClass"
  2017-07-18 22:33:24 +0000 [warn]: #0 suppressed same stacktrace

Given that this hasn't happened in the last month, and that the chunk it references (buffer.q5541970ab3b8c59d8f012cd3222cc93c.log) was created right around the time the power failed, I think the chunk was corrupted by an incomplete write. This is a problem because the error message makes no sense, and a crash shouldn't cause corruption or data loss for data that was already written to disk.

It doesn't even start sending the other chunks that aren't corrupted; I have to shut down fluentd, remove the bad chunk, and restart it to get it to continue sending.

I can send you the bad chunk and meta file if you want.

notslang · Jul 20 '17

around the time the power failed

Does this mean the server running fluentd hit an unexpected machine shutdown? Hmm... in such cases, automatic recovery is hard, so fluentd would need to check the files, and corrupted chunks should be routed somewhere else, e.g. a local file. Is the corrupted file the metadata file or the actual chunk file?
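
(Editor's note: for getting failed data onto local disk today, a minimal sketch using the existing <secondary> section with secondary_file follows. This only kicks in once retries are exhausted and is not the automatic corrupted-chunk handling discussed here; the hosts and paths are examples, not from this thread.)

<match **>
  @type forward
  <server>
    host 192.168.1.1                      # example upstream host
    port 24224
  </server>
  <buffer>
    @type file
    path /var/log/fluent/forward-buffer   # example buffer path
  </buffer>
  # chunks that exhaust their retries are handed to the secondary output
  <secondary>
    @type secondary_file
    directory /var/log/fluent/failed      # example dump location
    basename forward-failed
  </secondary>
</match>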

repeatedly · Jul 21 '17

Does this mean the server running fluentd hit an unexpected machine shutdown?

Yes - the power cut off, so nothing had time to shut down.

I don't know if it's the metadata file or the actual chunk file. However, it shouldn't be too hard to just omit whatever incomplete data was written to the end of the file, right?

I don't expect fluentd to save data that was submitted between the last successful flush to the file buffer and the crash. However, that shouldn't cause the entire chunk to fail, and the error/warning printed when fluentd encounters bad data should be better... If I didn't know about the power failure, I would be really confused.

notslang · Jul 21 '17

I got this error again after running out of disk space on a node.

notslang · Sep 10 '17

I'm experiencing this issue as well. It's occurring on fluentd restart, not a machine crash. It doesn't happen consistently, but is happening frequently.

mchesler · Nov 27 '17

@mchesler Does this mean you hit a disk space issue and it caused a broken chunk?

BTW, I will add a backup feature to move bad chunks to another place.

repeatedly · Nov 28 '17

@repeatedly None of the machines where this has happened to me have come anywhere close to filling their disks; they've simply had the fluentd process restarted.

mchesler · Nov 28 '17

Experiencing the same; the fluentd processes were OOMKilled beforehand. @repeatedly did you have any chance to work on the "move bad chunks to another place" feature, or could you point me to the code handling this?

johanneswuerbach · Mar 09 '18

I am using the Elasticsearch plugin with a file buffer to send data to ES. When indexing a document into Elasticsearch raises an exception, the chunk is retained for retry, but the retry can never succeed. Because of the repeated retry attempts, the other chunks are not processed, which blocks the data flow. Is there a way to clear or move the bad chunk?

kumaravel29 · May 01 '18

+1, we experience the exact same issue as @kumaravel29. The problem is actually so pervasive that we're in the process of ripping td-agent out of our infrastructure.

naegelin · Jun 11 '18

@kumaravel29 try using tags: https://github.com/uken/fluent-plugin-elasticsearch#retry_tag
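
(Editor's note: a minimal sketch of this retry_tag suggestion, assuming the standard fluent-plugin-elasticsearch parameters; the host, tag names, and buffer path are illustrative, not from this thread.)

<match app.**>
  @type elasticsearch
  host es.example.com          # example host
  port 9200
  retry_tag retry.es           # failed chunks are re-emitted under this tag
  <buffer>
    @type file
    path /var/log/fluent/es-buffer   # example buffer path
  </buffer>
</match>

# a separate match for the retried records, so a bad chunk does not
# keep blocking the primary pipeline
<match retry.es>
  @type elasticsearch
  host es.example.com
  port 9200
</match>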

ku-s-h · Oct 24 '18

FYI: since v1.2.0 (https://github.com/fluent/fluentd/blob/master/CHANGELOG.md#release-v120---20180430), output plugins support backing up broken chunks. It may help you.
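
(Editor's note: a minimal sketch of enabling that backup location. When root_dir is set in the <system> section, fluentd moves chunks that hit unrecoverable errors into a backup directory beneath it; the path below is an example.)

<system>
  # broken/unrecoverable chunks are written under <root_dir>/backup/
  root_dir /var/log/fluent
</system>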

kenhys · Oct 29 '20