chunkio: corrupt files are not removed after format check failure
Bug Report
Describe the bug
This is the same issue as #2472: fluent-bit logs a large number of "format check failed" errors.
fluent-bit ends up accumulating a lot of corrupt files (I had more than 6,000 files in three weeks before I manually deleted them). I don't think files should become corrupt this often, but that's a separate issue.
The exact steps to reproduce this are still unknown.
To Reproduce
- Example log message if applicable:
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635124800.968626242.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634936209.242908772.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635050818.818578912.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634958778.842931521.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635205582.68650748.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634948938.792934702.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634949038.342960395.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635068888.54486252.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635127142.218614831.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635060898.868574974.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634954400.142932279.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635197997.218599676.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635126782.593821399.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635076412.268625949.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635205497.218642296.flb
- Steps to reproduce the problem: not known
Expected behavior
fluent-bit deletes (or at least skips) the corrupt file instead of reprocessing it.
Your Environment
- Version used: 1.8.9
- Configuration: invoked by google-cloud-ops-agent
- Environment name and version (e.g. Kubernetes? What version?): Google Compute Engine
- Operating System and version: Debian 11
- Filters and plugins: in_tail
[2021/11/03 20:00:14] [ info] [storage] version=1.1.5, initializing...
[2021/11/03 20:00:14] [ info] [storage] root path '/var/lib/google-cloud-ops-agent/fluent-bit/buffers'
[2021/11/03 20:00:14] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/03 20:00:14] [ info] [storage] backlog input plugin: storage_backlog.3
[2021/11/03 20:00:14] [ info] [cmetrics] version=0.2.2
[2021/11/03 20:00:14] [ info] [input:storage_backlog:storage_backlog.3] queue memory limit: 47.7M
Additional context
I think there's a case that isn't handled by the chunkio library.
mmap_file() emits the "format check failed" message and returns CIO_CORRUPTED. There's a call path flb_engine_dispatch > flb_input_chunk_flush > cio_chunk_up > cio_file_up > _cio_file_up > mmap_file.
_cio_file_up() only treats a return value of exactly CIO_ERROR as an error, so CIO_CORRUPTED falls through unhandled. The value propagates back to flb_engine_dispatch as NULL (there's another path to mmap_file through cio_chunk_get_content, but again the caller only checks whether the return value is -1), so the flush is essentially a no-op and the same corrupt chunk is processed again on the next cycle.
fluent-bit also exits immediately when corrupt files are present:
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207481.939506184.flb
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207500.924286502.flb
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207520.924198297.flb
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207541.436151548.flb
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207560.924753282.flb
[2021/11/03 20:04:49] [error] [storage] format check failed: emitter.5/1538-1634231033.970464246.flb
[2021/11/03 20:04:49] [error] [engine] could not segregate backlog chunks
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopping...
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopped
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #1 stopping...
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #1 stopped
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #2 stopping...
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #2 stopped
@edsiper can confirm this, but I think this was fixed by https://github.com/fluent/fluent-bit/commit/476a3e2c8262036dc36abcb06aefe1329ba8daa6, which was part of Fluent Bit 1.8.9.
However, 1.8.9 has a separate issue (https://github.com/fluent/fluent-bit/issues/4262, spamming errors in the log about failed connections to the backend in 1.8.9), and we are waiting for a fix for that before upgrading Fluent Bit to pick up this fix.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
/unstale
I am having the same issue on 1.8.12
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235427.640856682.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235430.417979474.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235432.694286469.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235435.360446477.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235437.868210348.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235440.834408777.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235443.405261013.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235445.970820359.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235448.931482174.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235451.880113456.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235454.193451130.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235456.510927192.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235458.992008351.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235461.361955821.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235464.214223209.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235466.580794364.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235469.121999763.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235471.562680073.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235474.116397542.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235476.609435547.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235478.928214444.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235481.320918103.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235483.524247591.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235486.252512087.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235488.664946659.flb
[2022/03/14 09:45:36] [error] [storage] format check failed: tail.0/8-1647235496.488950766.flb
[2022/03/14 09:45:36] [error] [engine] could not segregate backlog chunks
Below is my config file:
[Service]
Flush 5
Daemon off
Log_Level info
Log_File /bp2/log/mcp-fluentbit.log
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
storage.path /bp2/data/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 500M
storage.metrics On
[INPUT]
Name tail
Buffer_Max_Size 100000
Skip_Long_Lines On
storage.type filesystem
Path /custom_log_location/blueplanet.log
DB /bp2/data/tail.db
[FILTER]
Name lua
Match *
script /bp2/scripts/parser2.lua
call cb_replace
[OUTPUT]
Name es
Match *
Host esinternal7
Port 9200
Logstash_Format On
Logstash_Prefix logs
Trace_Error On
Trace_Output Off
Time_Key fbit_timestamp
Fluent Bit run command output:
root@2d20a6aba432:/bp2/src# /opt/td-agent-bit/bin/td-agent-bit -c config/fbit-conf.toml -v
Fluent Bit v1.8.12
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/03/14 09:54:21] [ info] [input] pausing tail.0
[2022/03/14 09:54:21] [ info] [input] pausing storage_backlog.1
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
No-stale please!
Is the issue reproducible after the fix https://github.com/fluent/fluent-bit/pull/5266 was merged?
We had some similar issues, but they seem to be resolved after this fix.
I have the same issue and it is still reproducible.
Errors in the log:
[2022/08/09 10:28:13] [error] [input chunk] error writing data from tail.0 instance
[2022/08/09 10:28:13] [error] [storage] format check failed: tail.0/11762-1660012205.396769419.flb
[2022/08/09 10:28:13] [error] [storage] [cio file] file is not mmap()ed: tail.0:11762-1660012205.396769419.flb
Stack trace:
#0 mmap_file (ctx=0x7f18c9811080, ch=ch@entry=0x7f18c64d56c0, size=<optimized out>) at lib/chunkio/src/cio_file.c:396
#1 0x000000000054606a in _cio_file_up (ch=ch@entry=0x7f18c64d56c0, enforced=enforced@entry=1) at lib/chunkio/src/cio_file.c:891
#2 0x0000000000545f9a in cio_file_up (ch=ch@entry=0x7f18c64d56c0) at lib/chunkio/src/cio_file.c:920
#3 0x0000000000545063 in cio_chunk_up (ch=0x7f18c64d56c0) at lib/chunkio/src/cio_chunk.c:542
#4 0x0000000000c569dc in flb_input_chunk_flush (ic=ic@entry=0x7f18c8dcb880, size=size@entry=0x7f18ca5fa410)
at src/flb_input_chunk.c:1615
#5 0x0000000000c47e9d in flb_engine_dispatch (id=id@entry=0, in=in@entry=0x7f18caa49000, config=config@entry=0x7f18caa0c800)
at src/flb_engine_dispatch.c:249
#6 0x0000000000c331b8 in flb_engine_flush (config=0x7f18caa0c800, in_force=0x0) at src/flb_engine.c:117
#7 flb_engine_handle_event (fd=37, config=0x7f18caa0c800, mask=<optimized out>) at src/flb_engine.c:425
#8 flb_engine_start (config=<optimized out>, config@entry=0x7f18caa0c800) at src/flb_engine.c:763
#9 0x0000000000c5855d in flb_lib_worker (data=0x7f18caa0b000) at src/flb_lib.c:626
#10 0x00007f18cb222609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
I tried patch #5858 and it helps: Fluent Bit skips the corrupted file and continues processing logs.
Do you have storage.checksum on configured in the service section to skip the corrupted chunks?
Yes. My config is:
[SERVICE]
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 127.0.0.1
HTTP_PORT 2020
Health_Check On
HC_Errors_Count 5
HC_Retry_Failure_Count 5
HC_Period 60
Storage.Path /run/fluent-bit-storage/
Storage.Max_Chunks_Up 64
Storage.Backlog.Mem_Limit 32M
Storage.Checksum On
Storage.Metrics On
Flush 2
[INPUT]
Name tail
Path ${LOGS_DIR}/*.log
Parser grut_log_parser
Read_from_Head True
Skip_Long_Lines True
DB ${LOGS_DIR}/fluent-bit.db
DB.locking True
Buffer_Chunk_Size 2M
Buffer_Max_Size 32M
Mem_Buf_Limit 128M
Storage.Type filesystem
[FILTER]
Name lua
Match *
script convert_to_clickhouse_record.lua
call convert_to_clickhouse_record
[OUTPUT]
Name http
Match *
Host ${CLICKHOUSE_HOST}
Port ${CLICKHOUSE_PORT}
URI /?database=default&query=INSERT%20INTO%20${CLICKHOUSE_TABLE}%20FORMAT%20JSONEachRow
Format json_stream
Json_Date_Key False
Header X-ClickHouse-User ${CLICKHOUSE_USER}
Header X-ClickHouse-Key ${CLICKHOUSE_PASSWORD}
Tls On
Retry_Limit 10
OK, thanks for the update. @edsiper Hi Eduardo, could you please take a look at https://github.com/fluent/fluent-bit/pull/5858 then? Thanks!
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.