chunkio: corrupt files are not removed after format check failure
Bug Report
Describe the bug
This is the same issue as #2472: fluent-bit logs a large number of "format check failed" errors.
fluent-bit ends up accumulating a lot of corrupt files (I had more than 6,000 files in three weeks before I manually deleted them). I don't think files should become corrupt this often, but that's a separate issue.
The exact steps to reproduce this are still unknown.
To Reproduce
- Example log message if applicable:
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635124800.968626242.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634936209.242908772.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635050818.818578912.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634958778.842931521.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635205582.68650748.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634948938.792934702.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634949038.342960395.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635068888.54486252.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635127142.218614831.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635060898.868574974.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/176355-1634954400.142932279.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635197997.218599676.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/153481-1635126782.593821399.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/12048-1635076412.268625949.flb
[2021/10/29 21:46:07] [error] [storage] format check failed: emitter.3/358011-1635205497.218642296.flb
- Steps to reproduce the problem: not known
Expected behavior
fluent-bit deletes (or at least skips) the corrupt file instead of reprocessing it.
Your Environment
- Version used: 1.8.9
- Configuration: invoked by google-cloud-ops-agent
- Environment name and version (e.g. Kubernetes? What version?): Google Compute Engine
- Operating System and version: Debian 11
- Filters and plugins: in_tail
[2021/11/03 20:00:14] [ info] [storage] version=1.1.5, initializing...
[2021/11/03 20:00:14] [ info] [storage] root path '/var/lib/google-cloud-ops-agent/fluent-bit/buffers'
[2021/11/03 20:00:14] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/03 20:00:14] [ info] [storage] backlog input plugin: storage_backlog.3
[2021/11/03 20:00:14] [ info] [cmetrics] version=0.2.2
[2021/11/03 20:00:14] [ info] [input:storage_backlog:storage_backlog.3] queue memory limit: 47.7M
Additional context
I think there's a case that isn't handled by the chunkio library.
mmap_file() emits the "format check failed" message and returns CIO_CORRUPTED. There's a call path flb_engine_dispatch > flb_input_chunk_flush > cio_chunk_up > cio_file_up > _cio_file_up > mmap_file.
_cio_file_up() only treats a return value of exactly CIO_ERROR as an error, so CIO_CORRUPTED falls through unhandled. The value propagates back to flb_engine_dispatch as NULL (there's another path to mmap_file through cio_chunk_get_content, but again the caller only checks whether the return value is -1), so the flush is essentially a no-op and the same corrupt chunk is processed again on the next cycle.
fluent-bit also exits immediately when corrupt files are present:
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207481.939506184.flb
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207500.924286502.flb
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207520.924198297.flb
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207541.436151548.flb
[2021/11/03 20:04:49] [ info] [input:storage_backlog:storage_backlog.3] register tail.1/358011-1635207560.924753282.flb
[2021/11/03 20:04:49] [error] [storage] format check failed: emitter.5/1538-1634231033.970464246.flb
[2021/11/03 20:04:49] [error] [engine] could not segregate backlog chunks
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopping...
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #0 stopped
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #1 stopping...
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #1 stopped
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #2 stopping...
[2021/11/03 20:04:50] [ info] [output:stackdriver:stackdriver.0] thread worker #2 stopped
@edsiper can confirm this, but I think this was fixed by https://github.com/fluent/fluent-bit/commit/476a3e2c8262036dc36abcb06aefe1329ba8daa6, which was part of Fluent Bit 1.8.9.
However, 1.8.9 has a separate issue (https://github.com/fluent/fluent-bit/issues/4262, spamming errors in the log about failed connections to the backend in 1.8.9), and we are waiting for a fix for that before upgrading Fluent Bit to pick up this fix.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
/unstale
I am having the same issue on 1.8.12
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235427.640856682.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235430.417979474.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235432.694286469.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235435.360446477.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235437.868210348.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235440.834408777.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235443.405261013.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235445.970820359.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235448.931482174.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235451.880113456.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235454.193451130.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235456.510927192.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235458.992008351.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235461.361955821.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235464.214223209.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235466.580794364.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235469.121999763.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235471.562680073.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235474.116397542.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235476.609435547.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235478.928214444.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235481.320918103.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235483.524247591.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235486.252512087.flb
[2022/03/14 09:45:36] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/8-1647235488.664946659.flb
[2022/03/14 09:45:36] [error] [storage] format check failed: tail.0/8-1647235496.488950766.flb
[2022/03/14 09:45:36] [error] [engine] could not segregate backlog chunks
Below is my config file:
[Service]
Flush 5
Daemon off
Log_Level info
Log_File /bp2/log/mcp-fluentbit.log
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
storage.path /bp2/data/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 500M
storage.metrics On
[INPUT]
Name tail
Buffer_Max_Size 100000
Skip_Long_Lines On
storage.type filesystem
Path /custom_log_location/blueplanet.log
DB /bp2/data/tail.db
[FILTER]
Name lua
Match *
script /bp2/scripts/parser2.lua
call cb_replace
[OUTPUT]
Name es
Match *
Host esinternal7
Port 9200
Logstash_Format On
Logstash_Prefix logs
Trace_Error On
Trace_Output Off
Time_Key fbit_timestamp
Fluent Bit run command output:
root@2d20a6aba432:/bp2/src# /opt/td-agent-bit/bin/td-agent-bit -c config/fbit-conf.toml -v
Fluent Bit v1.8.12
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/03/14 09:54:21] [ info] [input] pausing tail.0
[2022/03/14 09:54:21] [ info] [input] pausing storage_backlog.1
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
No-stale please!
Is the issue reproducible after the fix https://github.com/fluent/fluent-bit/pull/5266 was merged?
We had some similar issues, but they seem to be resolved after this fix.
I have the same issue and it is still reproducible.
Errors in the log:
[2022/08/09 10:28:13] [error] [input chunk] error writing data from tail.0 instance
[2022/08/09 10:28:13] [error] [storage] format check failed: tail.0/11762-1660012205.396769419.flb
[2022/08/09 10:28:13] [error] [storage] [cio file] file is not mmap()ed: tail.0:11762-1660012205.396769419.flb
Stack trace:
#0 mmap_file (ctx=0x7f18c9811080, ch=ch@entry=0x7f18c64d56c0, size=<optimized out>) at lib/chunkio/src/cio_file.c:396
#1 0x000000000054606a in _cio_file_up (ch=ch@entry=0x7f18c64d56c0, enforced=enforced@entry=1) at lib/chunkio/src/cio_file.c:891
#2 0x0000000000545f9a in cio_file_up (ch=ch@entry=0x7f18c64d56c0) at lib/chunkio/src/cio_file.c:920
#3 0x0000000000545063 in cio_chunk_up (ch=0x7f18c64d56c0) at lib/chunkio/src/cio_chunk.c:542
#4 0x0000000000c569dc in flb_input_chunk_flush (ic=ic@entry=0x7f18c8dcb880, size=size@entry=0x7f18ca5fa410)
at src/flb_input_chunk.c:1615
#5 0x0000000000c47e9d in flb_engine_dispatch (id=id@entry=0, in=in@entry=0x7f18caa49000, config=config@entry=0x7f18caa0c800)
at src/flb_engine_dispatch.c:249
#6 0x0000000000c331b8 in flb_engine_flush (config=0x7f18caa0c800, in_force=0x0) at src/flb_engine.c:117
#7 flb_engine_handle_event (fd=37, config=0x7f18caa0c800, mask=<optimized out>) at src/flb_engine.c:425
#8 flb_engine_start (config=<optimized out>, config@entry=0x7f18caa0c800) at src/flb_engine.c:763
#9 0x0000000000c5855d in flb_lib_worker (data=0x7f18caa0b000) at src/flb_lib.c:626
#10 0x00007f18cb222609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
I tried patch #5858 and it helps: Fluent Bit skips the corrupted file and continues processing logs.
Do you have storage.checksum on configured in the service section to skip the corrupted chunks?
Yes. My config is:
[SERVICE]
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 127.0.0.1
HTTP_PORT 2020
Health_Check On
HC_Errors_Count 5
HC_Retry_Failure_Count 5
HC_Period 60
Storage.Path /run/fluent-bit-storage/
Storage.Max_Chunks_Up 64
Storage.Backlog.Mem_Limit 32M
Storage.Checksum On
Storage.Metrics On
Flush 2
[INPUT]
Name tail
Path ${LOGS_DIR}/*.log
Parser grut_log_parser
Read_from_Head True
Skip_Long_Lines True
DB ${LOGS_DIR}/fluent-bit.db
DB.locking True
Buffer_Chunk_Size 2M
Buffer_Max_Size 32M
Mem_Buf_Limit 128M
Storage.Type filesystem
[FILTER]
Name lua
Match *
script convert_to_clickhouse_record.lua
call convert_to_clickhouse_record
[OUTPUT]
Name http
Match *
Host ${CLICKHOUSE_HOST}
Port ${CLICKHOUSE_PORT}
URI /?database=default&query=INSERT%20INTO%20${CLICKHOUSE_TABLE}%20FORMAT%20JSONEachRow
Format json_stream
Json_Date_Key False
Header X-ClickHouse-User ${CLICKHOUSE_USER}
Header X-ClickHouse-Key ${CLICKHOUSE_PASSWORD}
Tls On
Retry_Limit 10
OK, thanks for the update. @edsiper Hi Eduardo, could you please take a look at https://github.com/fluent/fluent-bit/pull/5858 then? Thanks!
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.