chunkio icon indicating copy to clipboard operation
chunkio copied to clipboard

scan: delete corrupted chunks

Open jamiees2 opened this issue 4 years ago • 6 comments

When we scan for chunks on startup, we load the list into memory, but exclude all chunks which are ultimately corrupted. This means that the library consumers are never actually informed about the corrupted chunks, and as a result, can't act on them to delete them.

This just deletes corrupted chunks on scan, but I'm interested in if we want to place this behind a config flag (and enable it in fluent bit), or want to do something more.

This does fix our immediate issue with 5 minutes of corrupted chunks on startup though :)

jamiees2 avatar Aug 22 '21 12:08 jamiees2

Before:

ESC[1mFluent Bit v1.8.3ESC[0m
* ESC[1mESC[93mCopyright (C) 2019-2021 The Fluent Bit AuthorsESC[0m
* ESC[1mESC[93mCopyright (C) 2015-2018 Treasure DataESC[0m
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/08/17 19:06:36] [ info] [engine] started (pid=22959)
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/24108-1627036093.415782167.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626681987.260484982.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/27785-1628312566.811558862.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626635752.528011015.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626680693.425352689.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626596863.504571931.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626652882.737676217.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626636338.941414083.flb
[2021/08/17 19:06:36] [error] [storage] format check failed: forward.0/12071-1626684643.644410485.flb

for 5 minutes

After:

[2021/08/22 11:45:59] [ info] [engine] started (pid=16928)
[2021/08/22 11:45:59] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2021/08/22 11:45:59] [debug] [storage] [cio scan] opening path /var/log/td-agent-bit/cache/
[2021/08/22 11:45:59] [debug] [storage] [cio stream] new stream registered: forward.0
[2021/08/22 11:45:59] [debug] [storage] [cio scan] opening stream forward.0
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626637721.508111593.flb//var/log/td-agent-bit/cache//forward.0/10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626637721.508111593.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626687577.352016009.flb//var/log/td-agent-bit/cache//forward.0/10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626687577.352016009.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 17840-1628486295.657410446.flb//var/log/td-agent-bit/cache//forward.0/17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 17840-1628486295.657410446.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626677477.220090314.flb//var/log/td-agent-bit/cache//forward.0/10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626677477.220090314.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 27637-1627293232.563310409.flb//var/log/td-agent-bit/cache//forward.0/27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 27637-1627293232.563310409.flb
[2021/08/22 11:45:59] [debug] [storage] [cio file] invalid crc32 at 10083-1626599501.871938221.flb//var/log/td-agent-bit/cache//forward.0/10083-1626599501.871938221.flb
[2021/08/22 11:45:59] [error] [storage] format check failed: forward.0/10083-1626599501.871938221.flb
[2021/08/22 11:45:59] [debug] [storage] [cio scan] Found corrupted chunk 10083-1626599501.871938221.flb

The startup still took 5+ minutes, but the next startup was super fast :D

jamiees2 avatar Aug 22 '21 12:08 jamiees2

hey @edsiper, is there a chance you can take a look at this PR soon/propose an alternative way to address this bug?

jamiees2 avatar Aug 24 '21 22:08 jamiees2

I am thinking about what would be the ideal use case. Note that a corrupted chunk can have "many" valid records, what we do in Fluent Bit is process all valid chunks that exist in a corrupted chunk. If we just skip that functionality we will be in a worse situation...

edsiper avatar Aug 25 '21 00:08 edsiper

so I spent some time looking through this today, I believe this isn't correct, as far as I can tell the execution flow seems to be the following:

On startup, fluent bit calls cio_load, which eventually calls out to cio_scan_stream_files with a directory. cio_scan_stream_files loops through the files in this directory and runs cio_chunk_open over each of them.

cio_chunk_open takes this chunk, allocates a struct cio_chunk for it, and starts populating it, and adds it to the struct cio_stream -> chunks list It eventually populates the backend using cio_file_open if the stream is backed by the filesystem.

cio_file_open will check if it should open the file, try to open it, and then try to mmap it. If any of these steps fail, it will return NULL, and populate err.

If cio_chunk_open receives a NULL as a return value, it will delete the chunk from the struct cio_stream -> chunks list and free everything. So as a result, from fluent bit's perspective, the file is never loaded at all, and all records of it are cleaned up.

As a result, my belief is that we actually never ever load these files, since the only way we scan them is through cio_scan_stream_files, which means we can't use them to load data.

I'm wondering if there's something I'm missing here, could you potentially link me to some code?

jamiees2 avatar Aug 26 '21 17:08 jamiees2

If cio_chunk_open receives a NULL as a return value, it will delete the chunk from the struct cio_stream -> chunks list and free everything. So as a result, from fluent bit's perspective, the file is never loaded at all, and all records of it are cleaned up.

which line is that ?, what conditions would trigger that scenario ?

edsiper avatar Oct 07 '21 22:10 edsiper

I'm referring to these lines: https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_chunk.c#L78 https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_chunk.c#L85-L90

In cio_file.c, cio_file_open returns NULL if mmap_file returns an error, which includes CIO_CORRUPTED, see https://github.com/edsiper/chunkio/blob/510a277bc7c290c7a00012b2a6df5555e35947f6/src/cio_file.c#L615-L620

CIO_CORRUPTED is what is returned for chunks which fail the crc32 format check, so my belief is that these files are not loaded again at all.

jamiees2 avatar Oct 08 '21 11:10 jamiees2