flb_encoding: charset encoding for input plugins

Open bluebike opened this issue 5 years ago • 13 comments

Adds the flb_encoding library for converting input from a given charset to UTF-8 (CHARSET => UTF8).

  • Uses the lib/tutf8e library.
  • Only 8-bit charsets are supported.
  • The in_tail plugin is supported first; in_syslog and others will follow.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing

Before we can approve your change, please submit the following in a comment:

  • [ ] Example configuration file for the change
  • [x] Debug log output from testing the change
  • [ ] Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • [x] Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

bluebike avatar Aug 04 '20 20:08 bluebike

I built this on top of #2326 (those commits seem to be included in this PR), so I would like #2326 to be merged before this one. TODO: example run, configuration, valgrind. @nigels-com @edsiper would this be usable?

bluebike avatar Aug 04 '20 20:08 bluebike

#2326 was just merged. @bluebike @nigels-com can you remove the tutf8e pieces from this PR?

In addition, ideally we want separate commits for the core "encoding" interface and for the plugins being improved.

edsiper avatar Aug 07 '20 23:08 edsiper

Rebased on master (with #2326 merged). In two commits:

  1. flb_encoding
  2. in_tail changes

TODO(?)

  • [x] add in_syslog support
  • [x] test example

bluebike avatar Aug 11 '20 11:08 bluebike

Added encoding support to in_syslog. Compiling with FLB_UTF8_ENCODER=No also worked; I didn't see any related warnings in my environment (macOS 10.13).

bluebike avatar Aug 11 '20 17:08 bluebike

Made the requested changes:

  • memory allocation is now checked
  • formatting checked
  • rebased to get 3 clean commits (I'll add test examples soon)

bluebike avatar Aug 11 '20 18:08 bluebike

in_tail: a simple test run in the shell. The input contains the characters ä, Ö and € encoded in windows-1252 (cp1252).


$  echo $'Test data'    > huuhaa.txt
$  echo $'This contains a+dots: \xe4 O+dots: \xd6. trailing data' >> huuhaa.txt
$  echo $'This contains euro character: \x80' >> huuhaa.txt
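
As a cross-check that does not depend on Fluent Bit, the same windows-1252 bytes can be decoded with iconv to see what a correct UTF-8 conversion looks like. This is only a sketch: to_utf8 is a hypothetical helper name (not part of this PR), it assumes an iconv that accepts the WINDOWS-1252 charset name, and the bytes are written in octal so that a plain POSIX printf works.

```shell
# Hypothetical helper (not part of this PR): decode windows-1252 from stdin to UTF-8.
to_utf8() {
    iconv -f WINDOWS-1252 -t UTF-8
}

# Same bytes as in huuhaa.txt, written in octal: \344 = 0xe4, \326 = 0xd6, \200 = 0x80.
sample='a+dots: \344 O+dots: \326 euro: \200\n'

# Raw single-byte input, dumped as hex.
printf "$sample" | od -An -tx1

# The UTF-8 text a correct conversion should yield.
printf "$sample" | to_utf8
```

The second command should print the same ä, Ö and € characters that show up in the stdout records of the Fluent Bit run.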



$   bin/fluent-bit -v -i tail -p path=huuhaa.txt -p 'encoding=windows-1252'  -o stdout

Fluent Bit v1.6.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/08/11 22:06:42] [ info] Configuration:
[2020/08/11 22:06:42] [ info]  flush time     | 5.000000 seconds
[2020/08/11 22:06:42] [ info]  grace          | 5 seconds
[2020/08/11 22:06:42] [ info]  daemon         | 0
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  inputs:
[2020/08/11 22:06:42] [ info]      tail
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  filters:
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  outputs:
[2020/08/11 22:06:42] [ info]      stdout.0
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  collectors:
[2020/08/11 22:06:42] [ info] [engine] started (pid=47950)
[2020/08/11 22:06:42] [debug] [engine] coroutine stack size: 12288 bytes (12.0K)
[2020/08/11 22:06:42] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/08/11 22:06:42] [ info] [storage] version=1.0.5, initializing...
[2020/08/11 22:06:42] [ info] [storage] in-memory
[2020/08/11 22:06:42] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] scanning path huuhaa.txt
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] inode=13109344 appended as huuhaa.txt
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] scan_glob add(): huuhaa.txt, inode 13109344
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] 1 new files found on path 'huuhaa.txt'
[2020/08/11 22:06:42] [debug] [router] default match rule tail.0:stdout.0
[2020/08/11 22:06:42] [ info] [sp] stream processor started
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] inode=13109344 file=huuhaa.txt promote to TAIL_EVENT
[2020/08/11 22:06:47] [debug] [task] created task=0x7f9b36d00a30 id=0 OK
[0] tail.0: [1597172802.983562000, {"log"=>"Test data"}]
[1] tail.0: [1597172802.983579000, {"log"=>"This contains a+dots: ä O+dots: Ö. trailing data"}]
[2] tail.0: [1597172802.983580000, {"log"=>"This contains euro character: €"}]
^C[engine] caught signal (SIGINT)
[2020/08/11 22:06:52] [ info] [input] pausing tail.0
[2020/08/11 22:06:52] [debug] [input:tail:tail.0] inode=13109344 removing file name huuhaa.txt
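
The same in_tail run could also be written as a configuration file. A minimal sketch, assuming the Encoding property added by this PR is available (a build without this PR will not know it); the file name tail-cp1252.conf is made up, and Log_Level debug only roughly corresponds to -v.

```shell
# Write a classic-mode config equivalent to the command line used above.
# Encoding is the property proposed in this PR; the file name is arbitrary.
cat > tail-cp1252.conf <<'EOF'
[SERVICE]
    Log_Level debug

[INPUT]
    Name     tail
    Path     huuhaa.txt
    Encoding windows-1252

[OUTPUT]
    Name     stdout
    Match    *
EOF
```

It would then be started with bin/fluent-bit -c tail-cp1252.conf.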

bluebike avatar Aug 11 '20 19:08 bluebike

in_syslog: test run using UDP syslog messages.

# Start fluent-bit first, then send these from a different terminal (bash shell)

$ echo $'<135>Aug 11 20:27:22 myhost test: nothing'   | nc -w 1 -u 127.0.0.1  7700
$ echo $'<135>Aug 11 20:27:22 myhost test: euro: \x80'   | nc -w 1 -u 127.0.0.1  7700
$ echo $'<135>Aug 11 20:27:22 myhost test:  tama: t\xe4m\xe4'   | nc -w 1 -u 127.0.0.1  7700



$ bin/fluent-bit -v -R ../conf/parsers.conf -i syslog -p mode=udp  -p port=7700 -p Parser=syslog-rfc3164-local  -p encoding=windows-1252 -o stdout

Fluent Bit v1.6.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/08/11 22:13:51] [ info] Configuration:
[2020/08/11 22:13:51] [ info]  flush time     | 5.000000 seconds
[2020/08/11 22:13:51] [ info]  grace          | 5 seconds
[2020/08/11 22:13:51] [ info]  daemon         | 0
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  inputs:
[2020/08/11 22:13:51] [ info]      syslog
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  filters:
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  outputs:
[2020/08/11 22:13:51] [ info]      stdout.0
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  collectors:
[2020/08/11 22:13:51] [ info] [engine] started (pid=48082)
[2020/08/11 22:13:51] [debug] [engine] coroutine stack size: 12288 bytes (12.0K)
[2020/08/11 22:13:51] [debug] [storage] [cio stream] new stream registered: syslog.0
[2020/08/11 22:13:51] [ info] [storage] version=1.0.5, initializing...
[2020/08/11 22:13:51] [ info] [storage] in-memory
[2020/08/11 22:13:51] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/11 22:13:51] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2020/08/11 22:13:51] [ info] [in_syslog] UDP server binding 0.0.0.0:7700
[2020/08/11 22:13:51] [debug] [router] default match rule syslog.0:stdout.0
[2020/08/11 22:13:51] [ info] [sp] stream processor started
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"nothing"}]
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"euro: €"}]
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"tama: tämä"}]
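
For reference, the in_syslog run could be expressed as a configuration file as well. Again only a sketch under assumptions: Encoding exists only with this PR applied, syslog-cp1252.conf is an arbitrary file name, and the Parsers_File path simply repeats the -R argument from the command line and may need adjusting.

```shell
# Write a classic-mode config mirroring the in_syslog command line above.
# Encoding is the property proposed in this PR; the file name is arbitrary.
cat > syslog-cp1252.conf <<'EOF'
[SERVICE]
    Log_Level    debug
    Parsers_File ../conf/parsers.conf

[INPUT]
    Name     syslog
    Mode     udp
    Port     7700
    Parser   syslog-rfc3164-local
    Encoding windows-1252

[OUTPUT]
    Name     stdout
    Match    *
EOF
```

It would be started with bin/fluent-bit -c syslog-cp1252.conf before sending the test messages with nc.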

bluebike avatar Aug 11 '20 19:08 bluebike

Added documentation PR https://github.com/fluent/fluent-bit-docs/pull/410 and fixed a possible memory leak in flb_encoding_open when an allocation fails.

bluebike avatar Oct 28 '20 18:10 bluebike

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Apr 28 '21 01:04 github-actions[bot]

thanks. I wrote some comments.

About the second commit, note that it must be prefixed with in_tail: .....

Changed commit message.

bluebike avatar Jul 29 '21 18:07 bluebike

Any news on this? Would love to see this merged!

kingjan1999 avatar Nov 22 '21 20:11 kingjan1999

assigned to @nokute78 for review

edsiper avatar Dec 13 '21 00:12 edsiper

Could we get this included in mainline ASAP? I assume I am not the only one with non-UTF8 log entries.

As a user, I do not really care how this is implemented, but as for configuration, doing it on the input side (i.e. in the tail plugin) seems the most convenient.

hpernu avatar Apr 18 '23 11:04 hpernu

Hi, I implemented a generic conversion engine for character encodings, which covers GBK, ShiftJIS, UHC and the Windows-125X family, here: https://github.com/fluent/fluent-bit/pull/10464 https://github.com/fluent/fluent-bit/pull/10542

Is this enough for most of the use cases listed in this PR?

cosmo0920 avatar Jul 23 '25 07:07 cosmo0920