Is the encoding parameter being used as the documentation states?
Describe the bug
The official documentation states the following about encoding for the tail plugin:
encoding, from_encoding
| type | default | version |
| --- | --- | --- |
| string | nil (string encoding is ASCII-8BIT) | 0.14.0 |
Specifies the encoding of reading lines.
By default, in_tail emits string value as ASCII-8BIT encoding. These options change it:
- If encoding is specified, in_tail changes string to encoding. This uses Ruby's String#force_encoding.
- If encoding and from_encoding both are specified, in_tail tries to encode string from from_encoding to encoding. This uses Ruby's String#encode.
source: tail#encoding-from_encoding
I have been checking the Fluentd source code, and:
- Regarding the first bullet: I think the encoding parameter is not being used as the documentation states. I cannot find any call to String#force_encoding that uses the encoding parameter. On the other hand, I have found String#force_encoding called with the from_encoding parameter in a few places. I think line 992 might be wrong: https://github.com/fluent/fluentd/blob/74db9477f445ef83384eca6da8d6c2049945d8cd/lib/fluent/plugin/in_tail.rb#L992 If the documentation is not wrong, String#force_encoding should use the encoding value, not the from_encoding value.
- Regarding the second bullet: it states that String#encode is used when the from_encoding parameter is set, but it seems String#encode is used by default if you set the encoding parameter to something other than ASCII-8BIT, because from_encoding defaults to ASCII-8BIT. For example, String#encode is used if you set the encoding parameter to UTF-8, but according to the documentation String#force_encoding, not String#encode, should be used when you set only the encoding parameter.
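The difference between the two Ruby calls discussed above can be sketched in plain Ruby. This is a minimal illustration and not Fluentd's code: the variable names are made up, and the sample bytes are simply the UTF-8 bytes of "Zürich" from the test log.

```ruby
# Raw bytes as they might be read from a file, tagged ASCII-8BIT (binary).
raw = "Z\xC3\xBCrich".b

# String#force_encoding only relabels the bytes; nothing is converted.
forced = raw.dup.force_encoding("UTF-8")
puts forced.valid_encoding?      # the bytes are valid UTF-8
puts forced.bytes == raw.bytes   # the bytes are unchanged

# String#encode transcodes. Bytes above 0x7F have no mapping from
# ASCII-8BIT to UTF-8, so a strict conversion raises.
strict_error = begin
  raw.encode("UTF-8", "ASCII-8BIT")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
puts strict_error.class

# With replacement options, each unmappable byte becomes U+FFFD.
replaced = raw.encode("UTF-8", "ASCII-8BIT", invalid: :replace, undef: :replace)
puts replaced

# ASCII-only input transcodes cleanly, so a line like "Geneva" is unaffected.
ascii = "Geneva".b.encode("UTF-8", "ASCII-8BIT")
puts ascii
```

The replaced string contains two U+FFFD characters in place of the two bytes of "ü". That would be consistent with the "Z��rich" lines in the error log below coming from a transcoding step rather than from a relabel.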
To Reproduce
Start a Fluentd container with the grok parser plugin installed.
Then run the command:
td-agent --config /home/td-agent/fluentd.conf
Expected behavior
2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
Your Environment
- Fluentd version: 1.11.2
- TD Agent version: 1.11.2
- Operating system: Alma Linux 9
- Kernel version: Linux 5.14.0-284.30.1.el9_2.x86_64 x86_64
Your Configuration
# /home/td-agent/patterns.conf
CUSTOM_LOG_WORKS %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:message}
# HTTPDATE has ä character
# Source: https://github.com/fluent/fluent-plugin-grok-parser/blob/903dfe222984b90c4e1c1151530038d1f242157d/patterns/legacy/grok-patterns#L51
CUSTOM_LOG_FAILS %{HTTPDATE:timestamp} %{NUMBER:response}
# /tmp/encoding-test.log
2023-11-22 18:18:09.823+0100 Testing Zürich
2023-11-22 18:18:09.823+0100 Testing Geneva
# /home/td-agent/fluentd.conf
<source>
  @type tail
  path /tmp/encoding-test.log
  read_from_head true
  encoding UTF-8
  tag encoding
  <parse>
    @type grok
    grok_failure_key grokfailure
    custom_pattern_path /home/td-agent/patterns.conf
    <grok>
      pattern %{CUSTOM_LOG_FAILS:message}
    </grok>
    <grok>
      pattern %{CUSTOM_LOG_WORKS:message}
    </grok>
  </parse>
</source>
<match encoding>
  @type stdout
</match>
Your Error Log
[td-agent@dc60c1c5967e ~]$ /opt/td-agent/bin/fluentd --config /home/td-agent/fluentd.conf
2023-11-23 15:58:45 +0100 [info]: parsing config file is succeeded path="/home/td-agent/fluentd.conf"
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.2.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.1.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-grok-parser' version '2.6.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-kafka' version '0.14.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus' version '1.8.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus_pushgateway' version '0.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-s3' version '1.4.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-td' version '1.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-webhdfs' version '1.2.5'
2023-11-23 15:58:45 +0100 [info]: gem 'fluentd' version '1.11.2'
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:45 +0100 [warn]: 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:45 +0100 [warn]: this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:45 +0100 [info]: using configuration file: <ROOT>
<source>
@type tail
path "/tmp/encoding-test.log"
tag "encoding"
read_from_head true
encoding "UTF-8"
<parse>
@type "grok"
grok_failure_key "grokfailure"
custom_pattern_path "/home/td-agent/patterns.conf"
unmatched_lines
<grok>
pattern "%{CUSTOM_LOG_FAILS:message}"
</grok>
<grok>
pattern "%{CUSTOM_LOG_WORKS:message}"
</grok>
</parse>
</source>
<match encoding>
@type stdout
</match>
</ROOT>
2023-11-23 15:58:45 +0100 [info]: starting fluentd-1.11.2 pid=715 ruby="2.7.1"
2023-11-23 15:58:45 +0100 [info]: spawn command to main: cmdline=["/opt/td-agent/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/td-agent/bin/fluentd", "--config", "/home/td-agent/fluentd.conf", "--under-supervisor"]
2023-11-23 15:58:45 +0100 [info]: adding match pattern="encoding" type="stdout"
2023-11-23 15:58:45 +0100 [info]: adding source type="tail"
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:46 +0100 [warn]: #0 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:46 +0100 [warn]: #0 this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:46 +0100 [info]: #0 starting fluentd worker pid=720 ppid=715 worker=0
2023-11-23 15:58:46 +0100 [info]: #0 following tail of /tmp/encoding-test.log
2023-11-23 15:58:46.005131856 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005146527 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005152826 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005157747 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46 +0100 [info]: #0 fluentd worker is now running worker=0
Additional details
If I set both encoding parameters to UTF-8, I get a warning in the Fluentd logs, but the special characters are rendered correctly. I don't know whether this is the proper way to handle the special characters, since I get a warning. Shouldn't this warning be changed to info?
Configuration
@type tail
path "/tmp/encoding-test.log"
tag "encoding"
read_from_head true
from_encoding "UTF-8"
encoding "UTF-8"
Warning
2023-11-23 14:44:12 +0100 [warn]: #0 fluent/log.rb:348:warn: 'encoding' and 'from_encoding' are same encoding. No effect
Output
2023-11-23 14:44:12.044957269 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 14:44:12.044962081 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
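One possible reading of the two runs (encoding only versus both parameters set to UTF-8) is sketched below. It assumes, based on the source reading earlier in this report and not on anything verified here, that incoming bytes are labelled with from_encoding and transcoded only when the two encodings differ. The helper convert_line is hypothetical and is not part of Fluentd.

```ruby
# Hypothetical helper, NOT Fluentd's actual code: label the raw bytes with
# from_encoding, then transcode only when the target encoding differs.
def convert_line(bytes, from_encoding, encoding)
  from = Encoding.find(from_encoding)
  to = Encoding.find(encoding)
  line = bytes.dup.force_encoding(from)
  return line if from == to
  # Assumption: unmappable bytes are replaced instead of raising.
  line.encode(to, from, invalid: :replace, undef: :replace)
end

raw = "Testing Z\xC3\xBCrich".b

# Both parameters set to UTF-8: the bytes are only relabelled.
same = convert_line(raw, "UTF-8", "UTF-8")

# Only encoding set, with from_encoding assumed to be ASCII-8BIT:
# the non-ASCII bytes cannot be mapped and are replaced with U+FFFD.
default_from = convert_line(raw, "ASCII-8BIT", "UTF-8")

puts same
puts default_from
```

Under these assumptions, setting both parameters to the same value skips transcoding entirely, which would explain why the characters survive even though the warning says "No effect".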
Documentation not clear or wrong
Another possibility is that Fluentd works as expected but the documentation is unclear or wrong.
Thanks for your report! Obviously the documentation is wrong.
- The first bullet is incorrect: when only the encoding parameter is set, in_tail tries to convert the input string from ASCII-8BIT to encoding.
  - Ruby tries to convert the original string from ASCII-8BIT to UTF-8 before converting it to encoding.
- The second bullet is correct.
What do you mean when you say the following?
- Ruby tries to convert the original string from ASCII-8BIT to UTF-8 before converting it to encoding.
Do you mean there are two encoding processes?
If I'm not wrong, by default both from_encoding and encoding are ASCII-8BIT, so by default the encode function is not called.
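The question about two encoding processes can be explored directly in Ruby, independent of Fluentd. The sketch below asks Ruby which transcoding steps it would chain for a given conversion, using Encoding::Converter.search_convpath. The target ISO-8859-1 is only an example picked for illustration.

```ruby
# Ask Ruby which transcoding steps it would chain for a conversion.
def conversion_steps(from, to)
  Encoding::Converter.search_convpath(from, to)
end

# ASCII-8BIT to UTF-8: Ruby has a direct transcoder for this pair.
direct = conversion_steps("ASCII-8BIT", "UTF-8")

# ASCII-8BIT to a non-UTF-8 target: Ruby routes the conversion via UTF-8.
chained = conversion_steps("ASCII-8BIT", "ISO-8859-1")

[direct, chained].each do |path|
  puts path.map { |src, dst| "#{src.name} -> #{dst.name}" }.join(", ")
end
```

This should report a single step for the first call and two steps, with UTF-8 in the middle, for the second. If so, the intermediate UTF-8 step only appears when encoding is something other than UTF-8; with encoding set to UTF-8, as in this report, there would be one conversion, not two.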