Broken Connection Errors Resolved with Keep-Alive Config
Bug Report
Some plugins, such as the CloudWatch output, produce broken connection errors from time to time. We found that in some cases these errors can be resolved by setting net.keepalive off on the output plugin, or by setting net.keepalive_max_recycle to a low value (for example 10; try to find the largest value that does not produce errors). For example:
[OUTPUT]
    Name cloudwatch_logs
    # Add the following line
    net.keepalive off
or
[OUTPUT]
    Name cloudwatch_logs
    # Add the following line
    net.keepalive_max_recycle 10
It appears that this may be because the connection kept open to the external API eventually goes stale after being recycled several times. It is not yet clear whether this issue is resolved in the latest release, 1.8.12.
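If keep-alive is to stay enabled, net.keepalive_idle_timeout (a documented Fluent Bit networking option) can be combined with the recycle limit; the values below are only an illustration, not a tested recommendation:
[OUTPUT]
    Name cloudwatch_logs
    # Keep connection reuse, but bound how long a pooled connection may sit
    # idle and how many times it may be reused before it is re-created.
    net.keepalive on
    net.keepalive_idle_timeout 10
    net.keepalive_max_recycle 10
The idea is simply to retire a pooled connection before the remote endpoint silently closes it.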
Describe the bug
Errors such as the following occur:
[2022/01/27 23:00:17] [error] [http_client] broken connection to logs.us-east-1.amazonaws.com:443
To Reproduce
Still working on a minimal reproduction of the bug.
Expected behavior
It is not clear what the expected behavior should be for a broken pipe error on a recycled keep-alive connection. Potentially the connection should be replaced, though it is not clear whether that would disrupt state.
Your Environment
- Version used: 1.8.9
- Configuration:
[SERVICE]
    Flush 1
    Grace 30
    Log_Level info
[INPUT]
    Name tcp
    Tag ApplicationLogs
    Listen 0.0.0.0
    Port 5170
    Format none
[INPUT]
    Name tcp
    Tag RequestLogs
    Listen 0.0.0.0
    Port 5171
    Format none
[INPUT]
    Name tail
    Tag dd
    Path x
    Rotate_Wait 15
    Multiline On
    Parser_Firstline QueryLogSeparator
    Parser_1 QueryLog
[OUTPUT]
    Name cloudwatch_logs
    Match xx-*
    region ${LOG_REGION}
    log_group_name x
    log_stream_prefix x
    auto_create_group false
[OUTPUT]
    Name cloudwatch_logs
    Match aa
    region x
    log_group_name x
    log_stream_prefix x
    log_key x
    auto_create_group false
[OUTPUT]
    Name cloudwatch_logs
    Match bb
    region x
    log_group_name x
    log_stream_prefix x
    log_key log
    auto_create_group false
[OUTPUT]
    Name cloudwatch_logs
    Match cc
    region x
    log_group_name x
    log_stream_prefix x
    log_key log
    auto_create_group false
- Environment name and version (e.g. Kubernetes? What version?):
- Server type and version:
- Operating System and version: Amazon Linux
- Filters and plugins:
Additional context
The client is experiencing a large number of broken connection errors when using the CloudWatch output.
Note: I sent a patch to add a unit test for flb_upstream: https://github.com/fluent/fluent-bit/pull/4756
A test case, upstream_keepalive_multi_thread, is disabled because it causes SIGSEGV.
The test case sets net.keepalive to FLB_TRUE; it does not cause SIGSEGV when net.keepalive is FLB_FALSE.
I also think the keep-alive config has a bug.
It is fixed; it was a bug in my code.
The other thread has to call flb_engine_evl_init() and flb_engine_evl_set() to initialize the event loop (evl).
https://github.com/fluent/fluent-bit/pull/4756#issuecomment-1030775820
Hi @matthewfala, is this still reproducible in the latest versions (either 1.8.15 or 1.9.2)? We've resolved some connection-related issues (see https://github.com/fluent/fluent-bit/issues/4505), so please let us know if this is still reproducible.
@lecaros, we just experienced this issue last night with at least one fluent-bit v1.9.1 instance in our on-prem Kubernetes cluster. It just hangs in re-schedule/retry mode and has not recovered until now. We also had some internal firewall/network maintenance around that time.
The related log:
...
[2022/05/04 15:38:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1048752 watch_fd=228 name=/var/log/containers/XXX-d8589d875-dbx9z_default_XXX-973117445fa0204a7016bbfd6a4bfd396311048522434dd1c04136e9e65265bb.log
[2022/05/04 20:59:04] [error] [http_client] broken connection to our-internal-loki:3100 ?
[2022/05/04 20:59:04] [error] [output:loki:loki.0] could not flush records to our-internal-loki:3100 (http_do=-1)
[2022/05/04 20:59:04] [ warn] [engine] failed to flush chunk '1-1651697065.510035470.flb', retry in 6 seconds: task_id=96, input=tail.0 > output=loki.0 (out_id=0)
[2022/05/04 20:59:09] [ info] [task] re-schedule retry=0x7f2a81c6cf40 96 in the next 7 seconds
[2022/05/04 20:59:16] [ info] [task] re-schedule retry=0x7f2a81c6cf40 96 in the next 6 seconds
[2022/05/04 20:59:22] [ info] [task] re-schedule retry=0x7f2a81c6cf40 96 in the next 8 seconds
...
Our OUTPUT section looks like this:
[OUTPUT]
    name loki
    match *_default_*
    host our-internal-loki
    port 3100
    labels k8s_cluster=XXX, k8s_ns=$kubernetes['namespace_name'], k8s_host=$kubernetes['host'], k8s_pod_name=$kubernetes['pod_name'], k8s_pod_id=$kubernetes['pod_id'], k8s_container_name=$kubernetes['container_name']
    auto_kubernetes_labels on
    storage.total_limit_size 500M
I finally killed the pod and now it's catching up with the container logs.
Anyhow, I am planning to update to the latest v1.9.3.
I still see this issue, with endless re-schedule retry=0xXXX ## messages, again in several instances of flb in our clusters, even with 1.9.3 and this configuration:
[OUTPUT]
    name loki
    match *_default_*
    host our-internal-loki
    port 3100
    labels k8s_cluster=XXX, k8s_ns=$kubernetes['namespace_name'], k8s_host=$kubernetes['host'], k8s_pod_name=$kubernetes['pod_name'], k8s_pod_id=$kubernetes['pod_id'], k8s_container_name=$kubernetes['container_name']
    auto_kubernetes_labels on
    storage.total_limit_size 500M
    net.keepalive on
    net.keepalive_max_recycle 64
I am now trying to disable net.keepalive completely (sketched below). I also found some network-related issues on other flb instances at around the same time these issues start, e.g. the target Loki instance responding with a 500 error. That still would not explain why a few flb instances do not recover after such an incident.
Still, I am not sure whether my issue is more related to https://github.com/fluent/fluent-bit/issues/5217, or whether both issues have the same root cause.
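For reference, the variant I am testing now just replaces the keep-alive tuning in the output above, roughly like this (all other lines unchanged):
[OUTPUT]
    name loki
    match *_default_*
    host our-internal-loki
    port 3100
    # open a fresh connection for each flush instead of reusing pooled ones
    net.keepalive off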
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
I needed to change net.keepalive_idle_timeout to resolve the issue.
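The value used is not stated here. Purely as an illustration of the direction that matches the stale-connection theory above, one would lower the idle timeout below whatever idle limit the remote endpoint or an intermediate proxy/firewall enforces, e.g. on the CloudWatch output from the original report:
[OUTPUT]
    Name cloudwatch_logs
    net.keepalive on
    # illustrative value only: retire idle pooled connections quickly so a
    # connection the server has already closed is not reused
    net.keepalive_idle_timeout 10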
What value did you change it to? Did you increase or decrease it?