
Broken Connection Errors Resolved with Keepalive Config

Open matthewfala opened this issue 4 years ago • 7 comments

Bug Report

Some plugins, such as CloudWatch, produce broken connection errors from time to time. We found that in some cases these errors can be resolved by setting net.keepalive off on the output plugin, or by setting net.keepalive_max_recycle to a low value (for example 10, or some other number; try to find the largest value that does not produce errors).

For example:

[OUTPUT]
    Name              cloudwatch_logs
    # Add the following line
    net.keepalive              off

or

[OUTPUT]
    Name              cloudwatch_logs
    # Add the following line
    net.keepalive_max_recycle              10

It appears that this may be because the connection kept open to the external API eventually goes stale after being recycled several times. It is not yet clear whether this issue is resolved in the latest release, 1.8.12.
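For reference, a more complete output block with the workaround applied could look like the following. The Match pattern, region, log group, and stream prefix are placeholders, and the keepalive values are only illustrative:

[OUTPUT]
    Name                      cloudwatch_logs
    Match                     *
    region                    us-east-1
    log_group_name            my-log-group
    log_stream_prefix         my-stream-
    auto_create_group         false
    # Workaround: either disable keepalive entirely ...
    net.keepalive             off
    # ... or keep it on and recycle connections after a few uses instead:
    # net.keepalive_max_recycle 10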

Describe the bug

Errors such as the following occur

[2022/01/27 23:00:17] [error] [http_client] broken connection to logs.us-east-1.amazonaws.com:443

To Reproduce

Still working on a minimal reproduction of the bug.

Expected behavior

It is not clear what the expected behavior should be when a broken pipe error occurs on a recycled keepalive connection. Potentially the connection should be replaced, though it is not clear whether that would disrupt state.

Screenshots

Your Environment

  • Version used: 1.8.9
  • Configuration:
[SERVICE]
    Flush        1
    Grace        30
    Log_Level    info

[INPUT]
    Name        tcp
    Tag         ApplicationLogs
    Listen      0.0.0.0
    Port        5170
    Format      none

[INPUT]
    Name        tcp
    Tag         RequestLogs
    Listen      0.0.0.0
    Port        5171
    Format      none

[INPUT]
    Name             tail
    Tag              dd
    Path             x
    Rotate_Wait      15
    Multiline        On
    Parser_Firstline QueryLogSeparator
    Parser_1         QueryLog

[OUTPUT]
    Name              cloudwatch_logs
    Match             xx-*
    region            ${LOG_REGION}
    log_group_name    x
    log_stream_prefix x
    auto_create_group false

[OUTPUT]
    Name              cloudwatch_logs
    Match             aa
    region            x
    log_group_name    x
    log_stream_prefix x
    log_key           x
    auto_create_group false

[OUTPUT]
    Name              cloudwatch_logs
    Match             bb
    region            x
    log_group_name    x
    log_stream_prefix x
    log_key           log
    auto_create_group false

[OUTPUT]
    Name              cloudwatch_logs
    Match             cc
    region            x
    log_group_name    x
    log_stream_prefix x
    log_key           log
    auto_create_group false

  • Environment name and version (e.g. Kubernetes? What version?):
  • Server type and version:
  • Operating System and version: Amazon Linux
  • Filters and plugins:

Additional context

A client is experiencing a large number of broken connection errors when using the CloudWatch output.

matthewfala avatar Jan 28 '22 02:01 matthewfala

Note: I sent a patch to add a unit test for flb_upstream. https://github.com/fluent/fluent-bit/pull/4756

A test case, upstream_keepalive_multi_thread, is disabled since it causes SIGSEGV. In that test case, net.keepalive is FLB_TRUE. It will not cause SIGSEGV if net.keepalive is FLB_FALSE.

I also think the keepalive config has a bug.

nokute78 avatar Feb 06 '22 07:02 nokute78

It will not cause SIGSEGV if net.keepalive is FLB_FALSE.

It is fixed; it was a bug in my code. The other thread has to call flb_engine_evl_init() and flb_engine_evl_set() to initialize the event loop. https://github.com/fluent/fluent-bit/pull/4756#issuecomment-1030775820

nokute78 avatar Feb 06 '22 08:02 nokute78
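A minimal sketch of the fix described in the previous comment, assuming the declarations from flb_engine.h and monkey's mk_core; the event loop size and the test body are placeholders:

#include <fluent-bit/flb_engine.h>
#include <monkey/mk_core.h>

/* Each worker thread creates and registers its own event loop before
 * exercising the upstream keepalive code path. */
static void *worker_thread(void *arg)
{
    struct mk_event_loop *evl;

    flb_engine_evl_init();            /* initialize the thread-local slot */
    evl = mk_event_loop_create(256);  /* per-thread event loop            */
    flb_engine_evl_set(evl);          /* make it visible to flb_upstream  */

    /* ... run the keepalive test body here ... */

    mk_event_loop_destroy(evl);
    return NULL;
}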

Hi @matthewfala, is this still reproducible in the latest versions? (either 1.8.15 or 1.9.2) We've resolved some connection-related issues (see https://github.com/fluent/fluent-bit/issues/4505), so please let us know if this is still reproducible.

lecaros avatar Apr 12 '22 21:04 lecaros

@lecaros, we just experienced this issue last night with at least one Fluent Bit v1.9.1 instance in our on-prem Kubernetes cluster. It just hangs in re-schedule/retry mode and has not recovered so far. We also had some internal firewall/network maintenance around that time.

The related log:

...
[2022/05/04 15:38:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1048752 watch_fd=228 name=/var/log/containers/XXX-d8589d875-dbx9z_default_XXX-973117445fa0204a7016bbfd6a4bfd396311048522434dd1c04136e9e65265bb.log
[2022/05/04 20:59:04] [error] [http_client] broken connection to our-internal-loki:3100 ?
[2022/05/04 20:59:04] [error] [output:loki:loki.0] could not flush records to our-internal-loki:3100 (http_do=-1)
[2022/05/04 20:59:04] [ warn] [engine] failed to flush chunk '1-1651697065.510035470.flb', retry in 6 seconds: task_id=96, input=tail.0 > output=loki.0 (out_id=0)
[2022/05/04 20:59:09] [ info] [task] re-schedule retry=0x7f2a81c6cf40 96 in the next 7 seconds
[2022/05/04 20:59:16] [ info] [task] re-schedule retry=0x7f2a81c6cf40 96 in the next 6 seconds
[2022/05/04 20:59:22] [ info] [task] re-schedule retry=0x7f2a81c6cf40 96 in the next 8 seconds
...

Our OUTPUT looks like so:

    [OUTPUT]
        name                      loki
        match                     *_default_*
        host                      our-internal-loki
        port                      3100
        labels                    k8s_cluster=XXX, k8s_ns=$kubernetes['namespace_name'], k8s_host=$kubernetes['host'], k8s_pod_name=$kubernetes['pod_name'], k8s_pod_id=$kubernetes['pod_id'], k8s_container_name=$kubernetes['container_name']
        auto_kubernetes_labels    on
        storage.total_limit_size  500M

I finally killed the pod and now it's catching up with the container logs.

Anyhow, I am planning to update to the latest v1.9.3.

mhoyer avatar May 05 '22 10:05 mhoyer

I still see this issue, with endless re-schedule retry=0xXXX ## messages, in several flb instances in our clusters, even with 1.9.3 and this configuration:

   [OUTPUT]
        name                      loki
        match                     *_default_*
        host                      our-internal-loki
        port                      3100
        labels                    k8s_cluster=XXX, k8s_ns=$kubernetes['namespace_name'], k8s_host=$kubernetes['host'], k8s_pod_name=$kubernetes['pod_name'], k8s_pod_id=$kubernetes['pod_id'], k8s_container_name=$kubernetes['container_name']
        auto_kubernetes_labels    on
        storage.total_limit_size  500M
        net.keepalive             on
        net.keepalive_max_recycle 64

I am now trying to disable net.keepalive completely. I also found some network-related issues on other flb instances at around the time these problems start, e.g. the target Loki instance responding with a 500 error. That still would not explain why a few flb instances do not recover after such an incident.

mhoyer avatar May 12 '22 11:05 mhoyer
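For reference, disabling keepalive on that output amounts to replacing the two net.* lines in the configuration above with a single setting; a sketch using the same placeholder host (the labels line is omitted for brevity):

    [OUTPUT]
        name                      loki
        match                     *_default_*
        host                      our-internal-loki
        port                      3100
        auto_kubernetes_labels    on
        storage.total_limit_size  500M
        # Open a fresh connection for every flush instead of reusing sockets
        net.keepalive             off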

Still, I am not sure whether my issue is more closely related to https://github.com/fluent/fluent-bit/issues/5217, or whether both issues share the same root cause.

mhoyer avatar May 12 '22 11:05 mhoyer

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Aug 11 '22 02:08 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Aug 16 '22 02:08 github-actions[bot]

I needed to change net.keepalive_idle_timeout to resolve the issue.

sosheskaz avatar Jun 02 '23 21:06 sosheskaz
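For reference, net.keepalive_idle_timeout is set in the same net.* block of the output. The comment does not say which value was used, so the number below is only illustrative; the assumption here is that keeping Fluent Bit's idle timeout below the server-side (or proxy) idle timeout lets the client close idle connections before the remote end silently drops them:

[OUTPUT]
    Name                        cloudwatch_logs
    # ... other output settings ...
    # Illustrative value in seconds; the default is 30
    net.keepalive_idle_timeout  10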

I needed to change net.keepalive_idle_timeout to resolve the issue.

What value did you change it to? Did you increase or decrease it?

captainpro-eng avatar Feb 29 '24 09:02 captainpro-eng