Fluent Bit v1.8.15 and v1.9.3: azure output on Windows fails to connect to Log Analytics
Bug Report
Describe the bug
When running Fluent Bit 1.8.15 or 1.9.3 in Windows containers (Windows Server 2019 Datacenter 10.0.17763.2686, containerd://1.6.1) with Kubernetes 1.22.8, the azure output plugin reports connection errors and fails to send data to Log Analytics.
The same error appears with both the servercore and nanoserver images, and when running as either ContainerUser or ContainerAdministrator.
To Reproduce
- Example log output:
[2022/04/28 15:16:25] [debug] [input chunk] update output instances with new chunk size diff=1027
[2022/04/28 15:16:25] [debug] [task] created task=0000029E74F3BF80 id=0 OK
[2022/04/28 15:16:25] [error] [tls] error: unexpected EOF
[2022/04/28 15:16:25] [debug] [upstream] connection #1116 failed to 3863cb67-6c46-4780-854d-5737842a4d18.ods.opinsights.azure.com:443
[2022/04/28 15:16:25] [debug] [out flush] cb_destroy coro_id=0
[2022/04/28 15:16:25] [debug] [retry] new retry created for task_id=0 attempts=1
[2022/04/28 15:16:25] [ warn] [engine] failed to flush chunk '5548-1651158984.511061600.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=azure.0 (out_id=0)
[2022/04/28 15:16:26] [debug] [input chunk] update output instances with new chunk size diff=1027
[2022/04/28 15:16:26] [debug] [task] created task=0000029E74F3B180 id=1 OK
[2022/04/28 15:16:26] [error] [tls] error: unexpected EOF
[2022/04/28 15:16:26] [debug] [upstream] connection #1152 failed to 3863cb67-6c46-4780-854d-5737842a4d18.ods.opinsights.azure.com:443
[2022/04/28 15:16:26] [debug] [out flush] cb_destroy coro_id=1
[2022/04/28 15:16:26] [debug] [retry] new retry created for task_id=1 attempts=1
[2022/04/28 15:16:26] [ warn] [engine] failed to flush chunk '5548-1651158986.5265500.flb', retry in 9 seconds: task_id=1, input=tail.0 > output=azure.0 (out_id=0)
[2022/04/28 15:16:27] [debug] [input chunk] update output instances with new chunk size diff=1027
[2022/04/28 15:16:27] [debug] [task] created task=0000029E74F3BD00 id=2 OK
[2022/04/28 15:16:27] [error] [tls] error: unexpected EOF
- Steps to reproduce the problem:
  - Start a plain Windows Server Core or Nano Server container
  - Download Fluent Bit as a zip file and extract it
  - Run Fluent Bit with the configuration below (a PowerShell sketch of these steps follows this list)
  - The error appears
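A minimal PowerShell sketch of the reproduction, assuming the official Windows zip distribution; the download URL, version, and archive layout here are assumptions, so adjust them to the actual release artifact:

# Download and extract the Fluent Bit Windows build (URL/paths are assumptions)
Invoke-WebRequest -Uri https://fluentbit.io/releases/1.9/fluent-bit-1.9.3-win64.zip -OutFile fluent-bit.zip
Expand-Archive -Path fluent-bit.zip -DestinationPath C:\fluent-bit
# Optional: confirm raw TCP reachability to the workspace endpoint first,
# to separate routing/DNS problems from the TLS handshake failure
# (Test-NetConnection is available in servercore, not nanoserver)
Test-NetConnection -ComputerName 3863cb67-6c46-4780-854d-5737842a4d18.ods.opinsights.azure.com -Port 443
# Run with the configuration shown under "Additional context" below
C:\fluent-bit\fluent-bit-1.9.3-win64\bin\fluent-bit.exe -c C:\fluent-bit\fluent-bit.conf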
Expected behavior
Output plugin should successfully send data to Log Analytics.
Your Environment
- Version used:
  - 1.8.15 and 1.9.3
- Configuration:
  - see Additional context below
- Environment name and version (e.g. Kubernetes? What version?):
  - Kubernetes 1.22.8
  - Windows Server 2019 Datacenter 10.0.17763.2686 containerd://1.6.1
- Server type and version:
  - N/A
- Operating System and version:
  - mcr.microsoft.com/windows/nanoserver:1809 runtime container
  - mcr.microsoft.com/windows/servercore:1809 runtime container
- Filters and plugins:
  - Input: tail
  - Filter: kubernetes
  - Output: azure
  - Parser: cri
Additional context

Config:
[SERVICE]
Flush 1
Log_Level trace
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
[INPUT]
Name tail
Tag kube.*
Path C:\\var\\log\\containers\\fluent-bit*.log
Parser cri
DB C:\\var\\flb\\tail_cri.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc.cluster.local:443
Kube_CA_File C:\\var\\run\\secrets\\kubernetes.io\\serviceaccount\\ca.crt
Kube_Token_File C:\\var\\run\\secrets\\kubernetes.io\\serviceaccount\\token
Kube_Tag_Prefix kube.C.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[OUTPUT]
Name azure
Match *
tls on
tls.debug 4
Customer_ID 3863cb67-6c46-4780-854d-5737842a4d18
Shared_Key <redacted>
[PARSER]
# http://rubular.com/r/tjUt3Awgg4
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
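For reference, a hedged example of the kind of CRI-formatted line the cri parser above is meant to match; the content is illustrative, not taken from the cluster:

2022-04-28T15:16:25.123456789+00:00 stdout F [info] example application message

The regex captures time (2022-04-28T15:16:25.123456789+00:00), stream (stdout), logtag (F), and message (the rest of the line).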
I am seeing a similar issue when trying to use the cloudwatch_logs output in a Windows-based Fluent Bit container on Kubernetes. I see the same [tls] error: unexpected EOF when trying to connect to AWS STS and CloudWatch.
I suspect something is going wrong when trying to negotiate TLS in Windows containers.
Fluent Bit version: 1.9.1
Windows OS (K8s node): Server 2019
Image base: mcr.microsoft.com/windows/servercore:ltsc2019
As a workaround I configured the output with tls.verify Off. Not optimal, but it gets the job done for now.
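For anyone else landing here, a minimal sketch of that workaround applied to the azure output from the original report; tls.verify Off disables certificate verification entirely, so treat it as a stopgap rather than a fix (the Customer_ID/Shared_Key placeholders are illustrative):

[OUTPUT]
    Name        azure
    Match       *
    tls         on
    # Workaround: skip TLS certificate verification (insecure; temporary)
    tls.verify  off
    Customer_ID <workspace-id>
    Shared_Key  <redacted>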
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Please remove the stale label.
> As a workaround I configured the output with tls.verify Off. Not optimal, but it gets the job done for now.
@desek and @bryangardner I am facing a similar issue on a Windows node in an EKS cluster. The Fluent Bit logs are similar to https://github.com/fluent/fluent-bit/issues/4727
I tried tls.verify Off in the output but the errors persist. Any suggestions for a workaround?
{"log":"[2022/08/11 07:18:33] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with Kubelet...\r\n","stream":"stderr","time":"2022-08-11T07:18:33.5290575Z"} {"log":"[2022/08/11 07:18:33] [debug] [filter:kubernetes:kubernetes.0] Send out request to Kubelet for pods information.\r\n","stream":"stderr","time":"2022-08-11T07:18:33.5296845Z"} {"log":"[2022/08/11 07:18:34] [error] [tls] C:\src\src\tls\mbedtls.c:390 NET - Sending information through the socket failed\r\n","stream":"stderr","time":"2022-08-11T07:18:34.5385134Z"} {"log":"[2022/08/11 07:18:34] [debug] [upstream] connection #792 failed to 127.0.0.1:10250\r\n","stream":"stderr","time":"2022-08-11T07:18:34.5385134Z"} {"log":"[2022/08/11 07:18:34] [error] [filter:kubernetes:kubernetes.0] kubelet upstream connection error\r\n","stream":"stderr","time":"2022-08-11T07:18:34.5385134Z"} {"log":"[2022/08/11 07:18:34] [ warn] [filter:kubernetes:kubernetes.0] could not get meta for POD fluent-bit-windows-92pjh\r\n","stream":"stderr","time":"2022-08-11T07:18:34.5385134Z"}
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Please re-open; I am seeing the same issue when using 2.0.5 on Windows 2022 nodes in AKS. Disabling tls.verify works and logs are pushed to LAW.
Kubernetes version: 1.24.6
Node image: AKSWindows-2022-containerd-20348.1131.221019
Image: ghcr.io/fluent/fluent-bit/staging:windows-2022-2.0.5