fluent-bit crash when reading log content containing the character 'NUL'
Bug Report
Describe the bug
Fluent Bit was running as a pod in Kubernetes; I use it to collect logs from some log files. Sometimes it crashed, and after some debugging I found it may be due to a special character in the log content.
To Reproduce
The log generator script is an ordinary one:
#!/bin/bash

show_help() {
    echo "Usage: $0 [NUM_LINES]"
    echo "If NUM_LINES is not provided, it will default to print 100 lines of logs."
    echo "NUM_LINES should be a positive integer representing the number of log lines to print."
    exit 1
}

if [ $# -gt 1 ]; then
    show_help
fi

if [ -z "$1" ]; then
    LOG_LINES=100
else
    if [[ "$1" =~ ^[0-9]+$ ]]; then
        LOG_LINES=$1
    else
        echo "Error: The parameter should be a positive integer."
        show_help
    fi
fi

NODE_IP=$(hostname | awk '{print $1}')
POD_NAME=${POD:-"unknown-pod"}
extra_info="This is extra info for log line. Some more details here heiheihheiheihheiheiheiheiheiheiheiheiheiheiheiheiheiheiheiheihei."

for ((i = 0; i < LOG_LINES; i++)); do
    TIMESTAMP=$(date +"%Y-%m-%d")
    echo "$TIMESTAMP | Node IP: $NODE_IP | Count: $i | $extra_info"
    sleep 2
done
Most of the logs are normal.
Before fluent-bit crashed, the last log line showed a long run of 'NUL' bytes where normal text should have been ('This is extra info for log line. Some more d').
The original log file is large, so I cannot attach it here.
Screenshots
And I cannot find any '0000' when I view the log file in vim with :%!xxd.
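For what it's worth, NUL bytes are easier to confirm from the shell than by eyeballing xxd output; a couple of quick checks (app.log stands in for the file fluent-bit was tailing):

# count NUL (0x00) bytes in the file; 0 means there are none
tr -cd '\0' < app.log | wc -c
# show the first hex-dump line containing a 00 byte (od prints one byte per column)
od -An -tx1 app.log | grep -m1 ' 00'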
Your Environment
- Version used: 3.1.5
- Configuration:
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFluentBitConfig
metadata:
  name: fluent-bit-slurm-config
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  service:
    #logLevel: trace # off, error, warn, info, debug, and trace
    parsersFile: parsers.conf
    httpServer: true
    flushSeconds: 1 # interval to flush output
    storage:
      backlogMemLimit: 5M
      path: /fluent-bit/tail/slurm.storage.backlog
      sync: normal
      maxChunksUp: 128
  inputSelector:
    matchLabels:
      name: slurm-log
      component: studio
  parserSelector:
    matchLabels:
      name: slurm-log
      component: studio
  filterSelector:
    matchLabels:
      name: slurm-log
      component: studio
  outputSelector:
    matchLabels:
      name: slurm-log
      component: studio
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: FluentBit
metadata:
  name: fluent-slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  image: repo-addr/fluent/fluent-operator/fluent-bit:3.1.5
  ports: []
  resources:
    requests:
      cpu: 500m
      memory: 250Mi
    limits:
      cpu: "2"
      memory: 4Gi
  fluentBitConfigName: fluent-bit-slurm-config
  nodeSelector:
    kubernetes.io/hostname: 10.31.19.40 # run on one node
  tolerations:
  - operator: Exists
  volumesMounts:
  - mountPath: /slurm
    name: logs
  - mountPath: /etc/localtime
    name: localtime
  positionDB:
    hostPath:
      path: /path/fluent-bit/ # directory where the tail plugin DB is stored; adjust as needed
  volumes:
  - hostPath:
      path: /path/logs/hpc # host path of the logs; adjust as needed
    name: logs
  - hostPath:
      path: /etc/localtime
    name: localtime
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes
- Server type and version:
# kubectl version
Client Version: v1.28.4-r0-28.0.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6-r0-28.0.39.7
- Operating System and version: Huawei Cloud EulerOS 2.0 (x86_64)
- Filters and plugins:
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterInput
metadata:
  name: slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  logLevel: debug # off;error;warning;info;debug;trace
  tail:
    tag: "slurm.*"
    path: "/slurm/hpc-job-*/hpc-job-*/slurm-*"
    pathKey: "logPath"
    key: message
    bufferChunkSize: 32k
    bufferMaxSize: 128k
    multiline: false
    refreshIntervalSeconds: 1 # interval (seconds) to refresh the list of watched files
    skipLongLines: true
    db: /fluent-bit/tail/slurm.db
    dbSync: Normal
    storageType: filesystem
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: slurm-fluent-bit-lua
  labels:
    name: slurm-log
    component: studio
  namespace: fluent
data:
  slurm-log.lua: |
    function extract_fields(tag, timestamp, record)
        local jobId, pod = string.match(tag, "slurm%.slurm%.([^%.]+)%.([^%.]+)%.*")
        if jobId and pod then
            record["job-id"] = jobId
            record["pod"] = pod
            return 2, timestamp, record
        end
        return -1, timestamp, record
    end
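Incidentally, the tag pattern in extract_fields can be sanity-checked outside the pipeline with a standalone Lua interpreter; the sample tag below is made up, shaped like what the tail input above generates:

lua - <<'EOF'
-- hypothetical tag: tail expands the file path into dot-separated components
local tag = "slurm.slurm.job42.pod-0.log"
local jobId, pod = string.match(tag, "slurm%.slurm%.([^%.]+)%.([^%.]+)%.*")
print(jobId, pod)  -- expected output: job42   pod-0
EOF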
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFilter
metadata:
  name: slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  matchRegex: "slurm.*"
  filters:
  - lua:
      script:
        key: slurm-log.lua
        name: slurm-fluent-bit-lua
      call: extract_fields
      timeAsTable: true
---
Additional context
I deleted the file's entry from the db, removed the storage backlog, and renamed the logfile so it no longer matched the path pattern; after restarting fluent-bit everything was fine.
Then I renamed this logfile back, and everything still went fine.
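For reference, the tail position db is a plain SQLite file, so the "delete it from the db" step can be done like this (a sketch: in_tail_files is the table recent fluent-bit versions use, and the LIKE pattern is a placeholder for the problematic file):

# list the files tail is tracking and their byte offsets
sqlite3 /fluent-bit/tail/slurm.db 'SELECT name, offset, inode FROM in_tail_files;'
# remove the entry for the problematic file so it is re-read from scratch
sqlite3 /fluent-bit/tail/slurm.db "DELETE FROM in_tail_files WHERE name LIKE '%slurm-bad%';"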
Now I have found that it crashes when flushing to Kafka:
I know how to reproduce it:
- run a container or pod which just cats the attached file
- make fluent-bit scrape and parse it from the CRI (containerd) logs
- it will crash; some of the logs look like this:
To offer more information, my fluent-bit config is as follows:
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterInput
metadata:
  name: train
  namespace: fluent
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
spec:
  logLevel: debug # off;error;warning;info;debug;trace
  tail:
    tag: "train.*"
    path: "/var/log/containers/job-*.log"
    pathKey: "logPath"
    multiline: false
    parser: "cri"
    refreshIntervalSeconds: 5 # interval (seconds) to refresh the list of watched files
    skipLongLines: true
    db: /fluent-bit/tail/train.db
    dbSync: Normal
    storageType: filesystem
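For context, the cri parser consumes containerd's log line format: an RFC3339Nano timestamp, the stream name, a P/F partial-line flag, then the message. A poisoned line can be appended to a watched file like this (the file name is purely illustrative, chosen to match the job-*.log glob):

# CRI-format line whose message body contains a run of NUL bytes
printf '2024-08-20T12:00:00.000000000Z stdout F Some more d\x00\x00\x00\x00\n' \
  >> /var/log/containers/job-demo.log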
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: train-fluent-bit-lua
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
  namespace: fluent
data:
  train-log.lua: |
    function filterLog(tag, timestamp, record)
        local pod, namespace = string.match(tag, "train%.var%.log%.containers%.([^_]+)_([^_]+)_*")
        if pod and namespace then
            record["namespace"] = namespace
            record["pod"] = pod
            -- remove the CRI logtag field
            record["logtag"] = nil
            return 2, timestamp, record
        else
            -- Return codes: -1 drops the record; 0 keeps it unmodified;
            -- 1 means both timestamp and record were modified and are replaced
            -- by the second and third return values;
            -- 2 means only the record was modified (the timestamp is kept).
            return -1, timestamp, record
        end
    end
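The same kind of offline check works for filterLog's pattern; the second tag below shows the non-matching case that makes filterLog return -1 and drop the record (both tags are made up):

lua - <<'EOF'
local pat = "train%.var%.log%.containers%.([^_]+)_([^_]+)_*"
-- shaped like tail's expansion of /var/log/containers/<pod>_<namespace>_<container>-<id>.log
print(string.match("train.var.log.containers.job-demo_default_job-stdout-0123.log", pat))
print(string.match("train.var.log.containers.no-underscores.log", pat))  -- nil: record dropped
EOF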
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFilter
metadata:
  name: train
  namespace: fluent
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
spec:
  matchRegex: "train.*"
  #logLevel: debug # off;error;warning;info;debug;trace
  filters:
  - lua:
      script:
        key: train-log.lua
        name: train-fluent-bit-lua
      call: filterLog
      timeAsTable: true
  - modify:
      rules:
      - add:
          node: ${HOST_IP}
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterOutput
metadata:
  name: train-stdout
  namespace: fluent
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
spec:
  matchRegex: "train.*"
  stdout:
    format: json_lines
    jsonDateFormat: iso8601
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterOutput
metadata:
  name: train-kafka
  namespace: fluent
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
spec:
  matchRegex: "train.*"
  kafka:
    brokers: kafka-controller-0.kafka-controller-headless.kafka.svc.cluster.local:9092,kafka-controller-1.kafka-controller-headless.kafka.svc.cluster.local:9092,kafka-controller-2.kafka-controller-headless.kafka.svc.cluster.local:9092
    rdkafka:
      sasl.mechanisms: SCRAM-SHA-512
      sasl.password: xxx
      sasl.username: xxx
      security.protocol: sasl_plaintext
    timestampFormat: iso8601
    topics: train-logs
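To rule out the broker side, connectivity and the SASL/SCRAM credentials can be checked from inside the cluster with kcat; a sketch, where the image tag and the xxx placeholders are assumptions:

# -C consumes, -o -5 starts five messages from the end, -e exits at end of partition
kubectl -n fluent run kcat --rm -it --restart=Never --image=edenhill/kcat:1.7.1 -- \
  -b kafka-controller-0.kafka-controller-headless.kafka.svc.cluster.local:9092 \
  -X security.protocol=sasl_plaintext \
  -X sasl.mechanisms=SCRAM-SHA-512 \
  -X sasl.username=xxx -X sasl.password=xxx \
  -t train-logs -C -o -5 -e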
The Kafka cluster itself runs fine:
$ kubectl -n kafka get po
kafka-controller-0 2/2 Running 0 28d
kafka-controller-1 2/2 Running 0 28d
kafka-controller-2 2/2 Running 0 28d
The pod which prints this file:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: job-exception
  name: job-exception
spec:
  replicas: 1
  selector:
    matchLabels:
      app: job-stdout
  template:
    metadata:
      labels:
        app: job-stdout
    spec:
      terminationGracePeriodSeconds: 1
      containers:
      - command:
        - /bin/bash
        - -c
        - |
          cat /test/crash_when_cat_new.txt
          sleep infinity
        image: xxx:1.0.0
        name: job-stdout
        volumeMounts:
        - mountPath: /test
          name: logs
      volumes:
      - hostPath:
          path: /tmp # /tmp/crash_when_cat_new.txt
          type: ""
        name: logs
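Since the original attachment is not included here, a stand-in /tmp/crash_when_cat_new.txt with a run of embedded NUL bytes can be forged on the node; the real file's byte layout is unknown, so this only approximates it:

{
  printf 'This is extra info for log line. Some more details here.\n'
  # a line whose tail degenerates into 512 NUL bytes, like the corrupted record
  printf 'This is extra info for log line. Some more d'
  head -c 512 /dev/zero
  printf '\n'
} > /tmp/crash_when_cat_new.txt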
I have provided the steps to reproduce the issue; can anyone look into it?
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.