
fluent-bit crashes when reading log content containing the character 'NUL'

Open · FeiYing9 opened this issue 1 year ago · 1 comment

Bug Report

Describe the bug

Fluent Bit was running as a pod in Kubernetes; I use it to collect logs from some log files. Sometimes it crashed, and after some debugging I found it may be due to a special character in the log content.

To Reproduce

The log generator script is an ordinary one like this:

#!/bin/bash

show_help() {
    echo "Usage: $0 [NUM_LINES]"
    echo "If NUM_LINES is not provided, it will default to print 100 lines of logs."
    echo "NUM_LINES should be a positive integer representing the number of log lines to print."
    exit 1
}

if [ $# -gt 1 ]; then
    show_help
fi

if [ -z "$1" ]; then
    LOG_LINES=100
else
    if [[ "$1" =~ ^[0-9]+$ ]]; then
        LOG_LINES=$1
    else
        echo "Error: The parameter should be a positive integer."
        show_help
    fi
fi

NODE_IP=$(hostname -I | awk '{print $1}')   # first IP reported for the node
POD_NAME=${POD:-"unknown-pod"}


extra_info="This is extra info for log line. Some more details here heiheihheiheihheiheiheiheiheiheiheiheiheiheiheiheiheiheiheiheihei."

for ((i = 0; i < LOG_LINES; i++)); do
    TIMESTAMP=$(date +"%Y-%m-%d")
    echo "$TIMESTAMP | Node IP: $NODE_IP | Count: $i | $extra_info"
    sleep 2
done
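Run it with the output redirected to a file matched by the tail path in the config further down; the script filename and job directory here are placeholders:

bash log-gen.sh 1000 >> /path/logs/hpc/hpc-job-1/hpc-job-1/slurm-test.log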

Most of the logs are normal.

Before fluent-bit crashed, the last log showed lots of 'NUL' where there should be normal string characters ('This is extra info for log line. Some more d').

The original log file is big, so I cannot attach it here.

Screenshots

[screenshot: fluent-bit output showing the NUL characters]

And I cannot find any '0000' when I view the log file in vim via :%!xxd.

[screenshot: xxd view of the log file]
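A more direct check than scanning the xxd output is to count the NUL bytes; a minimal sketch, where logfile stands for the actual file path:

tr -dc '\0' < logfile | wc -c    # prints the number of 0x00 bytes in the file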

Your Environment

  • Version used: 3.1.5
  • Configuration:
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFluentBitConfig
metadata:
  name: fluent-bit-slurm-config
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  service:
    #logLevel: trace         # off, error, warn, info, debug, and trace
    parsersFile: parsers.conf
    httpServer: true
    flushSeconds: 1         # Interval to flush output
    storage:
      backlogMemLimit: 5M  
      path: /fluent-bit/tail/slurm.storage.backlog
      sync: normal      
      maxChunksUp: 128    
  inputSelector:
    matchLabels:
      name: slurm-log
      component: studio
  parserSelector:
    matchLabels:
      name: slurm-log
      component: studio
  filterSelector:
    matchLabels:
      name: slurm-log
      component: studio
  outputSelector:
    matchLabels:
      name: slurm-log
      component: studio
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: FluentBit
metadata:
  name: fluent-slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  image: repo-addr/fluent/fluent-operator/fluent-bit:3.1.5
  ports: []
  resources:
    requests:
      cpu: 500m
      memory: 250Mi
    limits:
      cpu: "2"
      memory: 4Gi
  fluentBitConfigName: fluent-bit-slurm-config
  nodeSelector:           
    kubernetes.io/hostname: 10.31.19.40              # run in one node
  tolerations:
    - operator: Exists
  volumesMounts:
    - mountPath: /slurm
      name: logs
    - mountPath: /etc/localtime
      name: localtime
  positionDB:
    hostPath:
      path: /path/fluent-bit/     # directory where the tail plugin DB is stored; adjust to your environment
  volumes:
    - hostPath:
        path: /path/logs/hpc # host path of the logs; adjust to your environment
      name: logs
    - hostPath:
        path: /etc/localtime
      name: localtime
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes
  • Server type and version:
# kubectl  version
Client Version: v1.28.4-r0-28.0.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6-r0-28.0.39.7
  • Operating System and version: Huawei Cloud EulerOS 2.0 (x86_64)
  • Filters and plugins:
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterInput
metadata:
  name: slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  logLevel: debug  # off;error;warning;info;debug;trace
  tail:
    tag: "slurm.*"
    path: "/slurm/hpc-job-*/hpc-job-*/slurm-*"
    pathKey: "logPath"
    key: message
    bufferChunkSize: 32k
    bufferMaxSize: 128k
    multiline: false
    refreshIntervalSeconds: 1   # refreshing the list of watched files in seconds
    skipLongLines: true
    db: /fluent-bit/tail/slurm.db
    dbSync: Normal
    storageType: filesystem
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: slurm-fluent-bit-lua
  labels:
    name: slurm-log
    component: studio
  namespace: fluent
data:
  slurm-log.lua: |
    function extract_fields(tag, timestamp, record)
      local jobId, pod = string.match(tag, "slurm%.slurm%.([^%.]+)%.([^%.]+)%.*")
      if jobId and pod then
        record["job-id"] = jobId
        record["pod"] = pod
        return 2, timestamp, record
      end
      return -1, timestamp, record
    end
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFilter
metadata:
  name: slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  matchRegex: "slurm.*"
  filters:
    - lua:
        script:
          key: slurm-log.lua
          name: slurm-fluent-bit-lua
        call: extract_fields
        timeAsTable: true
---
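As a quick sanity check of the tag pattern in slurm-log.lua above, it can be run against a sample tag with a standalone lua interpreter (the sample tag here is made up):

lua -e 'print(string.match("slurm.slurm.job123.pod-0.log", "slurm%.slurm%.([^%.]+)%.([^%.]+)%.*"))'
# expected output: job123  pod-0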

Additional context

FeiYing9 commented Dec 27 '24 03:12

I deleted the file's entry from the db, removed the storage backlog, and renamed the logfile so it no longer matched the tail path; everything went ok after restarting fluent-bit.

Then I renamed this logfile back, and everything still goes ok.
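For reference, a rough sketch of that cleanup, assuming fluent-bit's usual SQLite tail DB layout (the in_tail_files table) and the paths from my config; stop fluent-bit first:

# drop the offending file's row from the tail position DB
sqlite3 /fluent-bit/tail/slurm.db "DELETE FROM in_tail_files WHERE name LIKE '%slurm-%';"
# clear the filesystem storage backlog
rm -rf /fluent-bit/tail/slurm.storage.backlog/*
# then rename the logfile so the tail path no longer matches it, and restart fluent-bit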

FeiYing9 commented Dec 27 '24 05:12

Now I have found that it crashes when flushing to kafka:

[screenshot: crash when flushing to kafka]

I know how to reproduce it:

  1. run a container or pod which just cats the attached file (a sketch for generating a similar file is at the end of this comment):

crash_when_cat_new.txt

  2. make fluent-bit scrape and parse it from the cri (containerd) logs
  3. it will crash; some of the logs look like this:

[screenshot: crash logs]
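Since the attached file may be awkward to fetch, here is a minimal sketch of producing a log line with embedded NUL bytes; the exact byte layout of crash_when_cat_new.txt is my assumption:

{
  printf 'This is extra info for log line. Some more d'
  head -c 512 /dev/zero    # a run of 0x00 bytes in place of the rest of the line
  printf '\n'
} > /tmp/crash_when_cat_new.txt
cat /tmp/crash_when_cat_new.txt    # run this inside the pod so containerd captures it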

FeiYing9 commented Jan 24 '25 06:01

To offer more information, my fluent-bit config is like this:

apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterInput
metadata:
  name: train
  namespace: fluent
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
spec:
  logLevel: debug # off;error;warning;info;debug;trace
  tail:
    tag: "train.*"
    path: "/var/log/containers/job-*.log"
    pathKey: "logPath"
    multiline: false
    parser: "cri"
    refreshIntervalSeconds: 5   # refreshing the list of watched files in seconds
    skipLongLines: true
    db: /fluent-bit/tail/train.db
    dbSync: Normal
    storageType: filesystem
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: train-fluent-bit-lua
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
  namespace: fluent
data:
  train-log.lua: |
    function filterLog(tag, timestamp, record)
      local pod, namespace = string.match(tag, "train%.var%.log%.containers%.([^_]+)_([^_]+)_*")
      if pod and namespace then
        record["namespace"] = namespace
        record["pod"] = pod
        -- remove
        record["logtag"] = nil
        return 2, timestamp, record
      else
        -- return code semantics:
        --   -1: the record will be dropped
        --    0: the record will not be modified
        --    1: the original timestamp and record have been modified and must be
        --       replaced by the returned timestamp (2nd value) and record (3rd value)
        --    2: the timestamp is unchanged but the record has been modified and must
        --       be replaced by the returned record (3rd value)
        return -1, timestamp, record
      end
    end
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFilter
metadata:
  name: train
  namespace: fluent
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
spec:
  matchRegex: "train.*"
  #logLevel: debug # off;error;warning;info;debug;trace
  filters:
    - lua:
        script:
          key: train-log.lua
          name: train-fluent-bit-lua
        call: filterLog
        timeAsTable: true
    - modify:
        rules:
          - add:
              node: ${HOST_IP}
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterOutput
metadata:
  name: train-stdout
  namespace: fluent
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
spec:
  matchRegex: "train.*"
  stdout:
    format: json_lines
    jsonDateFormat: iso8601
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterOutput
metadata:
  name: train-kafka
  namespace: fluent
  labels:
    fluentbit.fluent.io/enabled: "true"
    fluentbit.fluent.io/component: logging
spec:
  matchRegex: "train.*"
  kafka:
    brokers: kafka-controller-0.kafka-controller-headless.kafka.svc.cluster.local:9092,kafka-controller-1.kafka-controller-headless.kafka.svc.cluster.local:9092,kafka-controller-2.kafka-controller-headless.kafka.svc.cluster.local:9092
    rdkafka:
      sasl.mechanisms: SCRAM-SHA-512
      sasl.password: xxx
      sasl.username: xxx
      security.protocol: sasl_plaintext
    timestampFormat: iso8601
    topics: train-logs

The kafka cluster itself runs well:

$ kubectl  -n kafka get po
kafka-controller-0                         2/2     Running                  0               28d
kafka-controller-1                         2/2     Running                  0               28d
kafka-controller-2                         2/2     Running                  0               28d
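Connectivity can also be verified from a debug pod with kcat, reusing the rdkafka SASL settings from the output config above:

kcat -L -b kafka-controller-0.kafka-controller-headless.kafka.svc.cluster.local:9092 \
  -X security.protocol=sasl_plaintext \
  -X sasl.mechanisms=SCRAM-SHA-512 \
  -X sasl.username=xxx -X sasl.password=xxx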

The pod which prints this file:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: job-exception
  name: job-exception
spec:
  replicas: 1
  selector:
    matchLabels:
      app: job-stdout
  template:
    metadata:
      labels:
        app: job-stdout
    spec:
      terminationGracePeriodSeconds: 1
      containers:
        - command:
            - /bin/bash
            - -c
            - |
              cat /test/crash_when_cat_new.txt
              sleep infinity
          image: xxx:1.0.0
          name: job-stdout
          volumeMounts:
            - mountPath: /test
              name: logs
      volumes:
        - hostPath:
            path: /tmp      # /tmp/crash_when_cat_new.txt
            type: ""
          name: logs

FeiYing9 commented Jan 24 '25 07:01

I have provided the steps to reproduce the issue; can anyone look into it?

FeiYing9 commented Feb 05 '25 02:02

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented Jul 14 '25 02:07

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] commented Jul 20 '25 02:07