Uptick in dropped events from disk buffer InvalidProtobufPayload errors
Problem
Since upgrading to 0.31 (from 0.29 on this instance), there has been a marked uptick in dropped events on my splunk_hec_logs sink, which is backed by a disk buffer. The error indicates the events are dropped due to an InvalidProtobufPayload error while reading from the disk buffer.
2023-08-01T13:18:46.116488Z ERROR sink{component_kind="sink" component_id=staging_splunk_hec component_type=splunk_hec_logs component_name=staging_splunk_hec}: vector_buffers::internal_events: Error encountered during buffer read. error=failed to decoded record: InvalidProtobufPayload error_code="decode_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
I can't find it at the moment, but I seem to remember another issue or discussion where the underlying protobuf library now enforces a 4 MB size limit and potentially truncates messages larger than that. Maybe that is related?
Configuration
data_dir: /vector-data-dir
acknowledgements:
  enabled: true
api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false
sources:
  kafka_in:
    type: kafka
    bootstrap_servers: kafka-kafka-bootstrap.kafka:9093
    group_id: '${KAFKA_CONSUMER_GROUP_ID}'
    topics:
      - ^[^_].+
    librdkafka_options:
      "topic.blacklist": "^strimzi.+"
    decoding:
      codec: json
    sasl:
      enabled: true
      mechanism: SCRAM-SHA-512
      username: '${KAFKA_CONSUMER_USERNAME}'
      password: '${KAFKA_CONSUMER_PASSWORD}'
transforms:
  msg_router:
    type: route
    inputs:
      - kafka_in
    route:
      staging: includes(array!(.destinations), "staging")
      # a few other routes
  staging_filter:
    type: filter
    inputs:
      - msg_router.staging
    condition: .vector_metadata.exclude != true
  staging_throttler:
    type: sample
    inputs:
      - staging_filter
    rate: 20 # 5%
  staging_metadata:
    type: remap
    inputs:
      - staging_throttler
    source: |-
      .host = .vector_metadata.node
      if exists(.vector_metadata.host) {
        .host = .vector_metadata.host
      }
      .splunk.metadata.index = .vector_metadata.index
      .splunk.metadata.source = .vector_metadata.source
      .splunk.metadata.sourcetype = .vector_metadata.sourcetype
sinks:
  staging_splunk_hec:
    type: splunk_hec_logs
    inputs:
      - staging_metadata
    endpoint: https://hec.splunk.staging:8088
    default_token: '${STAGING_HEC_TOKEN}'
    encoding:
      codec: text
    index: '{{ splunk.metadata.index }}'
    source: '{{ splunk.metadata.source }}'
    sourcetype: '{{ splunk.metadata.sourcetype }}'
    acknowledgements:
      query_interval: 30
      retry_limit: 60
    request:
      timeout_secs: 1200
      retry_max_duration_secs: 300
      concurrency: adaptive
    buffer:
      type: disk
      max_size: 5368709120 # 5Gi
Version
vector 0.31.0 (x86_64-unknown-linux-gnu 0f13b22 2023-07-06 13:52:34.591204470)
Hi @sbalmos, that could be related to protobuf request size limits. Would you be able to share your config to help get a sense of what might be going on?
For this instance, which is mainly a message router to different destinations, it's not that interesting. I've updated the Configuration section of the original post.
Found the original issue referencing the tonic lib change that introduced the 4 MB limit: #17926. Just linking it here for reference; not sure if it's related.
Can you share a sample input that I can use in my local kafka producer in order to trigger this error?
I do not have one, since I can't backtrack which input causes the error and thus ends up being dropped by the buffer/sink. My best guess is that you could trigger it by producing a massive input event of some sort, something that will definitely end up over 4 MB in size whether encoded as protobuf or JSON; a rough sketch of that follows.
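For example, something along these lines might publish such an event into the topic Vector consumes from (kcat, the topic name, and the librdkafka size override here are assumptions; the SASL settings from the config above would also need to be passed, and the brokers would have to accept messages that large):
# Hypothetical reproduction: publish a single ~5 MiB JSON event to a test topic.
payload=$(head -c 5242880 /dev/zero | tr '\0' 'x')
printf '{"destinations":["staging"],"vector_metadata":{"node":"test","index":"main","source":"repro","sourcetype":"json"},"message":"%s"}\n' "$payload" \
  | kcat -b kafka-kafka-bootstrap.kafka:9093 -t repro-topic -P -X message.max.bytes=10485760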
I think that 4 MB limit was only added for the tonic gRPC server for decoding incoming requests, not for protobuf decoding generally 🤔
@sbalmos it sounds like you are seeing more of these errors in 0.31.0 vs 0.29.0? If so, I'm wondering if we could try to bisect down to identify a specific commit that causes the issue.
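For reference, a bisect could look roughly like this, assuming a local checkout of the vector repo and some reproducible way to trigger the decode error (repro.yaml here is a placeholder for that reproduction config):
# Hypothetical bisect between the two release tags.
git bisect start v0.31.0 v0.29.0              # v0.31.0 bad, v0.29.0 good
cargo build --release                         # build the candidate commit
./target/release/vector --config repro.yaml   # run whatever reproduces the error
git bisect good                               # or `git bisect bad`, depending on the outcome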
Looking at this again, it sounds like it could be the case that events written to a buffer by v0.29.0 couldn't be read by v0.31.0. It'd be worth trying that as a stand-alone test case.
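A minimal sketch of that stand-alone test, assuming side-by-side 0.29.0 and 0.31.0 binaries and a throwaway config whose sink endpoint is unreachable so events pile up in the disk buffer (the binary names, config file, and data_dir are placeholders):
./vector-0.29.0 --config buffer-test.yaml &   # sink unreachable, so events queue in the disk buffer
old_pid=$!
sleep 60 && kill "$old_pid"
./vector-0.31.0 --config buffer-test.yaml     # same data_dir: does the reader log InvalidProtobufPayload?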
Gah, haven't really had time yet to get back to this, totally my fault. We've since gone all-0.32.1 and the issue's still present, so it's not a disk buffer format backwards incompatibility. I've got it on my todo list to trace back the usages of InvalidProtobufPayload to see if anything might make sense.
If you're still seeing the problem on fresh installs of 0.32, with fresh disk buffers, then I would agree that it's not related to compatibility issues of the disk buffer files between versions.
Technically speaking, the error is most commonly expected due to something like a compatibility issue (mismatch in the data vs the Protocol Buffers definition, etc) but it can also be triggered purely from the perspective of "is this even valid Protocol Buffers data at all?" It sounds like you're experiencing the latter.
If there's any more information you can share at all, it would be appreciated. Things like the number of events it reports dropping in a single go (which would be a proxy for the size of the records in the buffer itself), the rough size of your input events if you're able to calculate that, etc.
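One rough way to estimate input event sizes straight from the Kafka topic (kcat and the topic name are assumptions; the SASL options from the config would also be needed):
# Print the payload size of each message and count anything over 4 MiB.
kcat -b kafka-kafka-bootstrap.kafka:9093 -t some-topic -C -e -f '%S\n' \
  | awk '$1 > 4194304 { over++ } END { printf "messages over 4 MiB: %d\n", over+0 }'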
I just hit this error on several of my Vector instances after a restart; they then fail to start:
2024-05-24T09:09:52.916943Z ERROR vector::topology::builder: Configuration error. error=Sink "out_vector_high": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload
Is there at least a way to work around this? E.g. to detect the broken buffer and delete it. I tried vector validate but it passes fine.
The best thing I came up with is this:
# Start Vector under a 10-second timeout. If timeout has to kill it (exit code 124), Vector
# started successfully; any other exit code means it died early, so wipe the buffer.
timeout 10 vector --graceful-shutdown-limit-secs 1 -t 1; [ $? -eq 124 ] || rm -rf /var/lib/vector/buffer
Boot getting stuck due to buffer corruption is the worst-case scenario. In my environment it is better to wipe out a broken buffer than to keep failing.
Another workaround is to add this to the entrypoint before starting the Vector process:
# Record the time of each startup; if buffer.db has not been modified since the
# previous startup, assume the buffer is corrupted and delete it.
now=$(date +%s)
last_startup=$(cat /var/lib/vector/startup 2>/dev/null || echo 0)
last_startup_age=$(( (now - last_startup) / 60 ))
if find /var/lib/vector/buffer -type f -name buffer.db -mmin +"$last_startup_age" | grep -q buffer.db; then
  log_error "Cleaning vector buffer as it was not updated since last startup ($last_startup_age minutes) to fix startup in case of buffer corruption"
  rm -rf /var/lib/vector/buffer
fi
echo "$now" > /var/lib/vector/startup
It simply deletes the buffer directory if buffer.db was not modified since the last startup.
Also having the same issue here; are there any updates or solutions? Our setup is pretty simple: Vector is deployed as a sidecar with a pod that writes to a PVC, and Vector parses the files and sends them to S3. We have pretty large log lines, some being many MB in size. I constantly get the following errors:
Vector version: 0.46.1
2025-05-13T15:21:10.509291Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_buffers::internal_events: Error encountered during buffer read. error=failed to decoded record: InvalidProtobufPayload error_code="decode_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
Which then leads to:
2025-05-13T15:23:44.380520Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=3 intentional=false reason=corrupted_events stage=0
Lastly, after a while my pods end up not being able to start, the same as @fpytloun, with:
2024-05-24T09:09:52.916943Z ERROR vector::topology::builder: Configuration error. error=Sink "out_vector_high": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload
My Vector config looks like:
api:
  address: 127.0.0.1:8686
  enabled: true
  playground: false
data_dir: /vector-data-dir
sinks:
  dropped_files:
    type: file
    inputs:
      - '*.dropped'
    idle_timeout_secs: 30
    path: /logdir/dropped/vector-%Y-%m-%d.log
    timezone: local
    encoding:
      codec: json
  s3:
    acknowledgements:
      enabled: true
    bucket: my-bucket
    buffer:
      - max_size: 10843548800
        type: disk
    encoding:
      codec: json
      except_fields:
        - date
        - env
        - service
        - system
    framing:
      method: newline_delimited
    batch:
      max_bytes: 40240000
      timeout_secs: 60
    inputs:
      - json_parser
    key_prefix: raw/{{ scope }}/{{ service }}/{{ contenttypenamespace }}.{{ contenttypename }}/environment={{ env }}/date=%F/
    type: aws_s3
  prometheus:
    type: prometheus_exporter
    inputs:
      - filter_metrics
transforms:
  filter_metrics:
    type: filter
    inputs:
      - metrics
    condition:
      type: vrl
      source: |
        .name == "component_errors_total" || (includes(["buffer_byte_size", "buffer_discarded_events_total", "buffer_events", "buffer_received_event_bytes_total", "buffer_sent_event_bytes_total", "buffer_sent_events_total", "component_received_events_total", "component_discarded_events_total"], .name) && !includes(["metrics", "filter_metrics", "prometheus"], .tags.component_id))
  json_parser:
    type: remap
    drop_on_error: true
    reroute_dropped: true
    inputs:
      - tail
    source: |
      . = parse_json!(.message)
      contentRecord = parse_json!(.properties.contentRecord)
      .message = ""
      .content = contentRecord.content
      .timestamp = .properties.timestamp
      .contenttypename = .properties.contentTypeName
      .contenttypenamespace = .properties.contentTypeNamespace
      .labels = contentRecord.labels
      .scope = "app"
      .env = "prod"
      .service = "app"
      del(.loglevel)
      del(.logger)
      del(.@cbtimestamp)
      del(.@timestamp)
      del(.properties)
sources:
  metrics:
    type: internal_metrics
  tail:
    type: file
    data_dir: /vector-data-dir
    include:
      - "/logdir/*.clog"
    max_line_bytes: 40240000
    fingerprint:
      strategy: device_and_inode
I don't know of any way to tell which event is causing this, as we have many hundreds of events per second. The only fix when Vector can't start anymore is to delete the buffer; it has crashed a few times, and deleting that one buffer was the fix every time.
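If wiping state becomes routine, it may be possible to limit the blast radius to the affected sink instead of deleting everything under data_dir; a rough sketch, assuming the disk buffers live under <data_dir>/buffer/v2/<sink_id>/ (verify the layout before deleting, and note that anything still sitting in that buffer is lost):
ls /vector-data-dir/buffer/v2/            # confirm the layout and sink ids first
rm -rf /vector-data-dir/buffer/v2/s3      # "s3" is the sink id from the config above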