Uptick in dropped events from disk buffer InvalidProtobufPayload errors
Problem
Since upgrading to 0.31 (from 0.29 on this instance), there has been a marked uptick in dropped events on my splunk_hec_logs sink, which is backed by a disk buffer. The error indicates the events are dropped due to an InvalidProtobufPayload error while reading from the disk buffer.
2023-08-01T13:18:46.116488Z ERROR sink{component_kind="sink" component_id=staging_splunk_hec component_type=splunk_hec_logs component_name=staging_splunk_hec}: vector_buffers::internal_events: Error encountered during buffer read. error=failed to decoded record: InvalidProtobufPayload error_code="decode_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
I can't find it at the moment, but I seem to remember another issue or discussion where the underlying protobuf library now enforces a 4 MB size limit and potentially truncates messages larger than that. Maybe that is related?
Configuration
data_dir: /vector-data-dir
acknowledgements:
  enabled: true
api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false
sources:
  kafka_in:
    type: kafka
    bootstrap_servers: kafka-kafka-bootstrap.kafka:9093
    group_id: '${KAFKA_CONSUMER_GROUP_ID}'
    topics:
      - ^[^_].+
    librdkafka_options:
      "topic.blacklist": "^strimzi.+"
    decoding:
      codec: json
    sasl:
      enabled: true
      mechanism: SCRAM-SHA-512
      username: '${KAFKA_CONSUMER_USERNAME}'
      password: '${KAFKA_CONSUMER_PASSWORD}'
transforms:
  msg_router:
    type: route
    inputs:
      - kafka_in
    route:
      staging: includes(array!(.destinations), "staging")
      # a few other routes
  staging_filter:
    type: filter
    inputs:
      - msg_router.staging
    condition: .vector_metadata.exclude != true
  staging_throttler:
    type: sample
    inputs:
      - staging_filter
    rate: 20 # 5%
  staging_metadata:
    type: remap
    inputs:
      - staging_throttler
    source: |-
      .host = .vector_metadata.node
      if exists(.vector_metadata.host) {
        .host = .vector_metadata.host
      }
      .splunk.metadata.index = .vector_metadata.index
      .splunk.metadata.source = .vector_metadata.source
      .splunk.metadata.sourcetype = .vector_metadata.sourcetype
sinks:
  staging_splunk_hec:
    type: splunk_hec_logs
    inputs:
      - staging_metadata
    endpoint: https://hec.splunk.staging:8088
    default_token: '${STAGING_HEC_TOKEN}'
    encoding:
      codec: text
    index: '{{ splunk.metadata.index }}'
    source: '{{ splunk.metadata.source }}'
    sourcetype: '{{ splunk.metadata.sourcetype }}'
    acknowledgements:
      query_interval: 30
      retry_limit: 60
    request:
      timeout_secs: 1200
      retry_max_duration_secs: 300
      concurrency: adaptive
    buffer:
      type: disk
      max_size: 5368709120 # 5Gi
Version
vector 0.31.0 (x86_64-unknown-linux-gnu 0f13b22 2023-07-06 13:52:34.591204470)
Hi @sbalmos, that could be related to protobuf request size limits. Would you be able to share your config to help get a sense of what might be going on?
For this instance, which is mainly a message router to different destinations, it's not that interesting. I've updated the Configuration section of the original post.
Found the original issue referencing the tonic lib change that introduced the 4 MB limit: #17926. Just linking it here for reference; not sure if it's related.
Can you share a sample input that I can use in my local kafka producer in order to trigger this error?
I do not have one, since I can't backtrack which input causes the error and thus ends up being dropped by the buffer/sink. My best guess is that you could trigger it by producing a massive input event of some sort, something that will definitely end up over 4 MB in size whether encoded as protobuf or JSON; a rough sketch of that follows.
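For example, something along these lines might publish such an event into the topic Vector consumes from (kcat, the topic name, and the librdkafka size override here are assumptions; the SASL settings from the config above would also need to be passed, and the brokers would have to accept messages that large):
# Hypothetical reproduction: publish a single ~5 MiB JSON event to a test topic.
payload=$(head -c 5242880 /dev/zero | tr '\0' 'x')
printf '{"destinations":["staging"],"vector_metadata":{"node":"test","index":"main","source":"repro","sourcetype":"json"},"message":"%s"}\n' "$payload" \
  | kcat -b kafka-kafka-bootstrap.kafka:9093 -t repro-topic -P -X message.max.bytes=10485760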
I think that 4 MB limit was only added for the tonic gRPC server for decoding incoming requests, not for protobuf decoding generally 🤔
@sbalmos it sounds like you are seeing more of these errors in 0.31.0 vs 0.29.0? If so, I'm wondering if we could try to bisect down to identify a specific commit that causes the issue.
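For reference, a bisect could look roughly like this, assuming a local checkout of the vector repo and some reproducible way to trigger the decode error (repro.yaml here is a placeholder for that reproduction config):
# Hypothetical bisect between the two release tags.
git bisect start v0.31.0 v0.29.0              # v0.31.0 bad, v0.29.0 good
cargo build --release                         # build the candidate commit
./target/release/vector --config repro.yaml   # run whatever reproduces the error
git bisect good                               # or `git bisect bad`, depending on the outcome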
Looking at this again, it sounds like it could be the case that events written to a buffer by v0.29.0 couldn't be read by v0.31.0. It'd be worth trying that as a stand-alone test case.
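A minimal sketch of that stand-alone test, assuming side-by-side 0.29.0 and 0.31.0 binaries and a throwaway config whose sink endpoint is unreachable so events pile up in the disk buffer (the binary names, config file, and data_dir are placeholders):
./vector-0.29.0 --config buffer-test.yaml &   # sink unreachable, so events queue in the disk buffer
old_pid=$!
sleep 60 && kill "$old_pid"
./vector-0.31.0 --config buffer-test.yaml     # same data_dir: does the reader log InvalidProtobufPayload?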
Gah, haven't really had time yet to get back to this, totally my fault. We've since gone all-0.32.1 and the issue's still present, so it's not a disk buffer format backwards incompatibility. I've got it on my todo list to trace back the usages of InvalidProtobufPayload to see if anything might make sense.
If you're still seeing the problem on fresh installs of 0.32, with fresh disk buffers, then I would agree that it's not related to compatibility issues of the disk buffer files between versions.
Technically speaking, the error is most commonly expected due to something like a compatibility issue (mismatch in the data vs the Protocol Buffers definition, etc) but it can also be triggered purely from the perspective of "is this even valid Protocol Buffers data at all?" It sounds like you're experiencing the latter.
If there's any more information you can share at all, it would be appreciated. Things like the number of events it reports dropping in a single go (which would be a proxy for the size of the records in the buffer itself), the rough size of your input events if you're able to calculate that, etc.
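One rough way to estimate input event sizes straight from the Kafka topic (kcat and the topic name are assumptions; the SASL options from the config would also be needed):
# Print the payload size of each message and count anything over 4 MiB.
kcat -b kafka-kafka-bootstrap.kafka:9093 -t some-topic -C -e -f '%S\n' \
  | awk '$1 > 4194304 { over++ } END { printf "messages over 4 MiB: %d\n", over+0 }'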
I just hit this error on several of my Vector instances after a restart; they then fail to start:
2024-05-24T09:09:52.916943Z ERROR vector::topology::builder: Configuration error. error=Sink "out_vector_high": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload
Is there at least a way to work around this? E.g. to detect the broken buffer and delete it. I tried vector validate but it passes fine.
The best thing I came up with is this:
# Start Vector under a 10-second timeout. If timeout has to kill it (exit code 124), Vector
# started successfully; any other exit code means it died early, so wipe the buffer.
timeout 10 vector --graceful-shutdown-limit-secs 1 -t 1; [ $? -eq 124 ] || rm -rf /var/lib/vector/buffer
Boot getting stuck due to buffer corruption is the worst-case scenario. In my environment it is better to wipe out a broken buffer than to keep failing.
Another workaround is to add this to the entrypoint before starting the Vector process:
# Record the time of each startup; if buffer.db has not been modified since the
# previous startup, assume the buffer is corrupted and delete it.
now=$(date +%s)
last_startup=$(cat /var/lib/vector/startup 2>/dev/null || echo 0)
last_startup_age=$(( (now - last_startup) / 60 ))
if find /var/lib/vector/buffer -type f -name buffer.db -mmin +"$last_startup_age" | grep -q buffer.db; then
  log_error "Cleaning vector buffer as it was not updated since last startup ($last_startup_age minutes) to fix startup in case of buffer corruption"
  rm -rf /var/lib/vector/buffer
fi
echo "$now" > /var/lib/vector/startup
It simply deletes the buffer directory if buffer.db was not modified since the last startup.
Also having the same issue here; are there any updates or solutions? Our setup is pretty simple: Vector is deployed as a sidecar with a pod that writes to a PVC, and Vector parses the files and sends them to S3. We have pretty large log lines, some being many MB in size. I constantly get the following errors:
Vector version: 0.46.1
2025-05-13T15:21:10.509291Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_buffers::internal_events: Error encountered during buffer read. error=failed to decoded record: InvalidProtobufPayload error_code="decode_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
Which then leads to:
2025-05-13T15:23:44.380520Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:sink{buffer_type="disk"}: vector_buffers::internal_events: Events dropped. count=3 intentional=false reason=corrupted_events stage=0
Lastly, after a while my pods end up not being able to start, the same as @fpytloun, with:
2024-05-24T09:09:52.916943Z ERROR vector::topology::builder: Configuration error. error=Sink "out_vector_high": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload
My Vector config looks like:
api:
  address: 127.0.0.1:8686
  enabled: true
  playground: false
data_dir: /vector-data-dir
sinks:
  dropped_files:
    type: file
    inputs:
      - '*.dropped'
    idle_timeout_secs: 30
    path: /logdir/dropped/vector-%Y-%m-%d.log
    timezone: local
    encoding:
      codec: json
  s3:
    acknowledgements:
      enabled: true
    bucket: my-bucket
    buffer:
      - max_size: 10843548800
        type: disk
    encoding:
      codec: json
      except_fields:
        - date
        - env
        - service
        - system
    framing:
      method: newline_delimited
    batch:
      max_bytes: 40240000
      timeout_secs: 60
    inputs:
      - json_parser
    key_prefix: raw/{{ scope }}/{{ service }}/{{ contenttypenamespace }}.{{ contenttypename }}/environment={{ env }}/date=%F/
    type: aws_s3
  prometheus:
    type: prometheus_exporter
    inputs:
      - filter_metrics
transforms:
  filter_metrics:
    type: filter
    inputs:
      - metrics
    condition:
      type: vrl
      source: |
        .name == "component_errors_total" || (includes(["buffer_byte_size", "buffer_discarded_events_total", "buffer_events", "buffer_received_event_bytes_total", "buffer_sent_event_bytes_total", "buffer_sent_events_total", "component_received_events_total", "component_discarded_events_total"], .name) && !includes(["metrics", "filter_metrics", "prometheus"], .tags.component_id))
  json_parser:
    type: remap
    drop_on_error: true
    reroute_dropped: true
    inputs:
      - tail
    source: |
      . = parse_json!(.message)
      contentRecord = parse_json!(.properties.contentRecord)
      .message = ""
      .content = contentRecord.content
      .timestamp = .properties.timestamp
      .contenttypename = .properties.contentTypeName
      .contenttypenamespace = .properties.contentTypeNamespace
      .labels = contentRecord.labels
      .scope = "app"
      .env = "prod"
      .service = "app"
      del(.loglevel)
      del(.logger)
      del(.@cbtimestamp)
      del(.@timestamp)
      del(.properties)
sources:
  metrics:
    type: internal_metrics
  tail:
    type: file
    data_dir: /vector-data-dir
    include:
      - "/logdir/*.clog"
    max_line_bytes: 40240000
    fingerprint:
      strategy: device_and_inode
I don't know of any way to tell which event is causing this, as we have many hundreds of events per second. The only fix when Vector can't start anymore is to delete the buffer; it has crashed a few times, and deleting that one buffer was the fix every time.
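If wiping state becomes routine, it may be possible to limit the blast radius to the affected sink instead of deleting everything under data_dir; a rough sketch, assuming the disk buffers live under <data_dir>/buffer/v2/<sink_id>/ (verify the layout before deleting, and note that anything still sitting in that buffer is lost):
ls /vector-data-dir/buffer/v2/            # confirm the layout and sink ids first
rm -rf /vector-data-dir/buffer/v2/s3      # "s3" is the sink id from the config above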