An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall when using the http_server source, causing Vector to stop ingesting data
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
We had a similar issue when using the Splunk HEC source and raised bug report #17670. We then switched to the HTTP source and are now seeing a buffer error with it that causes Vector to stop ingesting any new data. The containers enter a restart loop with OOM errors on Kubernetes, and Vector logs the error below.
Kubernetes container status:
team-ingest-prod-0 0/1 OOMKilled 1 1m 19x.x.x.x vm-ch2-c <none>
Vector Error:
2023-08-10T11:48:04.553373Z INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-08-10T11:48:04.554149Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-08-10T11:48:06.390284Z WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `endpoint`. Please use `endpoints` option instead.
2023-08-10T11:48:06.390423Z WARN vector_core::tls::settings: The `verify_certificate` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:06.390436Z WARN vector_core::tls::settings: The `verify_hostname` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:06.437918Z WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `suppress_type_name`. Please use `api_version` option instead.
2023-08-10T11:48:07.994646Z WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `endpoint`. Please use `endpoints` option instead.
2023-08-10T11:48:07.994726Z WARN vector_core::tls::settings: The `verify_certificate` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:07.994738Z WARN vector_core::tls::settings: The `verify_hostname` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:08.039273Z WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `suppress_type_name`. Please use `api_version` option instead.
2023-08-10T11:48:08.107306Z INFO vector::topology::runner: Running healthchecks.
2023-08-10T11:48:08.107388Z INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:08.107538Z INFO vector::topology::builder: Healthcheck disabled.
2023-08-10T11:48:08.107558Z INFO source{component_kind="source" component_id=o11y_ingest_http component_type=http_server component_name=o11y_ingest_http}: vector::sources::util::http::prelude: Building HTTP server. address=0.0.0.0:8081
2023-08-10T11:48:08.107659Z INFO vector: Vector has started. debug="false" version="0.31.0" arch="x86_64" revision="38c3f0b 2023-05-22 17:38:48.655488673"
2023-08-10T11:48:08.107867Z INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:8181
2023-08-10T11:48:08.114700Z INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off
2023-08-10T11:48:08.122775Z INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:08.125420Z INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:43.659271Z ERROR transform{component_kind="transform" component_id=remap component_type=remap component_name=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2023-08-10T11:48:43.789150Z ERROR transform{component_kind="transform" component_id=remap component_type=remap component_name=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2023-08-10T11:48:43.789228Z INFO vector: Vector has stopped.
2023-08-10T11:48:43.801112Z INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="o11y_elastic_internal, metrics, vector, prom_exporter, o11y_ingest_http, o11y_maas_remote_write, internal_logs, remap_un_nest, o11y_ingest, o11y_elastic, remap, internal_metrics" time_remaining="59 seconds left"
2023-08-10T11:48:48.789690Z INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="o11y_elastic_internal, remap_un_nest" time_remaining="54 seconds left"
2023-08-10T11:48:53.790245Z INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="remap_un_nest" time_remaining="49 seconds left"
Configuration
data_dir = "/vector-data-dir"
[log_schema]
host_key = "host.hostname"
timestamp_key = "@timestamp"
[api]
enabled = true
address = "127.0.0.1:8686"
playground = false
[sources.internal_metrics]
type = "internal_metrics"
[sources.o11y_ingest]
address = "0.0.0.0:8080"
valid_tokens = [ "<token>" ]
type = "splunk_hec"
[sources.o11y_ingest_http]
address = "0.0.0.0:8081"
type = "http_server"
"framing.method" = "character_delimited"
"framing.character_delimited.delimiter" = ","
"framing.newline_delimited.max_length" = 8_192
"decoding.codec" = "json"
"encoding.codec" = "json"
[sources.internal_logs]
type = "internal_logs"
[transforms.remap_un_nest]
type = "remap"
inputs = [ "o11y_ingest", "o11y_ingest_http" ]
source = """
.message = parse_json!(.message)
. = unnest!(.message)"""
[transforms.remap]
type = "remap"
inputs = [ "remap_un_nest" ]
source = """
. = merge!(., .message)
parsed_json, err = parse_json(string!(.message))
if err == null {
., err = merge(., parsed_json)
del(.message)
} else {
del(.source_type)
del(.message)
}
del(.source_type)
del(.path)"""
[transforms.metrics]
type = "remap"
inputs = [ "internal_metrics" ]
source = """
.tags.job = "team-ingest-prod"
.tags.tenant_name = "team"
.tags.instance = "ingest-prod-ch2.example.com:443"
.tags.environment = "prod"
.tags.dc = "ch2"\
"""
[transforms.vector]
type = "remap"
inputs = [ "internal_logs" ]
source = """
.vector = del(.)
."@timestamp" = del(.vector.timestamp)
.message = del(.vector.message)
.log = {}
.log.logger = del(.vector.metadata.module_path)
.log.level = downcase(string!(del(.vector.metadata.level)))
.ecs = {}
.ecs.version = "1.7"
.cse = {}
.cse.version = "1.0"
.agent = {}
.agent.type = "vector"
.agent.version = "0.31.0"
.event = {}
.event.dataset = ["app"]
.host = {}
.host.hostname = get_env_var("ELK_HOSTNAME") ?? null
.cloud = {}
.cloud.provider = get_env_var("CLOUD_PROVIDER") ?? null
.cloud.region = get_env_var("CLOUD_REGION") ?? null
.tenant = {}
.tenant.name = "team_prod"
.service = {}
.service.name = "team_prod"
.host.o11y_hostname = get_hostname!()"""
[sinks.prom_exporter]
type = "prometheus_exporter"
inputs = [ "internal_metrics" ]
address = "0.0.0.0:8181"
[sinks.o11y_maas_remote_write]
type = "prometheus_remote_write"
inputs = [ "metrics" ]
default_namespace = "team"
endpoint = "https://<prometheus>.example.com:9090/api/v1/write"
tenant_id = "team_telemetry"
[sinks.o11y_maas_remote_write.healthcheck]
enabled = false
[sinks.o11y_elastic]
type = "elasticsearch"
inputs = [ "remap" ]
endpoint = "${ELK_URL}"
compression = "gzip"
suppress_type_name = true
[sinks.o11y_elastic.bulk]
index = "logz-team-app"
action = "create"
[sinks.o11y_elastic.auth]
strategy = "basic"
user = "${ELK_USERNAME}"
password = "${ELK_PASSWORD}"
[sinks.o11y_elastic.buffer]
type = "disk"
max_size = 900_000_000_000
when_full = "drop_newest"
[sinks.o11y_elastic.batch]
max_bytes = 8_300_000
timeout_secs = 8
[sinks.o11y_elastic.request]
concurrency = "adaptive"
timeout_secs = 120
[sinks.o11y_elastic.tls]
verify_certificate = false
verify_hostname = false
[sinks.o11y_elastic_internal]
type = "elasticsearch"
inputs = [ "vector" ]
compression = "gzip"
endpoint = "${ELK_URL}"
suppress_type_name = true
[sinks.o11y_elastic_internal.bulk]
index = "logz-forwarder"
action = "create"
[sinks.o11y_elastic_internal.auth]
strategy = "basic"
user = "${ELK_USERNAME}"
password = "${ELK_PASSWORD}"
[sinks.o11y_elastic_internal.buffer]
type = "disk"
max_size = 400_000_000
when_full = "drop_newest"
[sinks.o11y_elastic_internal.batch]
max_bytes = 8_300_000
timeout_secs = 8
[sinks.o11y_elastic_internal.request]
retry_attempts = 10
[sinks.o11y_elastic_internal.tls]
verify_certificate = false
verify_hostname = false
Version
0.31.x
Debug Output
No response
Example Data
No response
Additional Context
Vector is running in Kubernetes, and this specific client has large payloads with close to 6,000-8,000 fields across their entire dataset.
References
#17670 - We faced a similar issue with the Splunk HEC source; we moved to the HTTP source and are now seeing a new error.
Thanks for this report @raghu999! Did you share the right logs, though? The error logs there seem to be showing an error with the disk buffers. Do you have logs showing the panic in the http_server source?
@jszwedko Updated the issue. We are only seeing the buffer error, but it is the same dataset that produced an error with the Splunk HEC source; after switching to the HTTP server source we now see this new buffer-related error. We confirmed that the disk buffer is not full when this issue happens.
Gotcha, thanks for clarifying @raghu999! It sounds more like a bug in the disk buffer implementation than the http source then.
This error comes from our limitations around the maximum record size allowed in a disk buffer, which is set to 128MB. A record is essentially a chunk of events written all together, which, based on the sources used in the given configuration, would be any events decoded from a single request.
This would imply that, if the error above is being hit, a single request sent to either the splunk_hec or http_server source is very large, something like at least 90-100MB: each decoded event will have some internal overhead, so the request doesn't need to be 128MB by itself.
Based on your knowledge of the clients sending to your Vector instance, is it possible for requests to be that large (90-100MB)?
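To make the failure mode concrete, here is a minimal sketch (not Vector's actual buffer code; the constant and error shape are simplifications) of the kind of check that produces this error: a single encoded record larger than the fixed limit is rejected outright, regardless of how much free space the buffer has overall.

// Illustration only -- not Vector's implementation. It mimics a writer that
// refuses any single record whose encoded form exceeds a fixed limit.
const MAX_RECORD_BYTES: usize = 128 * 1024 * 1024; // ~128 MiB record size limit

#[derive(Debug)]
enum WriteError {
    // Mirrors the "failed to encode record: BufferTooSmall" failure.
    BufferTooSmall { required: usize, remaining: usize },
}

fn write_record(encoded_record: &[u8]) -> Result<(), WriteError> {
    if encoded_record.len() > MAX_RECORD_BYTES {
        return Err(WriteError::BufferTooSmall {
            required: encoded_record.len(),
            remaining: MAX_RECORD_BYTES,
        });
    }
    // ...append the record to the current data file...
    Ok(())
}

fn main() {
    // One oversized record (e.g. all events decoded from a single huge request,
    // written together as one chunk) is enough to trigger the error.
    let oversized = vec![0u8; MAX_RECORD_BYTES + 1];
    println!("{:?}", write_record(&oversized));
}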
Hi everyone! Facing the same issue here:
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.412419Z ERROR transform{component_kind="transform" component_id=parsing_transform_new component_type=remap component_name=parsing_transform_new}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.414728Z ERROR transform{component_kind="transform" component_id=parsing_transform_new component_type=remap component_name=parsing_transform_new}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.415193Z INFO vector: Vector has stopped.
This is the transform:
parsing_transform_new:
  type: remap
  inputs:
    - profile_route.parsing
  source: |-
    if (springboot_match, err = parse_regex(.message, r'^(?P<timestamp>\d+(?:-\d+){2}\s+\d+(?::\d+){2}\.\d+)\s+(?P<log_level>\S+)\s+(?P<pid>\d+)\s+---\s+\[(?P<thread_name>[\s+\S]*?)\]\s+(?P<logger_name>\S+)\s+:\s+(?s)(?P<message>.*)'); err == null) { # java_springboot
      .timestamp = parse_timestamp!(springboot_match.timestamp, "%F %T.%3f")
      .log_level = downcase!(springboot_match.log_level)
      .pid = to_int!(springboot_match.pid)
      .thread_name = strip_whitespace!(springboot_match.thread_name)
      .logger_name = springboot_match.logger_name
      .message = to_string(springboot_match.message)
    } else if (python_match, err = parse_regex(.message, r'^\[(?P<timestamp>\d+-\d+-\d+\s\d+:\d+:\d+,\d+)\s(?P<log_level>\w+)\s(?P<logger_name>\S+)\s(?P<thread_name>\w+\S+\s\S+)]\s(?s)(?P<message>.*)'); err == null) { # python
      .timestamp = parse_timestamp!(python_match.timestamp, "%F %T,%3f")
      .log_level = downcase!(python_match.log_level)
      .logger_name = python_match.logger_name
      .thread_name = strip_whitespace!(python_match.thread_name)
      .message = to_string(python_match.message)
    } else if (vault_match, err = parse_regex(.message, r'^(?P<timestamp>\d+-\d+-\d+T\d+:\d+:\d+.\d+Z)\s\[(?P<log_level>\w+)]\s+(?s)(?P<message>.*)$'); err == null) { # vault
      .timestamp = vault_match.timestamp
      .log_level = downcase!(vault_match.log_level)
      .message = to_string(vault_match.message)
    } else {
      .malformed = true
    }
In my case it is unlikely for requests to be even close to 90-100 MB. Version: v0.32.0
We're seeing this error again with a different application. Some relevant log lines
vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2023-11-15T23:45:07.499790Z DEBUG transform{component_kind="transform" component_id=remap component_type=remap}: vector::topology::builder: Synchronous transform finished with an error.
2023-11-15T23:45:07.499819Z ERROR transform{component_kind="transform" component_id=remap component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2023-11-15T23:45:07.499874Z INFO vector: Vector has stopped.
2023-11-15T23:45:07.499934Z DEBUG vector_buffers::variants::disk_v2::writer: Writer marked as closed.
2023-11-15T23:45:07.500314Z DEBUG source{component_kind="source" component_id=internal_logs component_type=internal_logs}: vector::topology::builder: Source finished normally.
The source is http_server. The symptom is that memory utilization in the pod grows to the point of being OOM killed. But given the 'Source finished normally' line in the log above, Vector might just have exited?
Version we used: v0.34.0
A few questions:
- So the `remap` transform is also accessing the disk buffer while doing the transform?
- If transforms also access disk buffers, is there a configuration for the disk buffer size for transforms?
- When we removed the transform our pods were no longer OOMKilled, so is this not a record size limitation but a transform limitation?
- Is there a way for Vector to print/obtain the event size inside a remap transform?
- What does `Synchronous transform finished with an error.` mean? (What function call might trigger this error message? Or, are all transforms synchronous, and what characterizes a "synchronous" transform?)
I am seeing similar issues
2024-02-20T09:21:25.663269Z ERROR transform{component_kind="transform" component_id=kubernetes_application component_type=remap component_name=kubernetes_application}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2024-02-20T09:21:25.663344Z INFO vector: Vector has stopped.
2024-02-20T09:21:25.663365Z ERROR transform{component_kind="transform" component_id=route_logs component_type=route component_name=route_logs}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-02-20T09:21:25.676755Z ERROR transform{component_kind="transform" component_id=containerlogs component_type=remap component_name=containerlogs}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-02-20T09:21:25.680198Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="splunk_kube_audit, fluentbit, kubernetes_application, prometheus_exporter, splunk_container_logs, kubernetes_audit, httpd, vector_metrics, kubernetes_infrastructure, route_logs, journal_logs, containerlogs" time_remaining="59 seconds left"
The containers exit under a certain load, around 10k events/sec.
@awangc have you managed to find the root cause or a workaround? I've seen similar issues while sending logs from upstream Vector pods that are part of OpenShift, where the default batch size is 10MB I believe. I've been testing by sending just generic nginx logs, so really small events. Under a certain load, Vector would exit with the same buffer-related error. I have observed the issue not happening when the batch size on the upstream Vector is set to 1MB (so 10 times less).
@jszwedko is there any documentation on how transforms are related to disk buffers?
Transforms emit data to an output channel which is then consumed by buffers. Is that the sort of detail you were looking for? Or do you have a different sort of question?
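Roughly, that handoff looks like the sketch below (a simplified stand-in using a plain standard-library channel; this is an illustration, not Vector's actual topology code): the transform pushes events into a channel, and the buffer/sink side drains it. The BufferTooSmall error above is raised on the writing (transform) side of that handoff.

use std::sync::mpsc;
use std::thread;

fn main() {
    // Transform side: produces events and sends them into the channel.
    let (sender, receiver) = mpsc::channel::<String>();

    let transform = thread::spawn(move || {
        for i in 0..3 {
            // In Vector this is roughly where the remap transform hands events to
            // the buffer; a failed write surfaces as the transform error seen above.
            sender.send(format!("event {i}")).expect("buffer writer failed");
        }
    });

    // Buffer/sink side: consumes whatever the transform emitted.
    let consumer = thread::spawn(move || {
        for event in receiver {
            println!("buffered for sink: {event}");
        }
    });

    transform.join().unwrap();
    consumer.join().unwrap();
}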
For this issue, I know it is a lot to ask, but if anyone is able to create a standalone reproduction, that would aid with debugging.
I have managed to create a test case that recreates this issue. It seems to be a combination of large events, a transform, and disk buffers for a sink.
I created a large JSON test event with a "message" field also containing JSON, approx. 10 MB. I then replicated that same event multiple times into 3 different files, each about 500 MB. I can provide the example event if required.
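In case it helps others reproduce, here is a rough sketch of a generator for this kind of fixture (the inner JSON layout, sizes, and output paths are placeholders I chose, not the exact event): each line is a JSON object whose message field holds an escaped ~10 MB JSON document, repeated until the file is roughly 500 MB.

use std::fs::{create_dir_all, File};
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Inner JSON document (~10 MB) that will be embedded as the "message" string.
    let inner = format!(r#"{{"padding":"{}"}}"#, "x".repeat(10 * 1024 * 1024));
    // Escape backslashes and quotes by hand so the outer line stays valid JSON
    // without pulling in a JSON crate.
    let escaped = inner.replace('\\', "\\\\").replace('"', "\\\"");
    let line = format!("{{\"message\":\"{}\"}}\n", escaped);

    create_dir_all("/tmp/vector/source-events")?;
    for name in ["large_events.json", "large_events_2.json", "large_events_3.json"] {
        let file = File::create(format!("/tmp/vector/source-events/{name}"))?;
        let mut out = BufWriter::new(file);
        for _ in 0..50 {
            // ~50 copies of a ~10 MB line gives a file of roughly 500 MB.
            out.write_all(line.as_bytes())?;
        }
        out.flush()?;
    }
    Ok(())
}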
Below is the config to replicate it along with the output of a run.
api:
  enabled: true
data_dir: /tmp/vector/data
sources:
  events:
    type: file
    include:
      - /tmp/vector/source-events/*.json
    ignore_checkpoints: true
    fingerprinting:
      strategy: device_and_inode
    max_line_bytes: 524288000
sinks:
  blackhole:
    type: blackhole
    inputs:
      - split_events
    buffer:
      type: disk
      max_size: 1073741824
      when_full: block
    print_interval_secs: 10
    rate: 10
transforms:
  split_events:
    type: remap
    inputs:
      - events
    source: |-
      . = parse_json!(.message)
$ ./vector-0.44.0 -c config.yaml
2025-01-20T23:35:07.768304Z INFO vector::app: Log level is enabled. level="info"
2025-01-20T23:35:07.769396Z INFO vector::app: Loading configs. paths=["config.yaml"]
2025-01-20T23:35:08.043441Z INFO vector::topology::running: Running healthchecks.
2025-01-20T23:35:08.043814Z INFO vector::topology::builder: Healthcheck passed.
2025-01-20T23:35:08.044247Z INFO vector: Vector has started. debug="false" version="0.44.0" arch="x86_64" revision="3cdc7c3 2025-01-13 21:26:04.735691656"
2025-01-20T23:35:08.043928Z INFO source{component_kind="source" component_id=events component_type=file}: vector::sources::file: Starting file server. include=["/tmp/vector/source-events/*.json"] exclude=[]
2025-01-20T23:35:08.045909Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: file_source::checkpointer: Loaded checkpoint data.
2025-01-20T23:35:08.046107Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/tmp/vector/source-events/large_event.json
2025-01-20T23:35:08.046154Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/tmp/vector/source-events/large_events.json
2025-01-20T23:35:08.046180Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/tmp/vector/source-events/large_events_2.json
2025-01-20T23:35:08.046203Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/tmp/vector/source-events/large_events_3.json
2025-01-20T23:35:08.056236Z INFO vector::sinks::blackhole::sink: Collected events. events=0 raw_bytes_collected=0
2025-01-20T23:35:08.069372Z INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=http://127.0.0.1:8686/playground graphql=http://127.0.0.1:8686/graphql
2025-01-20T23:35:18.817311Z INFO vector::sinks::blackhole::sink: Collected events. events=21 raw_bytes_collected=203819011
2025-01-20T23:35:27.231962Z ERROR transform{component_kind="transform" component_id=split_events component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2025-01-20T23:35:27.235103Z ERROR transform{component_kind="transform" component_id=split_events component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2025-01-20T23:35:27.239027Z INFO vector: Vector has stopped.
2025-01-20T23:35:27.247389Z INFO vector::sinks::blackhole::sink: Collected events. events=21 raw_bytes_collected=203819011
2025-01-20T23:35:27.247922Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="events" time_remaining="59 seconds left"
2025-01-20T23:35:27.642362Z ERROR source{component_kind="source" component_id=events component_type=file}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
@jszwedko are you able to replicate this with what I have provided above?
Hey! Thank you so much for providing the reproduction case. That is incredibly helpful. I haven't had the time to dig into this yet, but @pront is going to take it up instead.
Hi @DenverJ, thank you for providing the repro steps. I managed to reproduce this bug locally with your setup.
❯ ls -ltr
total 1542156
-rw-r--r-- 1 pavlos.rontidis 513807973 Feb 25 15:51 file1.json
-rw-r--r-- 1 pavlos.rontidis 513807973 Feb 25 15:51 file2.json
-rw-r--r-- 1 pavlos.rontidis 513807973 Feb 25 15:51 file3.json
The inner error was:
failed to encode Protobuf message; insufficient buffer capacity (required: 817903482, remaining: 134217680)
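For context, a quick conversion of those byte counts (just back-of-the-envelope arithmetic, not part of the original report): the writer had roughly the 128 MiB record limit available, and the encoded record was about six times larger.

fn main() {
    const MIB: f64 = 1024.0 * 1024.0;
    let required: u64 = 817_903_482; // encoded size of the record the writer tried to store
    let remaining: u64 = 134_217_680; // capacity the writer reported as available

    // Prints: required ~ 780.0 MiB, remaining ~ 128.0 MiB
    println!("required  ~ {:.1} MiB", required as f64 / MIB);
    println!("remaining ~ {:.1} MiB", remaining as f64 / MIB);
    // The remaining capacity matches the ~128 MB record size limit mentioned earlier
    // (less a few bytes of overhead), so this record exceeded it roughly sixfold.
}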
One more variation of the config for the bug:
sources:
  pods_logs:
    type: kubernetes_logs
    max_line_bytes: 131072
transforms:
  pods_source_types:
    type: remap
    inputs:
      - pods_logs
    source: |-
      .source = "pods_logs"
      .source_environment = "some-env"
sinks:
  to_kafka:
    type: kafka
    inputs:
      - pods_source_types
    bootstrap_servers: ...
    compression: zstd
    encoding:
      codec: "json"
    topic: logs-kubernetes-some-env
    batch:
      max_bytes: 1048576 # 1MB
      # The maximum size of a batch before it is flushed.
      max_events: 100
      # The maximum age of a batch before it is flushed.
      timeout_secs: 1
    buffer:
      max_size: 5368709120 # 5GB
      type: disk
log:
F 2025-05-02T18:49:54.285554Z ERROR transform{component_kind="transform" component_id=pods_source_types component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
F 2025-05-02T18:49:54.290775Z ERROR transform{component_kind="transform" component_id=pods_source_types component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
Events are not super small though (thus we had to increase max_line_bytes).
@pront Hi, are there any updates on this issue? We are facing the same one as well.
Hi @charlie0129, unfortunately we do not have the capacity to prioritize this. We do have a reproduction, which is a great first step. Community PRs are always welcome.
We're seeing this issue as well.
We are facing the same issue with the disk buffer.