An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall when using the http_server source, causing Vector to stop ingesting data
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
We had a similar issue when using the Splunk HEC source and raised bug report #17670. We then switched to the HTTP source and are now seeing a buffer error with it that causes Vector to stop ingesting any new data. The containers enter a restart loop with OOM errors on Kubernetes, and Vector logs the error below.
Kubernetes container status:
team-ingest-prod-0 0/1 OOMKilled 1 1m 19x.x.x.x vm-ch2-c <none>
Vector Error:
2023-08-10T11:48:04.553373Z INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-08-10T11:48:04.554149Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-08-10T11:48:06.390284Z WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `endpoint`. Please use `endpoints` option instead.
2023-08-10T11:48:06.390423Z WARN vector_core::tls::settings: The `verify_certificate` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:06.390436Z WARN vector_core::tls::settings: The `verify_hostname` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:06.437918Z WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `suppress_type_name`. Please use `api_version` option instead.
2023-08-10T11:48:07.994646Z WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `endpoint`. Please use `endpoints` option instead.
2023-08-10T11:48:07.994726Z WARN vector_core::tls::settings: The `verify_certificate` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:07.994738Z WARN vector_core::tls::settings: The `verify_hostname` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:08.039273Z WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `suppress_type_name`. Please use `api_version` option instead.
2023-08-10T11:48:08.107306Z INFO vector::topology::runner: Running healthchecks.
2023-08-10T11:48:08.107388Z INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:08.107538Z INFO vector::topology::builder: Healthcheck disabled.
2023-08-10T11:48:08.107558Z INFO source{component_kind="source" component_id=o11y_ingest_http component_type=http_server component_name=o11y_ingest_http}: vector::sources::util::http::prelude: Building HTTP server. address=0.0.0.0:8081
2023-08-10T11:48:08.107659Z INFO vector: Vector has started. debug="false" version="0.31.0" arch="x86_64" revision="38c3f0b 2023-05-22 17:38:48.655488673"
2023-08-10T11:48:08.107867Z INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:8181
2023-08-10T11:48:08.114700Z INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off
2023-08-10T11:48:08.122775Z INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:08.125420Z INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:43.659271Z ERROR transform{component_kind="transform" component_id=remap component_type=remap component_name=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2023-08-10T11:48:43.789150Z ERROR transform{component_kind="transform" component_id=remap component_type=remap component_name=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2023-08-10T11:48:43.789228Z INFO vector: Vector has stopped.
2023-08-10T11:48:43.801112Z INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="o11y_elastic_internal, metrics, vector, prom_exporter, o11y_ingest_http, o11y_maas_remote_write, internal_logs, remap_un_nest, o11y_ingest, o11y_elastic, remap, internal_metrics" time_remaining="59 seconds left"
2023-08-10T11:48:48.789690Z INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="o11y_elastic_internal, remap_un_nest" time_remaining="54 seconds left"
2023-08-10T11:48:53.790245Z INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="remap_un_nest" time_remaining="49 seconds left"
Configuration
data_dir = "/vector-data-dir"
[log_schema]
host_key = "host.hostname"
timestamp_key = "@timestamp"
[api]
enabled = true
address = "127.0.0.1:8686"
playground = false
[sources.internal_metrics]
type = "internal_metrics"
[sources.o11y_ingest]
address = "0.0.0.0:8080"
valid_tokens = [ "<token>" ]
type = "splunk_hec"
[sources.o11y_ingest_http]
address = "0.0.0.0:8081"
type = "http_server"
"framing.method" = "character_delimited"
"framing.character_delimited.delimiter" = ","
"framing.newline_delimited.max_length" = 8_192
"decoding.codec" = "json"
"encoding.codec" = "json"
[sources.internal_logs]
type = "internal_logs"
[transforms.remap_un_nest]
type = "remap"
inputs = [ "o11y_ingest", "o11y_ingest_http" ]
source = """
.message = parse_json!(.message)
. = unnest!(.message)"""
[transforms.remap]
type = "remap"
inputs = [ "remap_un_nest" ]
source = """
. = merge!(., .message)
parsed_json, err = parse_json(string!(.message))
if err == null {
., err = merge(., parsed_json)
del(.message)
} else {
del(.source_type)
del(.message)
}
del(.source_type)
del(.path)"""
[transforms.metrics]
type = "remap"
inputs = [ "internal_metrics" ]
source = """
.tags.job = "team-ingest-prod"
.tags.tenant_name = "team"
.tags.instance = "ingest-prod-ch2.example.com:443"
.tags.environment = "prod"
.tags.dc = "ch2"\
"""
[transforms.vector]
type = "remap"
inputs = [ "internal_logs" ]
source = """
.vector = del(.)
."@timestamp" = del(.vector.timestamp)
.message = del(.vector.message)
.log = {}
.log.logger = del(.vector.metadata.module_path)
.log.level = downcase(string!(del(.vector.metadata.level)))
.ecs = {}
.ecs.version = "1.7"
.cse = {}
.cse.version = "1.0"
.agent = {}
.agent.type = "vector"
.agent.version = "0.31.0"
.event = {}
.event.dataset = ["app"]
.host = {}
.host.hostname = get_env_var("ELK_HOSTNAME") ?? null
.cloud = {}
.cloud.provider = get_env_var("CLOUD_PROVIDER") ?? null
.cloud.region = get_env_var("CLOUD_REGION") ?? null
.tenant = {}
.tenant.name = "team_prod"
.service = {}
.service.name = "team_prod"
.host.o11y_hostname = get_hostname!()"""
[sinks.prom_exporter]
type = "prometheus_exporter"
inputs = [ "internal_metrics" ]
address = "0.0.0.0:8181"
[sinks.o11y_maas_remote_write]
type = "prometheus_remote_write"
inputs = [ "metrics" ]
default_namespace = "team"
endpoint = "https://<prometheus>.example.com:9090/api/v1/write"
tenant_id = "team_telemetry"
[sinks.o11y_maas_remote_write.healthcheck]
enabled = false
[sinks.o11y_elastic]
type = "elasticsearch"
inputs = [ "remap" ]
endpoint = "${ELK_URL}"
compression = "gzip"
suppress_type_name = true
[sinks.o11y_elastic.bulk]
index = "logz-team-app"
action = "create"
[sinks.o11y_elastic.auth]
strategy = "basic"
user = "${ELK_USERNAME}"
password = "${ELK_PASSWORD}"
[sinks.o11y_elastic.buffer]
type = "disk"
max_size = 900_000_000_000
when_full = "drop_newest"
[sinks.o11y_elastic.batch]
max_bytes = 8_300_000
timeout_secs = 8
[sinks.o11y_elastic.request]
concurrency = "adaptive"
timeout_secs = 120
[sinks.o11y_elastic.tls]
verify_certificate = false
verify_hostname = false
[sinks.o11y_elastic_internal]
type = "elasticsearch"
inputs = [ "vector" ]
compression = "gzip"
endpoint = "${ELK_URL}"
suppress_type_name = true
[sinks.o11y_elastic_internal.bulk]
index = "logz-forwarder"
action = "create"
[sinks.o11y_elastic_internal.auth]
strategy = "basic"
user = "${ELK_USERNAME}"
password = "${ELK_PASSWORD}"
[sinks.o11y_elastic_internal.buffer]
type = "disk"
max_size = 400_000_000
when_full = "drop_newest"
[sinks.o11y_elastic_internal.batch]
max_bytes = 8_300_000
timeout_secs = 8
[sinks.o11y_elastic_internal.request]
retry_attempts = 10
[sinks.o11y_elastic_internal.tls]
verify_certificate = false
verify_hostname = false
Version
0.31.x
Debug Output
No response
Example Data
No response
Additional Context
Vector is running in Kubernetes, and this specific client has large payloads with close to 6,000-8,000 fields across their entire dataset.
References
#17670 - We faced a similar issue with the Splunk HEC source; we moved to the HTTP source and are now seeing a new error.
Thanks for this report @raghu999! Did you share the right logs, though? The error logs there seem to be showing an error with the disk buffers. Do you have logs showing the panic in the http_server source?
@jszwedko Updated the issue. We are only seeing the buffer error, but it is the same dataset that produced an error with the Splunk HEC source; after switching to the HTTP server source we now see this new buffer-related error. We confirmed that the disk buffer is not full when this issue happens.
Gotcha, thanks for clarifying @raghu999! It sounds more like a bug in the disk buffer implementation than the http source then.
This error comes from our limitations around the maximum record size allowed in a disk buffer, which is set to 128MB. A record is essentially a chunk of events written all together, which, based on the sources used in the given configuration, would be any events decoded from a single request.
This would imply that, if the error above is being hit, a single request sent to either the splunk_hec or http_server source is very large, something like at least 90-100MB: each decoded event will have some internal overhead, so the request doesn't need to be 128MB by itself.
Based on your knowledge of the clients sending to your Vector instance, is it possible for requests to be that large (90-100MB)?
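To make the failure mode concrete, here is a minimal sketch (not Vector's actual buffer code; the constant and error shape are simplifications) of the kind of check that produces this error: a single encoded record larger than the fixed limit is rejected outright, regardless of how much free space the buffer has overall.

// Illustration only -- not Vector's implementation. It mimics a writer that
// refuses any single record whose encoded form exceeds a fixed limit.
const MAX_RECORD_BYTES: usize = 128 * 1024 * 1024; // ~128 MiB record size limit

#[derive(Debug)]
enum WriteError {
    // Mirrors the "failed to encode record: BufferTooSmall" failure.
    BufferTooSmall { required: usize, remaining: usize },
}

fn write_record(encoded_record: &[u8]) -> Result<(), WriteError> {
    if encoded_record.len() > MAX_RECORD_BYTES {
        return Err(WriteError::BufferTooSmall {
            required: encoded_record.len(),
            remaining: MAX_RECORD_BYTES,
        });
    }
    // ...append the record to the current data file...
    Ok(())
}

fn main() {
    // One oversized record (e.g. all events decoded from a single huge request,
    // written together as one chunk) is enough to trigger the error.
    let oversized = vec![0u8; MAX_RECORD_BYTES + 1];
    println!("{:?}", write_record(&oversized));
}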
Hi everyone! Facing the same issue here:
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.412419Z ERROR transform{component_kind="transform" component_id=parsing_transform_new component_type=remap component_name=parsing_transform_new}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.414728Z ERROR transform{component_kind="transform" component_id=parsing_transform_new component_type=remap component_name=parsing_transform_new}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.415193Z INFO vector: Vector has stopped.
This is the transform:
parsing_transform_new:
  type: remap
  inputs:
    - profile_route.parsing
  source: |-
    if (springboot_match, err = parse_regex(.message, r'^(?P<timestamp>\d+(?:-\d+){2}\s+\d+(?::\d+){2}\.\d+)\s+(?P<log_level>\S+)\s+(?P<pid>\d+)\s+---\s+\[(?P<thread_name>[\s+\S]*?)\]\s+(?P<logger_name>\S+)\s+:\s+(?s)(?P<message>.*)'); err == null) { # java_springboot
      .timestamp = parse_timestamp!(springboot_match.timestamp, "%F %T.%3f")
      .log_level = downcase!(springboot_match.log_level)
      .pid = to_int!(springboot_match.pid)
      .thread_name = strip_whitespace!(springboot_match.thread_name)
      .logger_name = springboot_match.logger_name
      .message = to_string(springboot_match.message)
    } else if (python_match, err = parse_regex(.message, r'^\[(?P<timestamp>\d+-\d+-\d+\s\d+:\d+:\d+,\d+)\s(?P<log_level>\w+)\s(?P<logger_name>\S+)\s(?P<thread_name>\w+\S+\s\S+)]\s(?s)(?P<message>.*)'); err == null) { # python
      .timestamp = parse_timestamp!(python_match.timestamp, "%F %T,%3f")
      .log_level = downcase!(python_match.log_level)
      .logger_name = python_match.logger_name
      .thread_name = strip_whitespace!(python_match.thread_name)
      .message = to_string(python_match.message)
    } else if (vault_match, err = parse_regex(.message, r'^(?P<timestamp>\d+-\d+-\d+T\d+:\d+:\d+.\d+Z)\s\[(?P<log_level>\w+)]\s+(?s)(?P<message>.*)$'); err == null) { # vault
      .timestamp = vault_match.timestamp
      .log_level = downcase!(vault_match.log_level)
      .message = to_string(vault_match.message)
    } else {
      .malformed = true
    }
In my case it is unlikely for requests to be even close to 90-100 MB. Version: v0.32.0
We're seeing this error again with a different application. Some relevant log lines
vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2023-11-15T23:45:07.499790Z DEBUG transform{component_kind="transform" component_id=remap component_type=remap}: vector::topology::builder: Synchronous transform finished with an error.
2023-11-15T23:45:07.499819Z ERROR transform{component_kind="transform" component_id=remap component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2023-11-15T23:45:07.499874Z INFO vector: Vector has stopped.
2023-11-15T23:45:07.499934Z DEBUG vector_buffers::variants::disk_v2::writer: Writer marked as closed.
2023-11-15T23:45:07.500314Z DEBUG source{component_kind="source" component_id=internal_logs component_type=internal_logs}: vector::topology::builder: Source finished normally.
The source is http_server. The symptom is that memory utilization in the pod grows to the point of being OOM killed. But given the 'Source finished normally' line in the log above, Vector might just have exited?
Version we used: v0.34.0
A few questions:
- So the `remap` transform is also accessing the disk buffer while doing the transform?
- If transforms also access disk buffers, is there a configuration for the disk buffer size for transforms?
- When we removed the transform our pods were no longer OOMKilled, so is this not a record size limitation but a transform limitation?
- Is there a way for Vector to print/obtain the event size inside a remap transform?
- What does `Synchronous transform finished with an error.` mean? (What function call might trigger this error message? Or, are all transforms synchronous, and what characterizes a "synchronous" transform?)
I am seeing similar issues
2024-02-20T09:21:25.663269Z ERROR transform{component_kind="transform" component_id=kubernetes_application component_type=remap component_name=kubernetes_application}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2024-02-20T09:21:25.663344Z INFO vector: Vector has stopped.
2024-02-20T09:21:25.663365Z ERROR transform{component_kind="transform" component_id=route_logs component_type=route component_name=route_logs}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-02-20T09:21:25.676755Z ERROR transform{component_kind="transform" component_id=containerlogs component_type=remap component_name=containerlogs}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-02-20T09:21:25.680198Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="splunk_kube_audit, fluentbit, kubernetes_application, prometheus_exporter, splunk_container_logs, kubernetes_audit, httpd, vector_metrics, kubernetes_infrastructure, route_logs, journal_logs, containerlogs" time_remaining="59 seconds left"
The containers exit under a certain load, around 10k events/sec.
@awangc have you managed to find the root cause or a workaround? I've seen similar issues while sending logs from upstream Vector pods that are part of OpenShift, where the default batch size is 10MB I believe. I've been testing by sending just generic nginx logs, so really small events. Under a certain load, Vector would exit with the same buffer-related error. I have observed the issue not happening when the batch size on the upstream Vector is set to 1MB (so 10 times less).
@jszwedko is there any documentation on how transforms are related to disk buffers?
Transforms emit data to an output channel which is then consumed by buffers. Is that the sort of detail you were looking for? Or do you have a different sort of question?
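Roughly, that handoff looks like the sketch below (a simplified stand-in using a plain standard-library channel; this is an illustration, not Vector's actual topology code): the transform pushes events into a channel, and the buffer/sink side drains it. The BufferTooSmall error above is raised on the writing (transform) side of that handoff.

use std::sync::mpsc;
use std::thread;

fn main() {
    // Transform side: produces events and sends them into the channel.
    let (sender, receiver) = mpsc::channel::<String>();

    let transform = thread::spawn(move || {
        for i in 0..3 {
            // In Vector this is roughly where the remap transform hands events to
            // the buffer; a failed write surfaces as the transform error seen above.
            sender.send(format!("event {i}")).expect("buffer writer failed");
        }
    });

    // Buffer/sink side: consumes whatever the transform emitted.
    let consumer = thread::spawn(move || {
        for event in receiver {
            println!("buffered for sink: {event}");
        }
    });

    transform.join().unwrap();
    consumer.join().unwrap();
}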
For this issue, I know it is a lot to ask, but if anyone is able to create a standalone reproduction, that would aid with debugging.
I have managed to create a test case that recreates this issue. It seems to be a combination of large events, a transform, and disk buffers for a sink.
I created a large JSON test event with a "message" field also containing JSON, approx. 10 MB. I then replicated that same event multiple times into 3 different files, each about 500 MB. I can provide the example event if required.
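In case it helps others reproduce, here is a rough sketch of a generator for this kind of fixture (the inner JSON layout, sizes, and output paths are placeholders I chose, not the exact event): each line is a JSON object whose message field holds an escaped ~10 MB JSON document, repeated until the file is roughly 500 MB.

use std::fs::{create_dir_all, File};
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Inner JSON document (~10 MB) that will be embedded as the "message" string.
    let inner = format!(r#"{{"padding":"{}"}}"#, "x".repeat(10 * 1024 * 1024));
    // Escape backslashes and quotes by hand so the outer line stays valid JSON
    // without pulling in a JSON crate.
    let escaped = inner.replace('\\', "\\\\").replace('"', "\\\"");
    let line = format!("{{\"message\":\"{}\"}}\n", escaped);

    create_dir_all("/tmp/vector/source-events")?;
    for name in ["large_events.json", "large_events_2.json", "large_events_3.json"] {
        let file = File::create(format!("/tmp/vector/source-events/{name}"))?;
        let mut out = BufWriter::new(file);
        for _ in 0..50 {
            // ~50 copies of a ~10 MB line gives a file of roughly 500 MB.
            out.write_all(line.as_bytes())?;
        }
        out.flush()?;
    }
    Ok(())
}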
Below is the config to replicate it along with the output of a run.
api:
  enabled: true
data_dir: /tmp/vector/data
sources:
  events:
    type: file
    include:
      - /tmp/vector/source-events/*.json
    ignore_checkpoints: true
    fingerprinting:
      strategy: device_and_inode
    max_line_bytes: 524288000
sinks:
  blackhole:
    type: blackhole
    inputs:
      - split_events
    buffer:
      type: disk
      max_size: 1073741824
      when_full: block
    print_interval_secs: 10
    rate: 10
transforms:
  split_events:
    type: remap
    inputs:
      - events
    source: |-
      . = parse_json!(.message)
$ ./vector-0.44.0 -c config.yaml
2025-01-20T23:35:07.768304Z INFO vector::app: Log level is enabled. level="info"
2025-01-20T23:35:07.769396Z INFO vector::app: Loading configs. paths=["config.yaml"]
2025-01-20T23:35:08.043441Z INFO vector::topology::running: Running healthchecks.
2025-01-20T23:35:08.043814Z INFO vector::topology::builder: Healthcheck passed.
2025-01-20T23:35:08.044247Z INFO vector: Vector has started. debug="false" version="0.44.0" arch="x86_64" revision="3cdc7c3 2025-01-13 21:26:04.735691656"
2025-01-20T23:35:08.043928Z INFO source{component_kind="source" component_id=events component_type=file}: vector::sources::file: Starting file server. include=["/tmp/vector/source-events/*.json"] exclude=[]
2025-01-20T23:35:08.045909Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: file_source::checkpointer: Loaded checkpoint data.
2025-01-20T23:35:08.046107Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/tmp/vector/source-events/large_event.json
2025-01-20T23:35:08.046154Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/tmp/vector/source-events/large_events.json
2025-01-20T23:35:08.046180Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/tmp/vector/source-events/large_events_2.json
2025-01-20T23:35:08.046203Z INFO source{component_kind="source" component_id=events component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/tmp/vector/source-events/large_events_3.json
2025-01-20T23:35:08.056236Z INFO vector::sinks::blackhole::sink: Collected events. events=0 raw_bytes_collected=0
2025-01-20T23:35:08.069372Z INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=http://127.0.0.1:8686/playground graphql=http://127.0.0.1:8686/graphql
2025-01-20T23:35:18.817311Z INFO vector::sinks::blackhole::sink: Collected events. events=21 raw_bytes_collected=203819011
2025-01-20T23:35:27.231962Z ERROR transform{component_kind="transform" component_id=split_events component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2025-01-20T23:35:27.235103Z ERROR transform{component_kind="transform" component_id=split_events component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2025-01-20T23:35:27.239027Z INFO vector: Vector has stopped.
2025-01-20T23:35:27.247389Z INFO vector::sinks::blackhole::sink: Collected events. events=21 raw_bytes_collected=203819011
2025-01-20T23:35:27.247922Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="events" time_remaining="59 seconds left"
2025-01-20T23:35:27.642362Z ERROR source{component_kind="source" component_id=events component_type=file}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
@jszwedko are you able to replicate this with what I have provided above?
Hey! Thank you so much for providing the reproduction case. That is incredibly helpful. I haven't had the time to dig into this yet, but @pront is going to take it up instead.
Hi @DenverJ, thank you for providing the repro steps. I managed to reproduce this bug locally with your setup.
❯ ls -ltr
total 1542156
-rw-r--r-- 1 pavlos.rontidis 513807973 Feb 25 15:51 file1.json
-rw-r--r-- 1 pavlos.rontidis 513807973 Feb 25 15:51 file2.json
-rw-r--r-- 1 pavlos.rontidis 513807973 Feb 25 15:51 file3.json
The inner error was:
failed to encode Protobuf message; insufficient buffer capacity (required: 817903482, remaining: 134217680)
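For context, a quick conversion of those byte counts (just back-of-the-envelope arithmetic, not part of the original report): the writer had roughly the 128 MiB record limit available, and the encoded record was about six times larger.

fn main() {
    const MIB: f64 = 1024.0 * 1024.0;
    let required: u64 = 817_903_482; // encoded size of the record the writer tried to store
    let remaining: u64 = 134_217_680; // capacity the writer reported as available

    // Prints: required ~ 780.0 MiB, remaining ~ 128.0 MiB
    println!("required  ~ {:.1} MiB", required as f64 / MIB);
    println!("remaining ~ {:.1} MiB", remaining as f64 / MIB);
    // The remaining capacity matches the ~128 MB record size limit mentioned earlier
    // (less a few bytes of overhead), so this record exceeded it roughly sixfold.
}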
One more variation of the config for the bug:
sources:
  pods_logs:
    type: kubernetes_logs
    max_line_bytes: 131072
transforms:
  pods_source_types:
    type: remap
    inputs:
      - pods_logs
    source: |-
      .source = "pods_logs"
      .source_environment = "some-env"
sinks:
  to_kafka:
    type: kafka
    inputs:
      - pods_source_types
    bootstrap_servers: ...
    compression: zstd
    encoding:
      codec: "json"
    topic: logs-kubernetes-some-env
    batch:
      max_bytes: 1048576 # 1MB
      # The maximum size of a batch before it is flushed.
      max_events: 100
      # The maximum age of a batch before it is flushed.
      timeout_secs: 1
    buffer:
      max_size: 5368709120 # 5GB
      type: disk
log:
F 2025-05-02T18:49:54.285554Z ERROR transform{component_kind="transform" component_id=pods_source_types component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
F 2025-05-02T18:49:54.290775Z ERROR transform{component_kind="transform" component_id=pods_source_types component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
Events are not super small though (thus we had to increase max_line_bytes).
@pront Hi, are there any updates on this issue? We are facing the same one as well.
Hi @charlie0129, unfortunately we do not have the capacity to prioritize this. We do have a reproduction, which is a great first step. Community PRs are always welcome.
We're seeing this issue as well.
We are facing the same issue with the disk buffer.