"Unable to be deserialized" error on disk buffer blocks Vector from sending data out
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Recently we ran a perf test with Vector acting as the server that collects metrics data and forwards it to Kafka. In our cluster we ran into some OOM crashes that caused some pods to restart; after restarting, several Vector replicas reported this error:
2023-08-19T01:52:51.816411Z ERROR sink{component_kind="sink" component_id=*** component_type=kafka component_name=***}: vector_buffers::variants::disk_v2::writer: Last written record was unable to be deserialized. Corruption likely. reason="invalid data: check failed for struct member payload: pointer out of bounds: base 0x7f5c9ce00008 offset 1635021580 not in range 0x7f5c8f3e7000..0x7f5c90000000"
The replicas with this error still receive metrics data (http_server source) but no longer forward it to Kafka (sink). We didn't have logging enabled by the time this error happened, and reproducing it is costly because of the load we need to generate (we will update here when we do it again). This results in data loss, which I don't think is the behavior we expect. I'm not sure if anyone else has run into a similar problem; I'd like to understand the root cause.
The Kafka buffer usage drops to zero
The http_server source is still receiving data
The Kafka request count drops to zero
We could put some alerting around the service, but if the buffer I/O gets into this kind of blocked state we would probably want to remove that replica from service. However, I can't find how to configure the http_server source to block new data when the buffer is full.
Configuration
```yaml
customConfig:
  api:
    address: *.*.*.*:**
    enabled: true
    playground: false
  data_dir: /var/lib/vector
  sinks:
    kafka_startree:
      type: kafka
      inputs:
        - http_server
      bootstrap_servers: ""
      topic: ""
      encoding:
        codec: "json"
      compression: "lz4"
      buffer:
        type: disk
        when_full: drop_newest
        max_size: 268435488
      sasl:
        ****
      tls:
        enabled: true
  sources:
    http_server:
      type: http_server
      address: 0.0.0.0:8080
      encoding: json
      method: POST
      headers:
```
Version
timberio/vector:0.31.0-distroless-libc
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response
From your description, it looks like there are actually three different things going on here:
- you're seeing disk buffer errors at startup after pods are restarted
- you want to be able to stop accepting data at the source when a disk buffer is full
- your Kafka sink stops sending data after a period of time
Buffer errors at startup
In your case, what is happening is that Vector was in the middle of writing a record (event) to the disk buffer as it was killed by Kubernetes. When Vector restarts and validates where it left off with writing to/reading from the disk buffer, it realizes that this partial write occurred, and then emits an error message to inform you of this.
This is expected behavior because Vector cannot ensure atomic writes to disk, and even further, modifications to disk buffer files could occur without Vector even running: manipulating the files manually, filesystem/disk corruption, etc. At best, Vector is able to inform you that the data in the disk buffer is invalid (in some way) and what that means (potentially lost events, etc).
We report this at the error level because it's serious -- as you note, you don't expect your data to be lost, and most users consider lost data to be a big issue -- and we want users to be able to understand when this occurs... but there's nothing to change about it, besides trying to adjust deployments to keep Vector from being OOM killed where possible.
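For example (just a rough sketch, not something from your setup), if you're deploying with the official `vector` Helm chart, which the `customConfig` key suggests, giving the pods more memory headroom via resource requests/limits is the usual lever for avoiding OOM kills; the numbers below are placeholders you'd need to tune:

```yaml
# Sketch of Helm values for the vector chart (assumed deployment method);
# adjust the numbers to your actual workload. More memory headroom makes
# OOM kills, and thus partially written disk buffer records, less likely.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    memory: "2Gi"
```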
Having sources stop accepting data when a downstream sink has filled its disk buffer
This is actually the default behavior, but you're overriding it in your configuration by specifying when_full: drop_newest. Backpressure propagates from sinks to transforms to sources, so if a disk buffer fills at the sink level, that backpressure will eventually propagate back to your source, which in turn will block callers trying to write more data.
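Concretely, either remove the `when_full` line (`block` is the default) or set it explicitly. A minimal sketch against the sink config you posted:

```yaml
# Sketch: the same kafka sink buffer, but blocking (the default) instead of
# dropping when the disk buffer fills, so backpressure propagates back to the
# http_server source and callers see slowed/rejected writes instead of data loss.
buffer:
  type: disk
  when_full: block
  max_size: 268435488
```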
Kafka sink stops sending data
This looks like a potential legitimate problem you're experiencing.
Can you provide some more information about the test you were running? A few things specifically:
- what version of Vector are you running?
- did the Vector restarts occur at the same time that the processing rate to Kafka drops to zero?
Hi @tobz, thanks for replying to this issue. It's been a long time since I reported it, so I need to pick it back up, and some data may already be missing.
We are running the Vector version timberio/vector:0.31.0-distroless-libc. The Kafka drop to zero does NOT happen at the same time as the restart; because each replica has a random name suffix, we aren't able to see metrics from the same replica after Kubernetes kills it. For the buffer, I think your comments totally make sense; I will probably change to the other when_full setting so we don't send data to the buggy replica.
The thing I'm not sure is by design: when a replica gets into this state (blocked from reading the buffer), it is just dead and does not recover by itself; we need a manual operation like killing the pod. I totally agree with your point about the design: leaving it in the errored state gives us the chance to notice and debug it.
We recently saw a similar issue. Our Vector pods (on 0.32.0 distroless), sending data to an Elasticsearch sink, started being OOM-killed and then entered a crash/restart loop.
We started seeing the same error message in the logs, "Last written record was unable to be deserialized. Corruption likely.", with reason "invalid data: check failed for struct member payload: pointer out of bounds: base 0x7fb7b0c00008 offset 1929447951 not in range 0x7fb7b0a00000..0x7fb7b0c00000". Data sent to Elasticsearch drops and recovers after a few hours. The memory on the host machines never gets full; we're not sure what triggered the first OOM or what helped it recover.
When vector encounters a corrupted disk buffer, what does it do? Continue while accumulating events in memory? How best to track/obtain information on what triggered this? Setting VECTOR_LOG=debug?
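For instance, would setting it as a container environment variable (just a sketch; we're assuming that's the right place) be the way to capture more detail next time?

```yaml
# Fragment of the Vector container spec (hypothetical): VECTOR_LOG controls
# Vector's log level, so "debug" should surface more detail about buffer
# reads/writes the next time this happens.
env:
  - name: VECTOR_LOG
    value: "debug"
```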
@awangc Best to open a new issue for this.
Do we have an update here? I see this deserialization error occurring very often. Is there any workaround for it?
Whether drop_newest or block is used, a corrupted message should either be sent to a "dead letter queue" or dropped. Stopping the entire pipeline due to a corrupted message introduces a complete failure in processing messages upon a restart (due to OOM for example). It defeats the purpose of having disk buffers. Is that a bug? How can we proceed?
This is a bug. The only, unfortunate, workaround I'm aware of at the moment is to delete the disk buffer.