
Log data carries over to new events when sending batched events to the Splunk HEC input

Open MadsRC opened this issue 2 years ago • 0 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When batched log events are sent to the Splunk HEC source, Vector carries metadata values from earlier events over into later events that do not set those keys themselves.

I've reproduced this using the 0.30.0-debian and nightly-debian (nightly as of today).

As an example, review the following output from Vector when sending the data posted under "Example data" (curl -v "http://localhost:8088/services/collector" -H "Authorization: Splunk secret" -d @data.json):

2023-06-20T17:42:31.300147Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-06-20T17:42:31.301764Z  INFO vector::app: Loading configs. paths=["/etc/vector/vector.toml"]
2023-06-20T17:42:31.305788Z  INFO vector::topology::running: Running healthchecks.
2023-06-20T17:42:31.306012Z  INFO vector: Vector has started. debug="false" version="0.31.0" arch="aarch64" revision="d122d32 2023-06-20 04:03:27.361000217"
2023-06-20T17:42:31.306017Z  INFO vector::topology::builder: Healthcheck passed.
2023-06-20T17:42:31.306025Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
{"host":"orignalHostA","message":"The Project Gutenberg eBook of The Divine Comedy, by Dante Alighieri","source_type":"splunk_hec","splunk_index":"overriddenIndexB","splunk_source":"originalSourceA","splunk_sourcetype":"originalSourcetypeA","timestamp":"2023-06-20T12:39:30Z"}
{"host":"overriddenHostB","message":"whatsoever. You may copy it, give it away or re-use it under the terms","source_type":"splunk_hec","splunk_index":"overriddenIndexB","splunk_source":"originalSourceA","splunk_sourcetype":"originalSourcetypeB","timestamp":"2023-06-20T12:39:30Z"}

Notice that the second event, which in the example data only sets host and sourcetype as metadata, suddenly contains values for splunk_index and splunk_source in the Vector output, in addition to the expected splunk_sourcetype and host. My guess is that these values are carried over from the previous event, given that the previous event contains source and index keys with matching values.

I suppose that when receiving batched events, Vector iterates over them and breaks them out into individual log events, but forgets to clear the state between iterations.
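
To make the suspicion concrete, here is a minimal Python sketch of the behavior described above. It is a hypothetical model, not Vector's actual code: the function `split_batch`, the `reset_between_events` flag, and the `METADATA_KEYS` tuple are names invented for illustration. It only shows how metadata state kept across loop iterations would produce the observed output.

```python
# Hypothetical model of the suspected behavior; this is NOT Vector's code.
METADATA_KEYS = ("host", "source", "sourcetype", "index")


def split_batch(events, reset_between_events):
    """Break a batch into individual records, tracking metadata state."""
    carried = {}
    records = []
    for event in events:
        if reset_between_events:
            # Expected behavior: each event starts from a clean slate.
            carried = {}
        # Metadata set on this event overrides whatever is in the state.
        for key in METADATA_KEYS:
            if key in event:
                carried[key] = event[key]
        record = {"message": event["event"]}
        record.update(carried)
        records.append(record)
    return records


# Shortened version of the two events from "Example Data".
batch = [
    {"host": "orignalHostA", "source": "originalSourceA",
     "sourcetype": "originalSourcetypeA", "index": "overriddenIndexB",
     "event": "first"},
    {"host": "overriddenHostB", "sourcetype": "originalSourcetypeB",
     "event": "second"},
]

leaky = split_batch(batch, reset_between_events=False)
clean = split_batch(batch, reset_between_events=True)

print(leaky[1])  # second event inherits source and index from the first
print(clean[1])  # second event only has the metadata it set itself
```

With the state kept between iterations, the second record matches what Vector printed: it picks up `source` and `index` from the first event. With the reset, it only carries `host` and `sourcetype`, which is what I expected.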

Configuration

[sources.in]
type = "splunk_hec"
valid_tokens = [ "secret" ]

[sinks.console]
inputs = ["in"]
target = "stdout"
type = "console"
encoding.codec = "json"

Version

0.30.0

Debug Output

No response

Example Data

{"time":1687264770,"host":"orignalHostA","source":"originalSourceA","sourcetype":"originalSourcetypeA","index":"overriddenIndexB","event":"The Project Gutenberg eBook of The Divine Comedy, by Dante Alighieri"}{"time":1687264770,"host":"overriddenHostB","sourcetype":"originalSourcetypeB","event":"whatsoever. You may copy it, give it away or re-use it under the terms"}

Additional Context

No response

References

No response

Jun 20 '23 17:06 MadsRC