Incorrectly checkpointing journald logs when unable to send to sink
Problem
We have Vector collecting logs from systemd units via journald and forwarding them to the CloudWatch Logs (CWL) sink. Sometimes we suspend servers for long periods of time and resume them later. By the time a server resumes, its AWS credentials are no longer valid. Vector starts up and attempts to forward logs to CWL, but the API calls fail, and eventually Vector exits and is restarted by systemd.
After getting new credentials and restarting Vector, we don't receive any of the logs that were generated while the AWS credentials were stale. My best guess is that Vector incorrectly checkpoints the journald stream even though the logs were never successfully uploaded to CWL.
Configuration
healthchecks.enabled = false
[sources.postgres]
type = "syslog"
mode = "unix"
path = "/run/postgres-audit.socket"
[sources.kernel]
type = "journald"
current_boot_only = true
include_matches = { _TRANSPORT = ["kernel"] }
[transforms.postgres_xform]
type = "remap"
inputs = [ "postgres" ]
source = """
.SYSLOG_IDENTIFIER = "postgres"
"""
[transforms.kernel_xform]
type = "remap"
inputs = [ "kernel" ]
source = """
. = {
  # SYSLOG_IDENTIFIER is used by the sink to forward to the appropriate CloudWatch Logs stream
  "SYSLOG_IDENTIFIER": "kernel",
  "boot_id": ._BOOT_ID,
  "message": .message
}
"""
[sources.pgbouncer]
type = "journald"
current_boot_only = false
include_units = [ "[email protected]" ]
[sources.sshd]
type = "journald"
current_boot_only = false
include_units = [ "sshd" ]
[sources.auditd]
type = "file"
include = [ "/var/log/audit/audit.log*" ]
read_from = "beginning"
[transforms.auditd_xform]
type = "remap"
inputs = [ "auditd" ]
source = """
. |= parse_key_value!(.message)
.SYSLOG_IDENTIFIER = "audit"
"""
[sinks.cloudwatch_pg_audit]
type = "aws_cloudwatch_logs"
inputs = [ "auditd_xform", "postgres_xform", "sshd", "pgbouncer", "kernel_xform" ]
create_missing_group = false
create_missing_stream = false
group_name = "zxcv"
stream_name = "asdf-{{ SYSLOG_IDENTIFIER }}"
region = "us-west-2"
encoding.codec = "json"
Version
vector 0.26.0 (x86_64-unknown-linux-gnu c6b5bc2 2022-12-05)
Hey @rbishop sorry to hear you're having problems. Can you please share your configuration?
@spencergilbert added to the issue body
Thanks! This behavior is seen on all of your journald sources? Is the file source watching auditd exhibiting the same?
The file source is working properly.
@rbishop I actually think this is correct behavior from Vector. According to the journald docs, the checkpointing happens after a read: https://vector.dev/docs/reference/configuration/sources/journald/#checkpointing
One way to overcome this issue is to use the acknowledgements feature in sinks, like in the aws_cloudwatch_logs sink: https://vector.dev/docs/reference/configuration/sinks/aws_cloudwatch_logs/#acknowledgements
Note: acknowledgements won't work for the syslog source. There isn't much we can do there because checkpointing isn't supported by the socket interface.
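For illustration, here is a minimal sketch of the sink from the configuration above with acknowledgements turned on. It assumes the installed Vector version supports the sink-level `acknowledgements.enabled` option described in the linked docs; everything else is copied from the issue body. With this set, the journald sources should wait for the sink to confirm delivery before advancing their checkpoint. The `postgres` syslog source would not benefit.

```toml
[sinks.cloudwatch_pg_audit]
type = "aws_cloudwatch_logs"
inputs = [ "auditd_xform", "postgres_xform", "sshd", "pgbouncer", "kernel_xform" ]
create_missing_group = false
create_missing_stream = false
group_name = "zxcv"
stream_name = "asdf-{{ SYSLOG_IDENTIFIER }}"
region = "us-west-2"
encoding.codec = "json"
# Assumption: sink-level end-to-end acknowledgements, per the linked sink docs.
# Sources that support acknowledgements hold their checkpoint until delivery succeeds.
acknowledgements.enabled = true
```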
Closing this since this is expected behavior.