Tiered buffers on overflow: two bugs
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Two problems:
- The buffer_type tag in internal metrics always reports memory, even after the buffer starts overflowing to disk
- Some metrics pipelines with higher traffic spill to disk. Because of problem 1 we can't see when the memory limit is reached and the disk tier starts being used (internal metrics always report the memory buffer), but once events reach disk the buffer keeps growing and looks like it is either not being sent or not being cleaned up once it has been sent from disk (see the metrics-exposure sketch below)
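For anyone trying to reproduce the metrics side of this, a minimal sketch of how we expose the buffer metrics (component names and the exporter address here are illustrative, not our exact config):

# Illustrative sketch: expose Vector's own buffer metrics so the buffer_type
# tag can be checked while the sink is overflowing to disk.
sources:
  vector_internal_metrics:
    type: internal_metrics   # illustrative component name

sinks:
  vector_internal_prom:
    type: prometheus_exporter
    inputs:
      - vector_internal_metrics
    address: "0.0.0.0:9598"  # illustrative address

With this in place we watch the buffer_events / buffer_byte_size gauges for the affected sink's component_id: the buffer_type tag stays "memory" even while the buffer-data-*.dat files on disk are clearly growing.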
When we look inside one of the kpods we can see that the disk buffer is in use and growing, just as the graphs show.
I have no name!@vector-metrics-egress-us-east-1d-2:/data_dir/buffer/v2/datadog_agent_misc_metrics$ ls -la
total 272280
drwxrwsr-x 2 1000 1000 4096 Apr 17 07:28 .
drwxrwsr-x 4 1000 1000 4096 Nov 7 13:27 ..
-rw-rw---- 1 1000 1000 133820872 Apr 16 22:17 buffer-data-26599.dat
-rw-rw---- 1 1000 1000 133983344 Apr 17 07:28 buffer-data-26603.dat
-rw-r----- 1 1000 1000 10990808 Apr 17 07:58 buffer-data-26604.dat
-rw-rw-r-- 1 1000 1000 24 Apr 17 08:41 buffer.db
-rw-r--r-- 1 1000 1000 0 Apr 17 03:23 buffer.lock
No additional info in logs.
Configuration
## non-custom metrics coming from dd-agent - non DogStatsD
type: datadog_metrics
inputs:
  - metrics_route._unmatched
  # - datadog_agent
default_api_key: "SECRET[secrets.METRICS_EGRESS_DD_API_KEY]"
endpoint: "${DD_SITE}" # override for pvlink instead of the public site option
buffer:
  - type: memory
    max_events: 50000
    when_full: overflow
  - type: disk
    max_size: 15000000000 # close to 15 GB - 9 GB+ in total with internal metrics, and we have 10 GB on the volume now
    when_full: block
batch:
  max_events: 5000
  timeout_secs: 1
acknowledgements:
  enabled: false
request:
  concurrency: "adaptive"
Version
0.37
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response
Even after kpod redeployment, the old buffers are not flushed from disk.
I have no name!@vector-metrics-egress-us-west-2c-0:/data_dir/buffer/v2/datadog_agent_misc_metrics$ ls -lah
total 552M
drwxrwsr-x 2 1000 1000 4.0K Apr 17 08:57 .
drwxrwsr-x 4 1000 1000 4.0K Apr 15 11:17 ..
-rw-rw---- 1 1000 1000 125M Apr 16 14:48 buffer-data-15.dat
-rw-rw---- 1 1000 1000 128M Apr 16 16:33 buffer-data-18.dat
-rw-rw---- 1 1000 1000 126M Apr 15 19:37 buffer-data-1.dat
-rw-rw---- 1 1000 1000 126M Apr 17 04:06 buffer-data-29.dat
-rw-rw---- 1 1000 1000 48M Apr 17 08:54 buffer-data-41.dat
-rw-rw---- 1 1000 1000 24 Apr 17 08:59 buffer.db
-rw-r--r-- 1 1000 1000 0 Apr 17 08:57 buffer.lock
Two days' worth of buffers are now still sitting on disk.
OK, after the redeploy the number of events drops properly, but the buffer files remain on disk - does this mean they are not cleaned up after being sent?
Thanks for this report @szibis. I'm not actually sure when the buffer files are deleted as events are processed. @tobz, is this something you know off the top of your head?
Maybe they are not processed once they reach the disk buffer on overflow, and that's why the files are not removed?