Tiered buffers on overflow: two bugs

Open szibis opened this issue 1 year ago • 4 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Two problems:

  1. The buffer_type tag in the internal metrics always reports memory, even after the buffer starts using disk on overflow.
  2. Some higher-traffic metrics go to disk. As in problem 1, we can't see when the memory limit is reached and the disk buffer starts being used, because the internal metrics always report a memory buffer. Once data reaches disk, the disk buffer keeps growing, and it looks like events are either not sent or not cleaned up after being sent from disk.
image

When we look into one of the kpods, we see that the disk buffer is in use and growing, just as shown on the graphs.

I have no name!@vector-metrics-egress-us-east-1d-2:/data_dir/buffer/v2/datadog_agent_misc_metrics$ ls -la
total 272280
drwxrwsr-x 2 1000 1000      4096 Apr 17 07:28 .
drwxrwsr-x 4 1000 1000      4096 Nov  7 13:27 ..
-rw-rw---- 1 1000 1000 133820872 Apr 16 22:17 buffer-data-26599.dat
-rw-rw---- 1 1000 1000 133983344 Apr 17 07:28 buffer-data-26603.dat
-rw-r----- 1 1000 1000  10990808 Apr 17 07:58 buffer-data-26604.dat
-rw-rw-r-- 1 1000 1000        24 Apr 17 08:41 buffer.db
-rw-r--r-- 1 1000 1000         0 Apr 17 03:23 buffer.lock

No additional info in logs.
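
Since the buffer_type tag can't be trusted here, one way to track disk buffer growth independently of the internal metrics is to sum the data files directly. This is a minimal sketch, assuming the buffer-data-*.dat naming seen in the listing above; the function name disk_buffer_usage is hypothetical and not part of Vector.

```python
from pathlib import Path


def disk_buffer_usage(buffer_dir):
    """Return (file_count, total_bytes) for buffer-data-*.dat files in buffer_dir.

    Assumes the file naming seen in the directory listing; buffer.db and
    buffer.lock are deliberately excluded from the total.
    """
    files = sorted(Path(buffer_dir).glob("buffer-data-*.dat"))
    return len(files), sum(f.stat().st_size for f in files)
```

Calling disk_buffer_usage("/data_dir/buffer/v2/datadog_agent_misc_metrics") at intervals and comparing the results would show whether old data files are ever removed.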

Configuration

## non-custom metrics coming from dd-agent - non-DogstatsD
type: datadog_metrics
inputs:
  - metrics_route._unmatched
#  - datadog_agent
default_api_key: "SECRET[secrets.METRICS_EGRESS_DD_API_KEY]"
endpoint: "${DD_SITE}" # override for pvlink instead of public site option
buffer:
  - type: memory
    max_events: 50000
    when_full: overflow
  - type: disk
    max_size: 15000000000 # close to 15GB - total 9GB+ with internal metrics and we have 10GB now on volume.
    when_full: block
batch:
  max_events: 5000
  timeout_secs: 1
acknowledgements:
  enabled: False
request:
  concurrency: "adaptive"
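
To make the expectation behind this config explicit, here is a toy model of the behavior I would expect from the two buffer stages: memory fills first, then events overflow to disk, and the writer blocks only when disk is also full. This is a sketch of my assumption, not Vector's implementation. The class TieredBuffer is hypothetical, disk capacity is counted in events rather than bytes for simplicity, and the drain order (memory first, then disk) is also an assumption.

```python
from collections import deque


class TieredBuffer:
    """Toy model of a memory buffer that overflows to a disk buffer.

    Illustrative only; this is not how Vector implements buffers.
    """

    def __init__(self, memory_max_events, disk_max_events):
        self.memory = deque()
        self.disk = deque()
        self.memory_max_events = memory_max_events
        # Stand-in for max_size, which the real config expresses in bytes.
        self.disk_max_events = disk_max_events

    def push(self, event):
        """Store an event and return the tier that accepted it."""
        if len(self.memory) < self.memory_max_events:
            self.memory.append(event)
            return "memory"
        # when_full: overflow -> fall through to the next stage.
        if len(self.disk) < self.disk_max_events:
            self.disk.append(event)
            return "disk"
        # when_full: block -> a real writer would wait here instead of raising.
        raise BlockingIOError("both tiers are full")

    def pop(self):
        """Return (tier, event) from memory first, then disk, or None if empty."""
        if self.memory:
            return "memory", self.memory.popleft()
        if self.disk:
            return "disk", self.disk.popleft()
        return None
```

In this model, the tier returned by push and pop is what I would expect a buffer_type tag to reflect, and an event leaves the disk tier as soon as it is popped.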

Version

0.37

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

szibis avatar Apr 17 '24 08:04 szibis

Even after redeploying the kpod, the old buffers are not flushed from disk.

I have no name!@vector-metrics-egress-us-west-2c-0:/data_dir/buffer/v2/datadog_agent_misc_metrics$ ls -lah
total 552M
drwxrwsr-x 2 1000 1000 4.0K Apr 17 08:57 .
drwxrwsr-x 4 1000 1000 4.0K Apr 15 11:17 ..
-rw-rw---- 1 1000 1000 125M Apr 16 14:48 buffer-data-15.dat
-rw-rw---- 1 1000 1000 128M Apr 16 16:33 buffer-data-18.dat
-rw-rw---- 1 1000 1000 126M Apr 15 19:37 buffer-data-1.dat
-rw-rw---- 1 1000 1000 126M Apr 17 04:06 buffer-data-29.dat
-rw-rw---- 1 1000 1000  48M Apr 17 08:54 buffer-data-41.dat
-rw-rw---- 1 1000 1000   24 Apr 17 08:59 buffer.db
-rw-r--r-- 1 1000 1000    0 Apr 17 08:57 buffer.lock

Two days of buffers are now still on disk.

szibis avatar Apr 17 '24 09:04 szibis

OK, after the redeploy the number of events drops properly, but the buffer files remain on disk. Does this mean they are not cleaned up after being sent? image

szibis avatar Apr 17 '24 09:04 szibis

Thanks for this report, @szibis. I'm not actually sure when the buffer files are deleted once events are processed. @tobz, is this something you know off the top of your head?

jszwedko avatar Apr 18 '24 23:04 jszwedko

Maybe they are not processed when the disk buffer is reached on overflow, and that's why the files are not removed?

szibis avatar Apr 19 '24 05:04 szibis