
amazon-cloudwatch-agent doesn't respect SIGTERM to flush metrics

Open dipeshgautamsol opened this issue 2 years ago • 10 comments

Describe the bug

I am trying to run the aws-cloudwatch-agent as part of a Lambda extension, so that metrics sent by Lambda executions are batched. Basically, I have a Go program that runs this command under the hood, with the static configuration files pasted below: `amazon-cloudwatch-agent -config config.toml -otelconfig config.yaml`. So far, the bootup works. The problem is that the agent doesn't seem to flush its metrics when it receives a SIGTERM signal. It just closes all extensions and shuts down without flushing any metrics to CloudWatch. This is really problematic inside a Lambda extension: if the flush interval has not been reached by the time the Lambda environment shuts down, the metrics are lost completely.

Apparently, the telegraf agent understands a SIGTERM or SIGUSR1 and flushes its cache before shutting down.

This issue already goes in this direction.

What did you expect to see? The metrics should be flushed during shutdown

What did you see instead? The metrics are lost

What config did you use?

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = true
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.statsd]]
    interval = "10s"
    parse_data_dog_tags = true
    service_address = ":8125"
    [inputs.statsd.tags]
      "aws:AggregationInterval" = "10s"

[outputs]

  [[outputs.cloudwatch]]

config.yaml

connectors: {}
exporters:
    awscloudwatch:
        force_flush_interval: 10s
        max_datums_per_call: 1000
        max_values_per_datum: 150
        namespace: somewhere
        region: eu-central-1
        resource_to_telemetry_conversion:
            enabled: true
extensions: {}
processors: {}
receivers:
    telegraf_statsd:
        collection_interval: 10s
        initial_delay: 1s
        timeout: 0s
service:
    extensions: []
    pipelines:
        metrics/host:
            exporters:
                - awscloudwatch
            processors: []
            receivers:
                - telegraf_statsd

dipeshgautamsol avatar Nov 15 '23 16:11 dipeshgautamsol

here are the shutdown logs. I would have assumed the flushing to have happened between Stopping and Stopped state of Cloudwatch output plugin

2023-11-15T17:09:17Z I! {"caller":"otelcol/collector.go:250","msg":"Received signal from OS","signal":"terminated"}
2023-11-15T17:09:17Z I! {"caller":"service/service.go:170","msg":"Starting shutdown..."}
2023-11-15T17:09:17Z D! Profiler dump:
[no stats is available...]
2023-11-15T17:09:17Z I! {"caller":"adapter/receiver.go:76","msg":"Shutdown adapter","kind":"receiver","name":"telegraf_statsd","data_type":"metrics","receiver":"statsd"}
2023-11-15T17:09:17Z D! Stopping the statsd service
2023-11-15T17:09:17Z D! Stopped the statsd service
2023-11-15T17:09:17Z I! Profiler is stopped during shutdown
2023-11-15T17:09:17Z D! Stopping the CloudWatch output plugin
2023-11-15T17:09:17Z D! Stopped the CloudWatch output plugin
2023-11-15T17:09:17Z I! {"caller":"extensions/extensions.go:45","msg":"Stopping extensions..."}
2023-11-15T17:09:17Z I! {"caller":"service/service.go:184","msg":"Shutdown complete."}

dipeshgautamsol avatar Nov 15 '23 17:11 dipeshgautamsol

Hello,

Is it possible for you to provide us with the cloudwatch-agent logs from /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log? This will give us more insight into the issue. After that, I'll investigate the shutdown mechanism to find the root cause.

okankoAMZ avatar Nov 15 '23 17:11 okankoAMZ

The logs posted above should be the same logs as those written to this logfile. Since I'm running the agent inside my process, its logs are part of my process's output.

dipeshgautamsol avatar Nov 15 '23 22:11 dipeshgautamsol

Here are the steps to reproduce it locally:

  1. Build the cw agent binaries locally.
  2. Using the configuration files above, run this command: `amazon-cloudwatch-agent -config config.toml -otelconfig amazon-cloudwatch-agent.yaml`. The flush interval is set to 10 seconds.
  3. Emit some metrics. Here is a Python script that emits metrics:
# pip3 install datadog
import time
from datadog import initialize, statsd

def lambda_handler(event, context):
    options = {
        "statsd_host": "127.0.0.1",
        "statsd_port": 8125,
        "disable_telemetry": True,
        "statsd_constant_tags": [
            "Owner:someone"
        ]
    }

    initialize(**options)
    for _ in range(100):
        time.sleep(0.02)
        statsd.increment("early_shutdown", tags=["SubProcess:write-timestream-metric"])
  4. Run the metrics emitter. Leave the agent running for 10 seconds. Metrics are exported.
  5. Run the metrics emitter. Send a SIGTERM to the agent before the 10 seconds elapse. Metrics are lost.

Shutdown sequence

2023-11-16T09:35:48Z D! [logagent] open file count, 0
^C2023-11-16T09:35:49Z D! Profiler dump:
[no stats is available...]
2023-11-16T09:35:49Z I! Profiler is stopped during shutdown
2023-11-16T10:35:49.123+0100	info	otelcol/collector.go:250	Received signal from OS	{"signal": "interrupt"}
2023-11-16T10:35:49.123+0100	info	service/service.go:170	Starting shutdown...
2023-11-16T09:35:49Z D! Stopping the statsd service
2023-11-16T09:35:49Z D! Stopped the statsd service
2023-11-16T09:35:49Z D! Stopping the CloudWatch output plugin
2023-11-16T09:35:49Z D! Stopped the CloudWatch output plugin
2023-11-16T09:35:49Z D! [outputs.cloudwatch] LogThrottleRetryer watch throttle events goroutine exiting
2023-11-16T10:35:49.156+0100	info	extensions/extensions.go:45	Stopping extensions...
2023-11-16T10:35:49.156+0100	info	service/service.go:184	Shutdown complete.

Also note that the behavior is the same if I set the telegraf intervals to 1s (keeping the OTel flush interval at 10 seconds).

Sometimes, if we wait a bit, there are hints of metrics being flushed, but they take more than a couple of seconds and this is not consistent. I am not sure whether the delay comes from the OTel collector or from statsd.

2023-11-16T09:49:47Z D! Stopping the CloudWatch output plugin
2023-11-16T09:49:47Z D! CloudWatch Close, 0th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:48Z D! CloudWatch Close, 1th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:49Z D! CloudWatch Close, 2th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:50Z D! CloudWatch Close, 3th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:51Z D! Stopped the CloudWatch output plugin

Also, we can't afford 3 seconds of flush time, because a Lambda extension is shut down by AWS Lambda after a maximum of 2 seconds.

dipeshgautamsol avatar Nov 16 '23 09:11 dipeshgautamsol

Hello, I am working on recreating the issue and wanted to double-check a few points:

  1. Which OS did you use to build the agent?
  2. Which Go version did you use to build the agent?
  3. Have you tried an already-built agent, and does the issue still occur there?

okankoAMZ avatar Nov 16 '23 19:11 okankoAMZ

FROM golang:1.21-alpine3.18 as builder
RUN git clone https://github.com/aws/amazon-cloudwatch-agent.git
RUN cd amazon-cloudwatch-agent && CGO_ENABLED=0 go build -trimpath -buildmode=default\
    -ldflags="-X github.com/aws/amazon-cloudwatch-agent/cfg/agentinfo.VersionStr=2.6.0"\
    -o /opt/extensions/bin/amazon-cloudwatch-agent github.com/aws/amazon-cloudwatch-agent/cmd/amazon-cloudwatch-agent

I built the agent on darwin arm64 as well as linux x86_64, both with the same effect.

dipeshgautamsol avatar Nov 17 '23 08:11 dipeshgautamsol

Hello @okankoAMZ were you able to somehow reproduce the issue?

dipeshgautamsol avatar Dec 12 '23 12:12 dipeshgautamsol

Hi! I have not been able to reproduce it yet; however, I have created a backlog item for us to investigate this issue.

okankoAMZ avatar Feb 07 '24 19:02 okankoAMZ

@okankoAMZ any news regarding this?

mihaileu avatar Mar 24 '25 12:03 mihaileu

@okankoAMZ any news regarding this?

Really, more than a year has passed with no updates whatsoever?

isavcic avatar May 19 '25 11:05 isavcic