amazon-cloudwatch-agent

E! Cannot translate JSON config into TOML, ERROR is exit status 1

Open Cumming5412 opened this issue 2 years ago • 1 comments

Describe the bug On Windows Server 2019 Datacenter edition, the CloudWatch agent errors out at EC2 startup and does not start. However, if you start the service manually, it starts fine. I have also noticed that it works if the startup type is set to Automatic (Delayed Start). The log shows that at boot the agent detects the instance as OnPremise, which is incorrect; it's an EC2 instance. When started manually after Windows has finished booting, it correctly detects EC2.

Contents of C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\amazon-cloudwatch-agent:

2023/08/02 09:38:04 I! D! [EC2] Found active network interface
E! [EC2] Cannot get EC2 Metadata from IMDS: EC2 metadata is not available.
I! Detected the instance is OnPremise
2023/08/02 09:38:02 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent.json ...
C:\ProgramData\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2023/08/02 09:38:03 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\Configs\ssm_AmazonCloudWatch-windows ...
2023/08/02 09:38:03 I! Valid Json input schema.
Got Home directory: C:\Users\Administrator
I! Set home dir windows: C:\Users\Administrator
I! SDKRegionWithCredsMap region:
Got Home directory: C:\Users\Administrator
2023/08/02 09:38:04 E! Failed to generate configuration validation content.
2023/08/02 09:38:04 E! Failed to generate configuration validation content.
2023/08/02 09:38:04 Under path : /agent/ruleRegion/ | Error : Region info is missing for mode: onPrem
2023/08/02 09:38:04 Configuration validation first phase failed. Agent version: 1.0. Verify the JSON input is only using features supported by this version.

2023/08/02 09:38:04 I! Return exit error: exit code=1
2023/08/02 09:38:04 E! Cannot translate JSON config into TOML, ERROR is exit status 1
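To tell the two start-up paths apart without reading the whole log, a small scanner can look for the mode line and the IMDS error. This is only a sketch: the function name `summarize_startup` and the returned keys are made up for illustration, and the two strings it searches for are copied from the log lines quoted in this issue, so they may differ in other agent versions.

```python
import re
from typing import Dict, List, Union

# Both patterns are taken from the log excerpts in this issue (assumption:
# other agent versions may word these lines differently).
MODE_RE = re.compile(r"Detected the instance is (\w+)")
IMDS_ERROR = "Cannot get EC2 Metadata from IMDS"


def summarize_startup(log_text: str) -> Dict[str, Union[List[str], bool]]:
    """Report which modes the agent detected and whether IMDS was unreachable."""
    modes = MODE_RE.findall(log_text)
    return {
        "modes": modes,
        "imds_failed": IMDS_ERROR in log_text,
        "on_premise_detected": "OnPremise" in modes,
    }


if __name__ == "__main__":
    sample = (
        "E! [EC2] Cannot get EC2 Metadata from IMDS: EC2 metadata is not available.\n"
        "I! Detected the instance is OnPremise\n"
    )
    print(summarize_startup(sample))
```

On an EC2 instance, `on_premise_detected` together with `imds_failed` points at the boot-time race described here rather than at a problem in the JSON config itself.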

However, if I manually start the service after Windows has started, the log looks like:

2023/08/02 09:53:44 I! D! [EC2] Found active network interface
I! Detected the instance is EC2
2023/08/02 09:53:44 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent.json ...
C:\ProgramData\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2023/08/02 09:53:44 Reading json config file path: C:\ProgramData\Amazon\AmazonCloudWatchAgent\Configs\ssm_AmazonCloudWatch-windows ...
2023/08/02 09:53:44 I! Valid Json input schema.
I! Trying to detect region from ec2
No csm configuration found.
No log configuration found.
Configuration validation first phase succeeded

2023/08/02 09:53:44 I! Config has been translated into TOML C:\ProgramData\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent.toml
2023/08/02 09:53:44 D! toml config
[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  logfile = "C:\ProgramData\Amazon\AmazonCloudWatchAgent\Logs\amazon-cloudwatch-agent.log"
  logtarget = "lumberjack"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = false

[inputs]

[[inputs.statsd]]
  interval = "10s"
  parse_data_dog_tags = true
  service_address = ":8125"
[inputs.statsd.tags]
  "aws:AggregationInterval" = "60s"
  metricPath = "metrics"

[[inputs.win_perf_counters]]
  DisableReplacer = true
  interval = "60s"

[[inputs.win_perf_counters.object]]
  Counters = ["% Free Space"]
  Instances = ["*"]
  Measurement = "LogicalDisk"
  ObjectName = "LogicalDisk"
  WarnOnMissing = true

[[inputs.win_perf_counters.object]]
  Counters = ["% Committed Bytes In Use"]
  Instances = ["------"]
  Measurement = "Memory"
  ObjectName = "Memory"
  WarnOnMissing = true

[[inputs.win_perf_counters.object]]
  Counters = ["% Usage"]
  Instances = ["*"]
  Measurement = "Paging File"
  ObjectName = "Paging File"
  WarnOnMissing = true

[[inputs.win_perf_counters.object]]
  Counters = ["% Disk Time"]
  Instances = ["*"]
  Measurement = "PhysicalDisk"
  ObjectName = "PhysicalDisk"
  WarnOnMissing = true

[[inputs.win_perf_counters.object]]
  Counters = ["% User Time", "% Idle Time", "% Interrupt Time"]
  Instances = ["*"]
  Measurement = "Processor"
  ObjectName = "Processor"
  WarnOnMissing = true
[inputs.win_perf_counters.tags]
  metricPath = "metrics"

[outputs]

[[outputs.cloudwatch]]
  force_flush_interval = "60s"
  namespace = "CWAgent"
  region = "eu-central-1"
  rollup_dimensions = [["InstanceId"]]
  tagexclude = ["host", "metricPath"]
[outputs.cloudwatch.tagpass]
  metricPath = ["metrics"]

[processors]

[[processors.ec2tagger]]
  ec2_instance_tag_keys = ["aws:autoscaling:groupName"]
  ec2_metadata_tags = ["ImageId", "InstanceId", "InstanceType"]
  refresh_interval_seconds = "0s"
[processors.ec2tagger.tagpass]
  metricPath = ["metrics"]

2023-08-02T09:53:45Z I! Starting AmazonCloudWatchAgent CWAgent/1.247360.0b252689 (go1.20.5; windows; amd64)
2023-08-02T09:53:45Z I! AWS SDK log level not set
2023-08-02T09:53:45Z I! Loaded inputs: statsd win_perf_counters
2023-08-02T09:53:45Z I! Loaded aggregators:
2023-08-02T09:53:45Z I! Loaded processors: ec2tagger
2023-08-02T09:53:45Z I! Loaded outputs: cloudwatch
2023-08-02T09:53:45Z I! Tags enabled: host=eim-t2b-app
2023-08-02T09:53:45Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"xxx", Flush Interval:1s
2023-08-02T09:53:45Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2023-08-02T09:53:45Z I! [logagent] starting
2023-08-02T09:53:45Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2023-08-02T09:53:45Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2023-08-02T09:53:45Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2023-08-02T09:53:45Z I! cloudwatch: get unique roll up list [[InstanceId]]
2023-08-02T09:53:45Z I! Started the statsd service on :8125
2023-08-02T09:53:45Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 1.374584607s
2023-08-02T09:53:45Z I! Statsd listener listening on: [::]:8125
2023-08-02T09:53:45Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2023-08-02T09:53:45Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2023-08-02T09:53:45Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2023-08-02T09:53:45Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes

Steps to reproduce Install v1.247360.0b252689 of the CloudWatch agent, with the config taken from an SSM parameter. Restart Windows, then check the service status and error logs.

What did you expect to see? The service should have started. Checking via the Run command returns: { "status": "stopped", "starttime": "", "configstatus": "configured", "version": "1.247360.0b252689" }

What did you see instead? The service did not start and errors were logged.

What version did you use? 1.247360.0b252689

What config did you use? SSM param - AmazonCloudWatch-windows:

{
  "metrics": {
    "aggregation_dimensions": [ [ "InstanceId" ] ],
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "LogicalDisk": {
        "measurement": [ "% Free Space" ],
        "metrics_collection_interval": 60,
        "resources": [ "*" ]
      },
      "Memory": {
        "measurement": [ "% Committed Bytes In Use" ],
        "metrics_collection_interval": 60
      },
      "Paging File": {
        "measurement": [ "% Usage" ],
        "metrics_collection_interval": 60,
        "resources": [ "*" ]
      },
      "PhysicalDisk": {
        "measurement": [ "% Disk Time" ],
        "metrics_collection_interval": 60,
        "resources": [ "*" ]
      },
      "Processor": {
        "measurement": [ "% User Time", "% Idle Time", "% Interrupt Time" ],
        "metrics_collection_interval": 60,
        "resources": [ "*" ]
      },
      "statsd": {
        "metrics_aggregation_interval": 60,
        "metrics_collection_interval": 10,
        "service_address": ":8125"
      }
    }
  }
}
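For context on why a missed metadata lookup matters for this config: the `append_dimensions` values are `${aws:...}` placeholders, and the translated TOML above routes them through `ec2tagger`, so they presumably need EC2 metadata or tags at runtime. A rough way to list them, assuming the config is available as a JSON string (the helper name `metadata_dimensions` is hypothetical and not part of the agent):

```python
import json
import re
from typing import Dict

# Matches values such as "${aws:InstanceId}" and captures the part after "aws:".
PLACEHOLDER_RE = re.compile(r"^\$\{aws:(\w+)\}$")


def metadata_dimensions(config_text: str) -> Dict[str, str]:
    """Map each append_dimensions key to the aws placeholder name it references."""
    config = json.loads(config_text)
    dims = config.get("metrics", {}).get("append_dimensions", {})
    found = {}
    for name, value in dims.items():
        # Skip anything that is not a plain "${aws:...}" string.
        match = PLACEHOLDER_RE.match(value) if isinstance(value, str) else None
        if match:
            found[name] = match.group(1)
    return found


if __name__ == "__main__":
    sample = '{"metrics": {"append_dimensions": {"InstanceId": "${aws:InstanceId}"}}}'
    print(metadata_dimensions(sample))
```

This only inspects the JSON; it does not say how the agent resolves the placeholders internally.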

Environment OS: Windows Server 2019 Datacenter

Additional context The failure seems to stem from this line: "E! [EC2] Cannot get EC2 Metadata from IMDS: EC2 metadata is not available." Possibly IMDS or networking is not ready yet at boot, so the agent's start-up fails and it does not recover. I can work around this by setting the service startup type to Automatic (Delayed Start), but that is an extra step, and I am not sure whether the setting gets reset when I upgrade the agent.
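If the cause really is IMDS not being reachable yet at boot, one alternative to the delayed-start setting would be a small start-up script that waits for IMDS before starting the service. The sketch below is an assumption-laden illustration, not something the agent ships: `imds_instance_id` and `wait_for_imds` are hypothetical names, the attempt count and delay are arbitrary, and the probe uses the standard IMDSv2 token flow against 169.254.169.254.

```python
import time
import urllib.request
from typing import Callable

IMDS_BASE = "http://169.254.169.254/latest"


def imds_instance_id(timeout: float = 2.0) -> str:
    """Fetch the instance ID via IMDSv2; raises OSError while IMDS is unreachable."""
    token_req = urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    id_req = urllib.request.Request(
        f"{IMDS_BASE}/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(id_req, timeout=timeout) as resp:
        return resp.read().decode()


def wait_for_imds(
    probe: Callable[[], str] = imds_instance_id,
    attempts: int = 10,   # arbitrary illustrative default
    delay: float = 3.0,   # seconds between attempts, also arbitrary
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call probe until it succeeds; re-raise the last OSError after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return probe()
        except OSError:  # urllib.error.URLError and socket timeouts are OSError subclasses
            if attempt == attempts:
                raise
            sleep(delay)
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    print("IMDS reachable, instance:", wait_for_imds())
```

Once `wait_for_imds()` returns, the script could start the agent service by whatever means is already in use. Passing `probe` and `sleep` as parameters keeps the retry loop testable without network access.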

Cumming5412 • Aug 02 '23 10:08

Thank you for bringing this issue to our attention. We added a retry mechanism for IMDS calls in the latest release. You can increase the number of retries in your common-config.toml:

[imds]
  imds_retries = ${change this to a high number}

sethAmazon • Oct 05 '23 17:10

This issue was marked stale due to lack of activity.

github-actions[bot] • Aug 13 '24 00:08

Closing this because it has stalled. Feel free to reopen if this issue is still relevant, or to ping the collaborator who labeled it stalled if you have any questions.

github-actions[bot] • Sep 15 '24 00:09