
Delete not-updated logs after a certain amount of time. [memory leak]

Open khanhntd opened this issue 3 years ago • 4 comments

Description: Currently, the CloudWatch Agent only stops monitoring a log file if auto_removal is turned on, and that only covers the case where the log file has not been updated for a certain amount of time. Otherwise, the number of files the CloudWatch Agent monitors keeps increasing, which eventually exhausts the file descriptors (meaning it cannot open any more log files to monitor) on most OSes (including Linux). An example would be:

2021-10-12T12:18:16Z E! [outputs.cloudwatchlogs] Aws error received when sending logs to ./audit: NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, .
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": dial tcp 169.254.169.254:80: socket: too many open files
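The log above shows the knock-on effect: once the tails use up the soft fd limit, even the socket for the EC2 metadata credential request fails with "too many open files". As a rough sketch (not agent code; the function name is illustrative), the per-process limit the agent is bumping into can be read with `syscall.Getrlimit` on Linux:

```go
package main

import (
	"fmt"
	"syscall"
)

// fdLimit returns the soft and hard limits on open file descriptors
// for the current process. Every tailed log file holds one fd, so
// once the tail count approaches the soft limit, any further open()
// or socket() call fails with "too many open files".
func fdLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return rl.Cur, rl.Max, nil
}

func main() {
	soft, hard, err := fdLimit()
	if err != nil {
		panic(err)
	}
	fmt.Printf("soft=%d hard=%d\n", soft, hard)
}
```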

Solution: While tailing, track the most recent change to each log file and compare it against the current time. If the log file has not changed within that window, kill the tail process.

Alternative solution:

khanhntd avatar Mar 19 '22 01:03 khanhntd

The agent taking up file descriptors on the host can cause other issues on the host, right? On top of this I think we should consider applying a limit to the number of file descriptors we can have open if possible. Is it currently possible to monitor hundreds of files at once and eat up all of the fds on the host? Considering you need an open fd for making a TCP connection (I think?), the agent could cripple the overall host health.

SaxyPandaBear avatar Mar 19 '22 17:03 SaxyPandaBear

> The agent taking up file descriptors on the host can cause other issues on the host, right? On top of this I think we should consider applying a limit to the number of file descriptors we can have open if possible. Is it currently possible to monitor hundreds of files at once to eat up all of the fds on the host? Considering you need an open fd for making a TCP connection (I think?) the agent could cripple the overall host health.

In both the short run and the long run, yes, it is possible that the agent could reach the fd limit on the host. As for the solution, yes, we could set a limit on the number of fds to monitor alongside the change that removes un-updated files after a certain amount of time. What I am still considering is: instead of setting a limit on the number of fds to monitor, should we also set a limit on the memory the CWAgent is allowed to use (same as MA)? Needs some thought before creating a PR for this, though.
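The fd cap being discussed could be as simple as a counting semaphore around the tail-open path. A minimal sketch, assuming a hypothetical `tailSlots` type (not an agent API), where a file is only opened if a slot is free:

```go
package main

import "fmt"

// tailSlots caps how many files may be tailed concurrently. The
// buffered channel acts as a counting semaphore.
type tailSlots chan struct{}

func newTailSlots(n int) tailSlots { return make(tailSlots, n) }

// tryAcquire returns true if a slot was free. Callers that fail to
// acquire must skip (or queue) the file instead of opening it, so
// the process never exceeds the configured fd budget.
func (s tailSlots) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot when a tail is torn down.
func (s tailSlots) release() { <-s }

func main() {
	slots := newTailSlots(2)
	fmt.Println(slots.tryAcquire()) // true
	fmt.Println(slots.tryAcquire()) // true
	fmt.Println(slots.tryAcquire()) // false: cap reached
	slots.release()
	fmt.Println(slots.tryAcquire()) // true again
}
```

A memory cap (the MA-style alternative mentioned above) would need a different mechanism, e.g. runtime/debug.SetMemoryLimit, rather than a slot count.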

khanhntd avatar Mar 20 '22 02:03 khanhntd

Another thing that could help with this particular issue is to monitor the number of file handles on the host.
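On Linux, the agent could observe its own handle usage by counting entries under /proc/self/fd. This is a platform-specific sketch (the function name is illustrative; Windows and macOS need different APIs):

```go
package main

import (
	"fmt"
	"os"
)

// countOpenFDs counts this process's open file descriptors by
// listing /proc/self/fd (Linux-specific). A monitoring loop could
// compare this against the rlimit and alarm, or shed tails, as it
// approaches the limit.
func countOpenFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := countOpenFDs()
	fmt.Println(n, err)
}
```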

SaxyPandaBear avatar May 07 '22 15:05 SaxyPandaBear

IMO, I would agree that monitoring the number of file handles would be a sub-task. The long-term solution, based on my thoughts, would be: Step 1: Have the CWAgent check the number of file handles on the host. Step 2: Divide into two separate processes:

  • If the monitored files have exceeded the file descriptor limit, forcefully delete the oldest-updated files so that new files can be monitored
  • Every {{abstract}} minutes, check for and delete files that have not been updated for {{abstract}}

khanhntd avatar May 08 '22 03:05 khanhntd