serverless-application-model

Request for Comments: Automatic Alarms for resources declared in SAM template

Open sanathkr opened this issue 7 years ago • 5 comments

What are you proposing?

SAM helps simplify the definition of Lambda functions, API Gateway and associated configurations. But it does not yet provide any mechanism to simplify the operational aspects of your application. I am proposing adding the ability for SAM to automatically create CloudWatch alarms for all resources in the template.

Example code

Globals:
  Function:
    Alarms:
      # All Lambda functions get a Throttles CloudWatch alarm. Each function can override individual alarm properties as needed.
      Throttles:
        Metric: "throttle"
        # Notify if the function is throttled within a 5-minute interval
        Threshold: 1
        Timeframe: "5min"
        Notification:
          Type: Email
          Address: [email protected]

Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: nodejs4.3
      CodeUri: s3://bucket/key
      # ...
      Alarms:
        Throttles:
          # Override the timeframe to 10 minutes for this function
          Timeframe: "10min"
        Errors:
          Metric: "my-custom-metric"
          Threshold: 100
          Timeframe: "1min"
          Notification:
            Type: SNS
            Address: !Ref OpsSnsTopic

Alarms will also be available for API resources. They are just not shown in the above template.

Automatic alarms

In addition to the above syntax, I propose adding a shortcut, for example Alarms: "default". This is a special case where SAM will automatically create the necessary alarms for each resource, with a reasonable default configuration that should "just work" out of the box.

"default" will expand to an Alarm Template that is pre-configured in SAM and will be publicly documented.

In the future, we could make Alarm Template functionality a first-class citizen, allowing users to define their own alarm templates. You could then define the template once and use Alarms: "my-prod-template" to let SAM automatically create alarms based on your template.

How would this work?

When an Alarms configuration is specified, SAM will automatically create an AWS::CloudWatch::Metric for the specific resource and an AWS::CloudWatch::Alarm on the metric with the given configuration. Users will be able to refer to the generated Metric and Alarm resources in the rest of the template using the !Ref MyFunction.Metric.<metric-name> or !Ref MyFunction.Alarm.<alarm-name> shorthand syntax that SAM will provide.
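
As a rough sketch of what the generated resource might look like, the Throttles alarm on MyFunction (with the 10-minute override) could expand to something like the following. The logical ID, the Statistic, ComparisonOperator and TreatMissingData values, and the use of OpsSnsTopic as the notification target instead of an email subscription are all assumptions made for illustration; none of this is defined by SAM. Only the alarm is shown because, as far as I know, CloudFormation has no AWS::CloudWatch::Metric resource type, so the metric half of the proposal would need a different representation.

```yaml
# Hypothetical expansion of the proposed "Throttles" alarm for MyFunction.
# Names and defaults below are illustrative assumptions, not SAM behavior.
Resources:
  MyFunctionThrottlesAlarm:            # assumed logical ID
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Throttles on MyFunction within a 10-minute window
      Namespace: AWS/Lambda
      MetricName: Throttles
      Dimensions:
        - Name: FunctionName
          Value: !Ref MyFunction       # Ref on a function resolves to its name
      Statistic: Sum
      Period: 600                      # from Timeframe: "10min", in seconds
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching   # no data points means no throttles
      AlarmActions:
        - !Ref OpsSnsTopic             # Ref on an SNS topic resolves to its ARN
```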

Process

This Issue describes a very broad set of features. We need your feedback to understand what is important to you. Please comment with your thoughts and help us evolve the requirements into a spec that can be implemented.

  • [ ] Request for comments: In Progress
  • [ ] Spec design
  • [ ] Acceptance

sanathkr Mar 17 '18 00:03

Generally, I like this. I think SAM can bring additional value by way of the operational story for building an application. I do have some concerns: how does this impact big templates, and should Alarms (and possibly other operational things like dashboards) live in the same template and therefore the same stack?

Impact on Big templates

SAM uses one stack to deploy resources, so this solution doesn't work for customers who are already near the CloudFormation limit of 200 resources per stack. Adding CloudWatch Alarms on top could push them over that limit.

With that said, I do think it makes sense for the initial schema but I think the design needs to expand a little to make sure we don't break larger templates. This leads into my next concern :)

Alarms in the same stack

I am for separation of stacks and strongly believe Alarms and other operational resources belong in their own stacks. This is mainly for blast radius reduction and to keep service deployments separate from updates to Alarms, Dashboards, etc. SAM only supports one stack and I think that is actually a blocker for this feature. This is obviously up for debate, as I am just stating my opinion. From my own experience, having the ability to keep operational resources in their own stack was super helpful. I was able to update CW Alarms or Dashboards without having to worry about affecting our resources. This was invaluable and produced safe deployments for the team.

Some other thoughts on the design directly:

  • I see you do not have all the required properties for AWS::CloudWatch::Alarm. Will SAM be generating these? How will SAM support 'ComparisonOperator' or 'Namespace'?

  • How will SAM support other AWS::CloudWatch::Alarm properties? Useful ones that come to mind are Dimensions, TreatMissingData, Unit, etc. Are customers of this feature just going to lose this flexibility?

jfuss Mar 19 '18 15:03

I love this idea (and nudged Sanath about it). I see how we could do something similar for Alarms that we did for Policy Templates, providing some canned basic Alarms that catch and represent larger use-cases out of the box. Creating Alarm Templates against Lambda and API Gateway's default collected metrics would be a big step up.

On Impact on Big Templates

While I get the concerns about templates that are too large, we face that limit no matter what, and I don't think I'd use that as a reason not to add this capability. For many users their SAM template could be just a single function, and alarming on its failure in some way could be valuable. We just need to be clear about how many resources get created by adding these in.

On Alarms in the same stack

Overall I agree, but I think we are challenged with cross-stack references and how best to handle them when creating alarms, dashboards, etc. That second step towards operational "rightness" is just too much work today and lacks clear automation. In situations like stacks launched with CodeStar or from the Serverless App Repo, easier built-in Alarms could make the "1 click" app with proper ops requirements more of a reality.

teknogeek0 Apr 03 '18 15:04

For someone like me, who adds at least 2 CloudWatch Alarms per function, this would be of great help.

On Impact on Big Templates

Are CloudWatch Alarms the problem here? Sooner or later you hit 200 resources anyway. I believe that it should be somehow possible to split an API into multiple stacks but that's another topic on its own.

On Alarms in the same stack

I believe that alarms should be in the same template.

  1. It makes referencing easy
  2. I see the defined alarms next to the resource they belong to.

Regarding the blast radius argument: how can a CloudWatch Alarm affect my running system? The only case I can think of is that I cannot deploy the stack if CloudWatch is somehow down.

Besides that, I don't see people putting alarms in separate templates. Is this common?

Event sources matter

Depending on the event sources, needed alarms differ. I would like to see alarm configuration based on the events of a lambda function. Examples:

  • If a Lambda is connected to an API Gateway, AWS/Lambda Errors and AWS/Lambda Throttles should result in 5XX on the API Gateway. So it could be enough to enable CW Metrics for the API resource and add an Alarm on AWS/ApiGateway 5XXError.
  • If a Lambda is connected to a Kinesis stream we need one CW Alarm per AWS/Lambda Errors, AWS/Lambda Throttles, and AWS/Lambda IteratorAge.
  • If a Lambda is connected to a CloudWatch Event, one CW Alarm each for AWS/Events FailedInvocations and AWS/Events ThrottledRules should be fine.
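
To make the Kinesis case concrete, here is a minimal sketch of an IteratorAge alarm written as plain CloudFormation. The resource names MyStreamFunction and OpsSnsTopic, the one-minute threshold, and the evaluation window are assumptions chosen for illustration and would need tuning per workload.

```yaml
# Sketch only: IteratorAge alarm for a stream-triggered function.
# MyStreamFunction and OpsSnsTopic are hypothetical resources assumed
# to be defined elsewhere in the same template.
Resources:
  MyStreamFunctionIteratorAgeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Stream consumer is falling behind
      Namespace: AWS/Lambda
      MetricName: IteratorAge
      Dimensions:
        - Name: FunctionName
          Value: !Ref MyStreamFunction
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 5             # sustained for 5 consecutive minutes
      Threshold: 60000                 # IteratorAge is reported in milliseconds
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref OpsSnsTopic
```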

DynamoDB

AWS::Serverless::SimpleTable is worth a few Alarms too (for example on AWS/DynamoDB ThrottledRequests). ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits are a bit harder because we need to know the provisioned capacity, which can be dynamic because of auto scaling (auto scaling is not yet supported by SAM; CW Math can help in the future).

michaelwittig Apr 09 '18 13:04

On Impact on Big Templates

Sure, you can say that someone will hit the 200-resource limit anyway, but that doesn't mean we should ignore it. What I want us to think about is how we get to a multiple-stack world. If we design features while ignoring these limitations, we can pigeonhole ourselves into not being able to support the feature across multiple stacks. Migrating stacks isn't simple, and you normally hit this down the road when you least expect it.

I brought this up because customers are already running into these limitations and SAM currently doesn't support nested stacks (this is a current limitation, unfortunately). So by not considering large templates at all, we alienate these customers more.

@teknogeek0 While many customers probably do have a single function, I don't believe that is the norm. I would assume an overwhelming majority of customers are creating and developing production applications, which require (in most cases) more than one function. We shouldn't be designing features that only look at the small cases; this is an oversight and can lead to 'one way doors'.

I didn't mean to bring this up as "don't do this feature because of large templates". It is meant to make sure we consider these customers in our design, so that when someone says they are running into limitations, we can advise them (through docs?) on how to overcome this, or maybe go deeper into supporting multiple stacks. Either way, I think considering these customers at design time is what we should be doing.

On Alarms in the same stack

I take back my 'this is a blocker'; I didn't consider SAR or CodeStar initially (great call out @teknogeek0). Everyone has a different opinion on what is best here, so we should be able to support both, whatever that ends up meaning. I see a greater and greater case for tackling the multiple-stack world, but I think we should still exercise our brains so we don't unknowingly lock ourselves into something.

I have used SAM in a couple projects and was able to use CW Alarms and dashboards in a separate stack without any problems. Once they are baked into AWS::Serverless::Function, you get locked into one template (since they are all the same resources). I recognize this is a separate problem that we might need to solve separately but we should have at least an understanding of this and be mindful when designing new features.
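
For readers unfamiliar with the pattern, a separate ops stack can be as small as the sketch below. This is an assumed layout for illustration, not the templates from the projects mentioned above: the parameter names FunctionName and AlarmTopicArn and the alarm settings are hypothetical. The service stack stays untouched; only the function name and an SNS topic ARN are passed in at deploy time.

```yaml
# Sketch of a standalone alarms stack, deployed separately from the SAM stack.
# Parameter names and thresholds are illustrative assumptions.
AWSTemplateFormatVersion: "2010-09-09"
Description: Operational alarms for a function owned by another stack
Parameters:
  FunctionName:
    Type: String
    Description: Name of the Lambda function to watch
  AlarmTopicArn:
    Type: String
    Description: ARN of the SNS topic that receives alarm notifications
Resources:
  FunctionErrorsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName     # Ref on a parameter yields its value
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlarmTopicArn
```

Updating a threshold here only changes the alarms stack, which is the isolation argument in a nutshell.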

@michaelwittig I don't want to distract from the feature request here. Come ping me on the #sam-dev Slack channel if you want to discuss blast radius further. But for clarity here: yes, they should be isolated, but your stack will still be updated and you risk something happening to your other resources. When you pull resources out into their own stacks (like ops-related things), you don't need to worry about anything. You know that the stack is completely scoped and has nothing to do with your service (or the things powering your service).

jfuss Apr 09 '18 15:04

Hey @jfuss, I wanted to ask a few questions about how you integrated CW alarms and dashboards with SAM apps. Are you on Slack somewhere? The link from this repo's README doesn't work: https://join.slack.com/t/awsdevelopers/shared_invite/enQtMzg3NTc5OTM2MzcxLTIxNjc0ZTJkNmYyNWY3OWE4NTFiNzU1ZTM2Y2VkNmFlNjQ2YjI3YTE1ZDA5YjE5NDE2MjVmYWFlYWIxNjE2NjU

Thanks!

OperationalFallacy Sep 18 '20 18:09

Is there any work being done on this? I see it hasn't been updated for a while. I often create a generic sub-stack template with alarms that I re-use across services, since the alarm config itself is quite verbose. Macros are another option, of course, with a similar outcome, but fairly often both a nested stack and a macro are too much work to set up to be worthwhile, and I end up with a SAM template where the alarms alone account for 60-70% of the template.
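
The sub-stack approach looks roughly like the sketch below, where the parent SAM template passes the function name into a reusable alarms template. The template URL, the child template's parameter names (FunctionName, AlarmTopicArn), the handler, and the OpsSnsTopic resource are all hypothetical placeholders.

```yaml
# Sketch of a SAM template that delegates alarms to a reusable nested stack.
# The TemplateURL and the child template's parameters are assumptions.
Transform: AWS::Serverless-2016-10-31
Resources:
  OpsSnsTopic:
    Type: AWS::SNS::Topic

  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: nodejs18.x
      Handler: index.handler
      CodeUri: s3://bucket/key

  MyFunctionAlarms:
    Type: AWS::CloudFormation::Stack
    Properties:
      # Hypothetical location of the shared alarms template
      TemplateURL: https://example-bucket.s3.amazonaws.com/alarms.yaml
      Parameters:
        FunctionName: !Ref MyFunction
        AlarmTopicArn: !Ref OpsSnsTopic
```

Each function needs only one extra resource in the parent template, but the child template still has to be written, uploaded and versioned, which is the setup cost described above.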

It would be great to have a simple way to create standard alarms like Throttles, Number of Concurrent Invocations and Lambda Errors. I especially like the canned Alarms suggestion by @teknogeek0.

carlnordenfelt Nov 11 '23 17:11