Resiliency by Design – Capture Resiliency Features as part of CALM
Feature Request
Develop CALM to capture those aspects of resiliency that are decided or influenced by architecture design choices.
Description of Problem:
Designing systems for resiliency is a complex endeavour.
While it is easy to find literature on resiliency techniques and considerations, there seems to be a lack of practical ways of effectively applying resiliency considerations to architecture designs.
Potential Solutions:
CALM offers the opportunity to capture and persist resiliency considerations as part of system architecture designs and subsequent implementations. Over time, this could develop into:
- A structured, practical, and scalable guide for resiliency design.
- Templated resiliency design options.

Leading to better resiliency capabilities:
- Compare & contrast different resiliency design choices.
- Development and identification of resiliency design patterns.
- Improved resiliency measures.
- Targeted resiliency testing.
Next Steps
Create a Framework to articulate Resiliency Requirements
Before a system can be declared "resilient", there needs to be an understanding of what the benchmark is - ideally expressed as a clear set of requirements that need to be met.
Here is an outline of a potential framework to capture resiliency requirements:
Definitions
- Resiliency - The ability of a system to maintain an "acceptable level of service" in the face of adverse conditions.
- Acceptable level of service - The level of service agreed by the stakeholders to whom the system provides services. It may exhibit some degradation compared to desired performance but is still accepted as performing its function (e.g. the service is not perceived as stopping or halting, suffering downtime, or requiring human intervention such as manual restore or recovery operations).
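
To make the definition above concrete, here is a minimal sketch of how an agreed minimum level of service might be expressed and checked. The `ServiceLevel` type, its two fields, and the threshold values are hypothetical illustrations; none of them are part of CALM or of this proposal.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical, simplified service-level measurement."""
    availability_pct: float  # share of successful requests, 0-100
    p99_latency_ms: float    # 99th percentile response time


def is_acceptable(observed: ServiceLevel, minimum: ServiceLevel) -> bool:
    """True if the observed level meets the stakeholder-agreed minimum."""
    return (
        observed.availability_pct >= minimum.availability_pct
        and observed.p99_latency_ms <= minimum.p99_latency_ms
    )


# Assumed values, for illustration only.
minimum_acceptable = ServiceLevel(availability_pct=99.0, p99_latency_ms=800.0)

# Degraded compared to desired performance, but still within the agreed minimum.
degraded = ServiceLevel(availability_pct=99.5, p99_latency_ms=500.0)

# Below the agreed minimum, so no longer an acceptable level of service.
failing = ServiceLevel(availability_pct=97.0, p99_latency_ms=500.0)
```

Under these assumptions a system counts as resilient to a given adverse condition if `is_acceptable` keeps returning `True` while that condition is in effect.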
Scope
- System
- System's end of any relationships (communications, dependencies)
- Architecture of the System
- Resiliency
- Recovery
Taxonomy
- System Rating - custom defined - e.g. 1, 2, 3
- Requirement Applicability - custom defined - e.g. Must Have, Optional
- Requirement Type - custom defined - e.g. Policy, Implementation
- Requirement - custom defined.
Resiliency Requirements Framework (Example)
| Requirement Type | Requirement | System Rating 1 | System Rating 2 | System Rating 3 |
|---|---|---|---|---|
| Policy | The system has a clear (stakeholder agreed) definition of the minimum acceptable level of service. | Must Have | Must Have | Optional |
| Implementation | The acceptable level of service definition is expressed quantifiably in terms of availability, latency, performance and integrity requirements. | Must Have | Must Have | Optional |
| Policy | The system has a clear (stakeholder agreed) definition of RPO. | Must Have | Must Have | Optional |
| Policy | The system has a clear (stakeholder agreed) definition of RTO. | Must Have | Must Have | Optional |
| Policy | The system is portable to run on different platforms and vendor services. | Must Have | Optional | Optional |
| Policy | The system maintains back-ups of all critical data points. | Must Have | Must Have | Must Have |
| Implementation | Data Back-ups are taken every X hrs. | Must Have (X=2) | Must Have (X | Must Have (X |
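
As a rough sketch of how the taxonomy and example table could be captured as data, the snippet below encodes three of the rows and looks up the "Must Have" requirements for a given system rating. The class names, the `FRAMEWORK` list, and the `must_haves` helper are hypothetical; this is not an existing CALM schema, only one possible shape for it.

```python
from dataclasses import dataclass
from enum import Enum


class RequirementType(Enum):
    POLICY = "Policy"
    IMPLEMENTATION = "Implementation"


class Applicability(Enum):
    MUST_HAVE = "Must Have"
    OPTIONAL = "Optional"


@dataclass
class Requirement:
    type: RequirementType
    text: str
    # Applicability keyed by system rating (1, 2, 3 in the example table).
    applicability: dict[int, Applicability]


M, O = Applicability.MUST_HAVE, Applicability.OPTIONAL

# Three rows copied from the example framework table.
FRAMEWORK = [
    Requirement(
        RequirementType.POLICY,
        "The system has a clear (stakeholder agreed) definition of RPO.",
        {1: M, 2: M, 3: O},
    ),
    Requirement(
        RequirementType.POLICY,
        "The system is portable to run on different platforms and vendor services.",
        {1: M, 2: O, 3: O},
    ),
    Requirement(
        RequirementType.POLICY,
        "The system maintains back-ups of all critical data points.",
        {1: M, 2: M, 3: M},
    ),
]


def must_haves(rating: int) -> list[str]:
    """Return the requirement texts that are 'Must Have' for a system rating."""
    return [
        r.text
        for r in FRAMEWORK
        if r.applicability.get(rating) is Applicability.MUST_HAVE
    ]
```

For example, `must_haves(2)` returns the RPO and back-up requirements but not the portability one, matching the table. Keeping applicability as a per-rating mapping leaves the rating scale custom defined, as the taxonomy suggests.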
Propose a standard set of Resiliency Requirements Definitions
This working group could/should propose standard resiliency requirement definitions to pick and choose from.
@rocketstack-matt @develontopia, happy to contribute ideas to this domain. Let me know if there are separate discussions scheduled.
@charleyalpha789 one area that might be of interest is looking at: https://github.com/finos/architecture-as-code/issues/426
Have you attended a CALM monthly meeting? This might be a good place to start and get an introduction to some of the other collaborators.