architecture-as-code Resiliency by Design – Capture Resiliency Features as part of CALM

Feature Request

Develop CALM to capture those aspects of resiliency that are decided or influenced by architecture design choices.

Description of Problem:

Designing systems for resiliency is a complex endeavour.

While it is easy to find literature on resiliency techniques and considerations, there seems to be a lack of practical ways of effectively applying resiliency considerations to architecture designs.

Potential Solutions:

CALM offers the opportunity to capture and persist resiliency considerations as part of system architecture designs and subsequent implementations. Over time this could develop into:

• A structured, practical, and scalable guide for resiliency design.
• Templated resiliency design options.

Leading to better resiliency capabilities:

• Compare & contrast different resiliency design choices.
• Development and identification of resiliency design patterns.
• Improved resiliency measures.
• Targeted resiliency testing.

Next Steps

Create a Framework to articulate Resiliency Requirements

Before a system can be declared "resilient", there needs to be an understanding of what the benchmark is - ideally expressed as a clear set of requirements that need to be met.

Here is an outline of a potential framework to capture resiliency requirements:

Definitions

Resiliency - The ability of a system to maintain an "acceptable level of service" in the face of adverse conditions.
Acceptable level of service - The level of service agreed by the stakeholders to whom the system provides services. It might exhibit some degradation compared to desired performance but is still accepted as performing its function (e.g. the service is not perceived to stop or halt, to suffer downtime, or otherwise to involve human intervention such as manual restore or recovery operations).

Scope

System
System's end of any relationships (communications, dependencies)
Architecture of the System
Resiliency
Recovery

Taxonomy

System Rating - custom defined - e.g. 1, 2, 3
Requirement Applicability - custom defined - e.g. Must Have, Optional
Requirement Type - custom defined - e.g. Policy, Implementation
Requirement - custom defined.

Resiliency Requirements Framework (Example)

		System Rating
Requirement Type	Requirement	1	2	3
Policy	The system has a clear (stakeholder agreed) definition of the minimum acceptable level of service.	Must Have	Must Have	Optional
Implementation	The acceptable level of service definition is expressed quantifiably in terms of availability, latency, performance and integrity requirements.	Must Have	Must Have	Optional
Policy	The system has a clear (stakeholder agreed) definition of RPO.	Must Have	Must Have	Optional
Policy	The system has a clear (stakeholder agreed) definition of RTO.	Must Have	Must Have	Optional
Policy	The system is portable to run on different platforms and vendor services.	Must Have	Optional	Optional
Policy	The system maintains back-ups of all critical data points.	Must Have	Must Have	Must Have
Implementation	Data Back-ups are taken every X hrs.	Must Have (X=2)	Must Have (X	Must Have (X
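
As a rough illustration of how the taxonomy and the example table could be held as data, here is a minimal Python sketch. It is not a CALM schema or an agreed format: the class names, the `REQUIREMENTS` list and the `must_haves` helper are all hypothetical, and the requirement texts are shortened from the table. The back-up interval row is left out because its X values for ratings 2 and 3 are incomplete in the table.

```python
from dataclasses import dataclass
from enum import Enum


class Applicability(Enum):
    # "Requirement Applicability" from the taxonomy (custom defined).
    MUST_HAVE = "Must Have"
    OPTIONAL = "Optional"


class RequirementType(Enum):
    # "Requirement Type" from the taxonomy (custom defined).
    POLICY = "Policy"
    IMPLEMENTATION = "Implementation"


@dataclass
class Requirement:
    type: RequirementType
    text: str
    # Maps a system rating (1, 2, 3) to the applicability for that rating.
    applicability: dict


M, O = Applicability.MUST_HAVE, Applicability.OPTIONAL

# Hypothetical encoding of the example rows; texts are abbreviated.
REQUIREMENTS = [
    Requirement(RequirementType.POLICY,
                "Agreed definition of the minimum acceptable level of service",
                {1: M, 2: M, 3: O}),
    Requirement(RequirementType.IMPLEMENTATION,
                "Acceptable level of service is expressed quantifiably",
                {1: M, 2: M, 3: O}),
    Requirement(RequirementType.POLICY,
                "Agreed definition of RPO",
                {1: M, 2: M, 3: O}),
    Requirement(RequirementType.POLICY,
                "Agreed definition of RTO",
                {1: M, 2: M, 3: O}),
    Requirement(RequirementType.POLICY,
                "Portable across platforms and vendor services",
                {1: M, 2: O, 3: O}),
    Requirement(RequirementType.POLICY,
                "Back-ups of all critical data points",
                {1: M, 2: M, 3: M}),
]


def must_haves(rating):
    """Return the requirements that are Must Have for the given system rating."""
    return [r for r in REQUIREMENTS
            if r.applicability.get(rating) is Applicability.MUST_HAVE]


if __name__ == "__main__":
    for rating in (1, 2, 3):
        print(f"System rating {rating}:")
        for req in must_haves(rating):
            print(f"  [{req.type.value}] {req.text}")
```

Keeping applicability as a per-rating mapping on each requirement mirrors the columns of the table, so adding a rating or a requirement does not change the lookup logic.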

Propose a standard set of Resiliency Requirements Definitions

This working group could/should propose standard resiliency requirement definitions to pick and choose from.

May 21 '24 10:05 develontopia

@rocketstack-matt @develontopia, happy to contribute ideas to this domain. Let me know if there are separate discussions scheduled.

Aug 27 '24 15:08 charleyalpha789

@charleyalpha789 one area that might be of interest is looking at: https://github.com/finos/architecture-as-code/issues/426

Have you attended a CALM monthly meeting? This might be a good place to start and get an introduction to some of the other collaborators.

Oct 07 '24 10:10 jpgough-ms