
Resiliency by Design – Capture Resiliency Features as part of CALM

Open develontopia opened this issue 1 year ago • 2 comments

Feature Request

Develop CALM to capture those aspects of resiliency that are decided or influenced by architecture design choices.

Description of Problem:

Designing systems for resiliency is a complex endeavour.

While it is easy to find literature on resiliency techniques and considerations, there seems to be a lack of practical ways of effectively applying resiliency considerations to architecture designs.

Potential Solutions:

CALM offers the opportunity to capture and persist resiliency considerations as part of system architecture designs and subsequent implementations. Over time, this could develop into:

  • A structured, practical, and scalable guide for resiliency design.
  • Templated resiliency design options.

Leading to better resiliency capabilities:

  • Compare & contrast different resiliency design choices.
  • Development and identification of resiliency design patterns.
  • Improved resiliency measures.
  • Targeted resiliency testing.

Next Steps

Create a Framework to articulate Resiliency Requirements

Before a system can be declared "resilient", there needs to be an understanding of what the benchmark is - ideally expressed as a clear set of requirements that need to be met.

Here is an outline of a potential framework to capture resiliency requirements:

Definitions

  • Resiliency - The ability of a system to maintain an "acceptable level of service" in the face of adverse conditions.
  • Acceptable level of service - The level of service agreed by the stakeholders to whom the system provides services. It might exhibit some degradation compared to desired performance but is still accepted as performing its function (e.g. the service is not perceived to be fully stopping or halting, to suffer downtime, or otherwise to involve human intervention such as manual restore or recovery operations).

Scope

  • System
  • System's end of any relationships (communications, dependencies)
  • Architecture of the System
  • Resiliency
  • Recovery

Taxonomy

  • System Rating - custom defined - e.g. 1, 2, 3
  • Requirement Applicability - custom defined - e.g. Must Have, Optional
  • Requirement Type - custom defined - e.g. Policy, Implementation
  • Requirement - custom defined.

Resiliency Requirements Framework (Example)

| Requirement Type | Requirement | System Rating 1 | System Rating 2 | System Rating 3 |
|---|---|---|---|---|
| Policy | The system has a clear (stakeholder agreed) definition of the minimum acceptable level of service. | Must Have | Must Have | Optional |
| Implementation | The acceptable level of service definition is expressed quantifiably in terms of availability, latency, performance and integrity requirements. | Must Have | Must Have | Optional |
| Policy | The system has a clear (stakeholder agreed) definition of RPO. | Must Have | Must Have | Optional |
| Policy | The system has a clear (stakeholder agreed) definition of RTO. | Must Have | Must Have | Optional |
| Policy | The system is portable to run on different platforms and vendor services. | Must Have | Optional | Optional |
| Policy | The system maintains back-ups of all critical data points. | Must Have | Must Have | Must Have |
| Implementation | Data Back-ups are taken every X hrs. | Must Have (X=2) | Must Have (X | Must Have (X |
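
To make the idea concrete, here is a minimal sketch of how the taxonomy and a few rows of the example framework could be held as structured data and queried by system rating. It is only an illustration: the class names, the `FRAMEWORK` list and the `must_haves` helper are hypothetical, they are not part of CALM or its schema, and only three of the policy rows from the example are included.

```python
from dataclasses import dataclass
from enum import Enum


class Applicability(Enum):
    MUST_HAVE = "Must Have"
    OPTIONAL = "Optional"


class RequirementType(Enum):
    POLICY = "Policy"
    IMPLEMENTATION = "Implementation"


@dataclass
class Requirement:
    """One row of the framework (hypothetical structure, not a CALM type)."""
    type: RequirementType
    text: str
    # Maps a system rating (1, 2 or 3) to how strictly the requirement applies.
    applicability: dict[int, Applicability]


M, O = Applicability.MUST_HAVE, Applicability.OPTIONAL

# A subset of the example rows above, copied from the table.
FRAMEWORK = [
    Requirement(
        RequirementType.POLICY,
        "The system has a clear (stakeholder agreed) definition of RPO.",
        {1: M, 2: M, 3: O},
    ),
    Requirement(
        RequirementType.POLICY,
        "The system is portable to run on different platforms and vendor services.",
        {1: M, 2: O, 3: O},
    ),
    Requirement(
        RequirementType.POLICY,
        "The system maintains back-ups of all critical data points.",
        {1: M, 2: M, 3: M},
    ),
]


def must_haves(rating: int) -> list[str]:
    """Return the requirement texts that are 'Must Have' for a system rating."""
    return [
        r.text
        for r in FRAMEWORK
        if r.applicability[rating] is Applicability.MUST_HAVE
    ]


if __name__ == "__main__":
    for rating in (1, 2, 3):
        print(f"Rating {rating}:")
        for text in must_haves(rating):
            print(f"  - {text}")
```

Keeping applicability as a per-rating mapping mirrors the table layout, so a requirement can be looked up for any rating without duplicating the row.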

Propose a standard set of Resiliency Requirements Definitions

This working group could/should propose standard resiliency requirement definitions to pick and choose from.

develontopia, May 21 '24 10:05

@rocketstack-matt @develontopia, happy to contribute ideas to this domain. Let me know if there are separate discussions scheduled.

charleyalpha789, Aug 27 '24 15:08

@charleyalpha789 one area that might be of interest is looking at: https://github.com/finos/architecture-as-code/issues/426

Have you attended a CALM monthly meeting? This might be a good place to start and get an introduction to some of the other collaborators.

jpgough-ms, Oct 07 '24 10:10