
Refine definitions of Observability Domain(s) and Introduce Architecture Observability as a Capability

Open mgasca opened this issue 2 years ago • 6 comments

Feature Request

Description of Problem:

The image below is from @rocketstack-matt, shared at the London on-site accelerator: [image: Screenshot 2023-11-03 at 11 12 12]

Refine definitions of Observability Domains

When discussing Observability I want to ensure that all conversational participants are on the same page, so that we can have clear, concise and valuable conversations about how Observability fits into AasC.

System Observability

In the diagram shared above, Observability is specifically System Observability, which is the Domain of Observability that most people are familiar with. There are other Domains of Observability that are relevant here; I mention a couple below.

As we know, System Observability is a qualitative characteristic that describes how well we are capable of understanding the internals of our Systems by virtue of the external signals they provide. The three canonical pillars of System Observability are Logs, Metrics and Distributed Traces. IMO, however, Logs are just a degenerate form of Events (although some may argue that Events should just be considered a fourth pillar alongside the other three).

System Observability only begins to provide value once we can derive and convey Actionable Insights. One way Actionable Insights can be realized is in terms of Monitors. Monitors use concepts such as Thresholds/Capacity to Trigger Actions. These Actions can be as simple as Alerting via Channels and as complex as triggering downstream automation pipelines that transform code or are part of a set of Actions that implement Self-Healing features (or more simply Elasticity) for Systems.
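The Monitor idea described above can be sketched in a few lines. This is only an illustrative sketch: the `Monitor` class, the `alert_channel` function and the `p99_latency_ms` metric name are hypothetical and are not part of any existing AasC tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Monitor:
    """Watches one metric and runs its actions when a threshold is breached."""
    metric_name: str
    threshold: float
    actions: List[Callable[[str, float], None]] = field(default_factory=list)

    def observe(self, value: float) -> bool:
        """Return True (and run every action) if value exceeds the threshold."""
        breached = value > self.threshold
        if breached:
            for action in self.actions:
                action(self.metric_name, value)
        return breached


# Hypothetical action: a stand-in for alerting via a channel.
alerts: List[str] = []


def alert_channel(metric: str, value: float) -> None:
    alerts.append(f"ALERT {metric}={value}")


monitor = Monitor("p99_latency_ms", threshold=500.0, actions=[alert_channel])
monitor.observe(120.0)  # below the threshold, no action runs
monitor.observe(750.0)  # breach, alert_channel runs
```

A more complex Action (for example kicking off an automation pipeline) would simply be another callable in `actions`.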

As it relates to Architecture as Code, System Observability will be one of the sources of Metrics that will drive Fitness Scoring calculations and Indicators for Threshold breach calculations in Monitors that serve as (Holistic, Continuous) Fitness Functions.

image

ref Software Architecture Metrics (Ciceri, Farley, Ford, Harmel-Law, et al.)

Data Observability

Data Observability is how well we can understand our Data by virtue of signals that are generated regarding our Data.

There have actually been good Logs/Events for various data-stores/databases for a very long time, and they are structured much better than things like Application/Service Logs. Audit tables are one form, as are the bi-temporal aspects of bi-temporal stores. For ACID databases, the Transaction Logs are actually perfect Logs. Each Atom in a Time-Series Database is actually a Log/Event for that Database. For services/bounded contexts that leverage Event Sourcing, the Event Store is a set of perfect logs for that store.

The dual to System Observability's Distributed Tracing in the domain of Data Observability is Data Lineage, which is included under the Data Domain in the image from @rocketstack-matt above. Distributed Traces (Spans) are signals that describe the flow of logic over time; Data Lineage is a set of signals that describe the flow of Data over time.

With regard to Data Metrics, what are the important Metrics and what shape do they take? It depends. What it depends on is what Insights we are trying to derive in order to fuel specific Actions. This is the same in all Observability Domains. The important Data Metrics will likely include things like Freshness.

We will likely want a Capability that subsumes Data Quality/Freshness/Consistency (is this the existing Data Catalog Capability in the image above?). In the same way that the actual signals from the System Observability domain will fuel Fitness Score calculations and System Fitness Functions, Data Observability will fuel Data Freshness etc. calculations and inform Monitors that implement Data Fitness Functions around things like Quality/Consistency.
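As a rough illustration of a Freshness calculation feeding a Data Fitness Function, here is a minimal sketch. It assumes Freshness is simply the age of the most recent update; the function names and the timestamps are hypothetical, and a real definition of Freshness may well differ.

```python
from datetime import datetime, timedelta, timezone


def freshness_seconds(last_updated: datetime, now: datetime) -> float:
    """Age of the most recent update, in seconds (the assumed Freshness metric)."""
    return (now - last_updated).total_seconds()


def is_fresh(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Data Fitness Function: passes while the data is no older than max_age."""
    return freshness_seconds(last_updated, now) <= max_age.total_seconds()


# Hypothetical timestamps, for illustration only.
now = datetime(2023, 11, 28, 17, 0, tzinfo=timezone.utc)
last_load = datetime(2023, 11, 28, 16, 30, tzinfo=timezone.utc)

age = freshness_seconds(last_load, now)
within_an_hour = is_fresh(last_load, now, timedelta(hours=1))
within_15_minutes = is_fresh(last_load, now, timedelta(minutes=15))
```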

Code Observability

Code Observability is how well we can understand our Code by virtue of signals that are generated regarding our Code.

Source Control Management systems, similar to databases/stores, have been providing well-structured logs for a long time. We just need to avail ourselves of them. Considering Git, even if a commit is not technically a simple diff, we can think of it as such. A Commit and its Diffs are therefore the single Event/Log Line for Code Observability. So we have perfect logs for Code Observability in the form of Git history.
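To make "Git history as a log" concrete, here is a minimal sketch that turns `git log` output into Event records. It assumes the log was produced with `git log --pretty=format:'%H|%an|%aI|%s'` (hash, author name, ISO 8601 author date, subject) and that author names contain no `|`. The `CommitEvent` type and the sample lines are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class CommitEvent:
    """One Commit treated as a single Event/Log Line."""
    sha: str
    author: str
    timestamp: datetime
    subject: str


def parse_git_log(text: str) -> List[CommitEvent]:
    """Parse lines of the form 'sha|author|iso-date|subject' into events."""
    events = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # maxsplit=3 keeps any '|' characters inside the subject intact.
        sha, author, when, subject = line.split("|", 3)
        events.append(
            CommitEvent(sha, author, datetime.fromisoformat(when), subject)
        )
    return events


# Hypothetical sample output, not taken from a real repository.
sample = (
    "a1b2c3|alice|2023-11-27T09:00:00+00:00|Add ingress gateway node\n"
    "d4e5f6|bob|2023-11-28T17:11:00+00:00|Update ADR | fix typo\n"
)
events = parse_git_log(sample)
```

From here, Metrics such as commits per day or time between commits touching a given Artifact are simple aggregations over `events`.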

I would suggest that the dual to Distributed Tracing in Code Observability is Merges/Pushes. These describe the flow of Code between Branches and Repositories over time.

Again, the types and shape of Code Observability Metrics depend on what Insights we want to derive and what Actions we want the Insights to Trigger. In fact, I believe that the set of Metrics we care about may be completely different for each specialization of "X as Code", e.g. the Metrics for Architecture as Code may be different to those for Infrastructure as Code.

If we get Code Observability right, then we automatically get things like Architecture as Code Observability, Infrastructure as Code Observability, etc. Defining AasC Metrics for the Architecture version of DORA metrics will give us insight into the efficiencies (and bottlenecks) in our Architecture Practices as they relate to producing Architecture Artifacts. For reference, see this point in Andrew Harmel-Law's recorded talk A Commune in the Ivory Tower. He talks there about ADRs specifically, but I believe the same or similar can be derived for other Architecture Artifacts to measure our efficiencies.

Introduce Capability for Architecture as Code Observability

I believe we want to define and implement a Capability for Observability that builds on the realization of the Domain(s) of Observability. The Domains are how we describe these Observability Domains layered on top of the base AasC Schema. The Capability is where we define the Shape of the Metrics/Events we care about in the Observability Domains. These Metrics/Events will feed into our calculations (Data Freshness, System Fitness Scores), feed into Monitors that implement Fitness Functions, and finally serve as Triggers for entry into other Capabilities like Drift Detection.

For example, in an idealized future: A Commit Event/Log that introduces/makes a change to an Infrastructure as Code Artifact is used by a Monitor to Trigger an Action (workflow) to re-calculate Drift Detection. As part of this workflow the new IasC Artifact is fed through a Transformer, which produces an AasC Artifact following the schema we define. This Transformer potentially also pulls in other linked AasC Artifacts, like ADRs that specify what infrastructure is used in between Ingress Gateways and Service Instances, which would allow the Transformer to know which Relationships and Entities to Remove/Compress or Add/Expand depending on which direction the Transformation is going in. Then this newly produced AasC Manifest is diff'ed against the existing committed AasC Manifest to determine and calculate Degree of Drift. The workflow may also:

  • Branch/commit/push/PR this new AasC Manifest for review.
  • Publish Degree of Drift as a Metric/Event.
  • Use the published Degree of Drift to trigger Alerts or Invalidate a Release. This is essentially a Monitor of Degree of Drift that triggers downstream automation, a.k.a. a Fitness Function for Degree of Drift.
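The diff and Fitness Function steps of this workflow could look roughly like the sketch below. Everything in it is an assumption for illustration: the manifest shape (a dict with `nodes` and `relationships`) is not the AasC schema, and Degree of Drift is defined here as the share of elements that differ between the two manifests (a Jaccard distance), which is only one possible definition.

```python
from typing import Dict, Set


def manifest_elements(manifest: Dict[str, list]) -> Set[str]:
    """Flatten a (hypothetical) manifest into a set of comparable element keys."""
    nodes = {f"node:{name}" for name in manifest.get("nodes", [])}
    rels = {f"rel:{src}->{dst}" for src, dst in manifest.get("relationships", [])}
    return nodes | rels


def degree_of_drift(committed: Dict[str, list], generated: Dict[str, list]) -> float:
    """Assumed metric: differing elements divided by all elements (0.0 to 1.0)."""
    a, b = manifest_elements(committed), manifest_elements(generated)
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


def drift_fitness_function(drift: float, max_drift: float = 0.2) -> bool:
    """Passes while drift stays at or below the (hypothetical) threshold."""
    return drift <= max_drift


# Hypothetical manifests: the generated one adds a cache and one relationship.
committed = {
    "nodes": ["gateway", "service"],
    "relationships": [("gateway", "service")],
}
generated = {
    "nodes": ["gateway", "service", "cache"],
    "relationships": [("gateway", "service"), ("service", "cache")],
}

drift = degree_of_drift(committed, generated)
release_ok = drift_fitness_function(drift)
```

Here two of the five distinct elements differ, so the release would be invalidated under the assumed 0.2 threshold.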

mgasca avatar Nov 28 '23 17:11 mgasca