So you’ve decided to study for Azure certifications and you’re interested in learning more about resiliency. Whether you’re learning about individual cloud products or combining them, you’ll likely come across the whitepaper Resilience on Azure.
See the parent post, A Review of Resiliency in Azure, for a broader discussion of this whitepaper.
A Better Background in Resiliency
Unfortunately, this whitepaper jumps directly into basic resiliency concepts without regard for what type of engineer or architect you are. It attempts to cover resiliency concepts across Infrastructure, Data, and Applications all at once, with little regard for the other ways enterprises may be organized. It also skips the hierarchies and types of “Resiliency”.
Essentially, “Resiliency” is a cross-function, cross-org, cross-discipline concept. There are many areas in which to improve “resiliency”, ranging from Change Management to Design Patterns. You can be in Engineering, Operations, or Architecture and resiliency will still apply to you (though differently).
Resiliency in software is going through re-definition and clarification (see the Systems Engineering Body of Knowledge). In some ways, the software world is reacting to an increased need for and focus on resiliency through Resiliency Engineering, Chaos Engineering, Site Reliability Engineering, and other disciplines that focus on addressing failures across platforms and dependencies in cloud-based, distributed systems of software and people. As a practice and discipline, both the definition and the tooling of Resiliency Engineering are still evolving.
The whitepaper itself should provide background on resiliency and align with an industry definition.
What are you solving for when you’re solving “Resiliency”?
There are many performance metrics for understanding a system, and there have been many attempts to define “resiliency”. According to the Systems Engineering Body of Knowledge (SEBoK), one definition is the ability to provide required ‘capability’ in the face of ‘adversity’.
Capability can be roughly understood as the expected behavior of the system, and adversity as anything that would impact that expected behavior. Expected behavior can range from functional (e.g. user interaction) to non-functional (e.g. system performance) aspects of the system, and it spans technical domains (e.g. application, data, infrastructure, and others).
Common Strategies for Improving Resiliency
MITRE is one of the leaders in Systems Engineering and a large contributor to Systems Engineering research. They define System Resilience as follows (translated here into Non-Functional Requirements that software developers will recognize):
- Avoid Adversity: How do you know adversity is going to happen, and what proactive measures can you take to avoid system and user-facing issues?
- Withstand Adversity: When adversity has occurred, what does the technical system do to continue to provide a technical service and/or user experience?
- Recover from Adversity: When adversity has occurred, what do the technical system and/or human process do to recover, whether manually or automatically?
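To make the three strategies a little more concrete for developers, here is a minimal Python sketch of how they might show up around a single dependency call. Everything in it is an assumption made for illustration: `resilient_call`, `make_flaky`, the health probe, and the `recover` hook are hypothetical names, not part of the whitepaper, MITRE's material, or any Azure SDK. Real systems would spread these concerns across monitoring, deployment, and operations rather than one function.

```python
import time
from typing import Callable, List, Tuple


def resilient_call(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    is_healthy: Callable[[], bool],
    recover: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> Tuple[str, List[str]]:
    """Return (result, events); events records which strategy was exercised."""
    events: List[str] = []

    # Avoid: a proactive health probe lets us skip a dependency known to be down.
    if not is_healthy():
        events.append("avoid")
        return fallback(), events

    # Withstand: retry transient failures, backing off exponentially between tries.
    for attempt in range(max_attempts):
        try:
            return primary(), events
        except ConnectionError:
            events.append("withstand")
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))

    # Recover: retries are exhausted, so trigger the recovery hook
    # (hypothetical: e.g. restart a worker or page an operator) and
    # serve a degraded result in the meantime.
    events.append("recover")
    recover()
    return fallback(), events


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a hypothetical dependency that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def call() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionError("transient failure")
        return "fresh"

    return call


# Two transient failures are absorbed by the retries.
result, events = resilient_call(make_flaky(2), lambda: "cached", lambda: True, lambda: None)
print(result, events)  # fresh ['withstand', 'withstand']
```

The returned `events` list is only there to make the behavior visible. In practice that information would more likely go to logs or metrics, which in turn feed the "avoid" question of knowing that adversity is coming.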