From here to resilience - a travel guide

May 30, 2024

This slide deck starts with the observation that many companies claim to be resilient but only a few of them really are.

Then it lays out the prototypical journey of a regular company IT department, from leaving availability to operations all the way to becoming a truly resilient organization. Along the way, several interim stops are discussed in terms of their goals, leading questions, typical measures and trade-offs, until the peak of advanced resilience is eventually reached.

Additionally, it discusses whether it is always necessary to aim for the peak or whether one of the interim stops may also be okay, depending on the context. Finally, a quick overview is presented that can help you figure out where an organization currently stands regarding resilience.

As always, the voice track of the presentation is missing. Nevertheless, I hope it still is useful for you and gives you a few ideas to ponder on your own journey towards resilience.

Uwe Friedrichsen

Transcript

From here to resilience - A travel guide
Uwe Friedrichsen – codecentric AG – 2013-2024
Uwe Friedrichsen Works @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/
Our IT is resilient!
Is it?
resilience: The ability to successfully cope with adverse events and situations, including
1. handling expected adverse events and situations (robustness)
2. handling unexpected adverse events and situations (surprise)
3. improving due to adverse events and situations (anti-fragility)
resilient software design: Designing and building software-based systems in ways that improve their dependability and thus support resilience according to the definition above
Sources of failure (examples)
• Hardware failure
• Central process becomes latent
• Firmware bug in infrastructure component
• Cyberattack
• Critical software bug
• Triple redundant data center cooling fails at once
• Competitor launches a disruptive new product
• …
Is your IT prepared to handle all those sources of failure (and many more) swiftly, successfully and gracefully?
Most likely not …
How can we become resilient?
It is not a nice, paved road – sorry
It is rather a mountain climb
Let us explore the path together
Valley of feature-completeness
• Status quo for many organizations
• Dev is responsible for feature delivery
• Ops is responsible for availability
• Dev budget is reserved for implementing business features
• NFRs besides maintainability are “outsourced” to ops
Valley of feature-completeness
• Core driver
  • Maximize business feature throughput
• Leading questions
  • Is the business requirement implemented correctly?
  • How can we implement features faster?
Valley of feature-completeness
• Typical measures
  • Everything ops can influence
  • Redundant hardware and infrastructure components, load balancer with failover, cluster, HA hardware
  • Strict handover rules for dev artifacts
  • Long pre-production testing phases to check for potential production problems
Valley of feature-completeness
• Trade-offs
  • Nice for Dev (less to take care of)
  • Not so nice for Ops (expected to run software reliably that was created without availability as quality goal)
Valley of feature-completeness
• When to use
  • Monolithic and isolated IT systems, exchanging data via batch interfaces
• When to avoid
  • Distributed, interconnected system landscapes, communicating over online interfaces (the default today)
Valley of feature-completeness
• Blind spot
  • Everything besides business features (and maintainability)
• Reality of today’s system landscapes
  • Availability is treated as SEP (somebody else’s problem)
Ops cannot ensure availability alone
Plateau of stability
• Core driver
  • Avoid failure
• Leading questions
  • How can I avoid the failure of my application/service?
  • How can I detect a failure and automatically fail over?
  • How can I avoid an overload situation?
  • How can I detect an overload situation and fix it by automatically scaling up?
Plateau of stability
• Typical measures
  • Redundant service deployment
  • Timeout, error checking, retry, circuit breaker, failover
  • Rate limiting, back pressure
  • Autoscaling
• Measures often statically preconfigured, utilizing middleware whenever possible
• Focus on technical measures only
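To make two of the measures named on this slide more concrete, here is a minimal sketch of a retry loop wrapped around a circuit breaker. It is an illustration under stated assumptions, not the deck's implementation and not the API of any library: the names `CircuitBreaker`, `CircuitOpenError` and `call_with_retry` as well as the parameters `max_failures` and `reset_after` are hypothetical. The called function is assumed to enforce its own timeout.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustration only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call or too many failures (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker, func, attempts=3, delay=0.0):
    """Retry func through the breaker; stop at once when the circuit is open."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # retrying against an open circuit would only add load
        except Exception as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error
```

The retry handles short glitches, while the breaker keeps callers from hammering a service that is already down. In practice such measures are often provided by middleware or a library rather than hand-written.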
Plateau of stability
• Impact radius
  • Technology only
• Collaboration modes
  • Valley of feature-completeness → Ops alone
  • Plateau of stability (basic) → Dev | Ops (mostly independent)
  • Plateau of stability (advanced) → Dev & Ops (feedback loop)
Plateau of stability
• Trade-offs
  • Relatively easy to reach
  • Often supported by middleware and infrastructure means
  • Quite good availability achievable
  • If system parts fail, recovery (and detection) often takes long
  • Works best with ops-dev feedback loop
  • Works well with an economies of scale business model
Plateau of stability
• When stability is fine
  • Not too high availability needs (< 3 Nines)
  • Planned downtimes possible
  • System is not distributed internally
• When stability is not sufficient
  • Higher availability needs (> 3 Nines)
  • System distributed internally (e.g., microservices)
  • Safety-critical systems
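As a rough aid for reading the "3 Nines" threshold, the following sketch converts a number of nines into an availability figure and the downtime budget it implies per year. The function names are hypothetical, and a year is assumed to have 365 days.

```python
def availability_from_nines(nines):
    """Three nines -> 0.999, four nines -> 0.9999, and so on."""
    return 1 - 10 ** (-nines)


def max_downtime_hours_per_year(nines, hours_per_year=365 * 24):
    """Downtime budget per year implied by the given number of nines."""
    return hours_per_year * 10 ** (-nines)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(
            f"{n} nines: availability {availability_from_nines(n):.5f}, "
            f"downtime budget {max_downtime_hours_per_year(n):.3f} h/year"
        )
```

Three nines leave a budget of about 8.76 hours of downtime per year, four nines less than one hour. That gives a feel for why planned downtimes become hard to afford beyond three nines.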
Plateau of stability • Blind spot • 100% available trap
Falling in the “100% available” trap
An example
You: “How do you handle the situation if the service you call does not respond (or does not respond in time)?”
Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.”
Developer 2: “Actually, if that service should be down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”
Variants of the trap
• Infrastructure components will never fail
  • E.g., OS, schedulers, routers, switches, …
• Middleware components will never fail
  • E.g., message queues, databases, …
• All encompassing applications and services will never fail
  • No message loss, latency, response failures, …
The “100% available” trap in a nutshell
“Everything works perfectly, all the time. Nothing ever fails.”
Successor of the “Ops is responsible for availability” mindset
Continuous partial failure is the normal state of affairs. -- Michael Nygard
Source: https://www.cognitect.com/blog/2016/2/3/the-new-normal-failure-is-a-good-thing
Everything fails, all the time. -- Werner Vogels
Failures are inevitable
Availability = MTTF / (MTTF + MTTR)
• MTTF: Mean Time To Failure
• MTTR: Mean Time To Recovery
Our overall aim is to maximize availability
• Stability thinking assumes that MTTF can be increased without limit and thus MTTR can be ignored
• Robustness thinking accepts that increasing MTTF is limited and thus MTTR must be reduced to further increase availability
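A small worked example may help to see why reducing MTTR matters. The sketch below computes availability as MTTF / (MTTF + MTTR) and compares a tenfold increase in MTTF with a tenfold reduction in MTTR. The function name and the numbers are made up for illustration.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    baseline = availability(1000, 1)       # fails every 1000 h, 1 h to recover
    longer_mttf = availability(10000, 1)   # ten times longer between failures
    shorter_mttr = availability(1000, 0.1) # ten times faster recovery
    print(f"baseline:     {baseline:.6f}")
    print(f"10x MTTF:     {longer_mttf:.6f}")
    print(f"MTTR / 10:    {shorter_mttr:.6f}")
```

Only the ratio of MTTF to MTTR enters the formula, so both changes yield the same availability. Once MTTF cannot be pushed further at reasonable cost, shortening recovery is the remaining lever.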
There are failure modes beyond crashes and overload situations
Failure modes (excerpt)
• Crash failure
• Overload failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure
• Software bugs
• Firmware bugs
• Security vulnerabilities
• …
Effects of failure modes (excerpt)
• Lost or incomplete messages
• Duplicate messages
• Latency up to complete standstill
• Out-of-order message arrival
• Partial, out-of-sync local memory
• Split brain
• Persistent malfunction
• Data corruption or loss
• Confidential information leak
Plateau of robustness
• Core driver
  • Maximize availability (embrace failure)
• Leading questions
  • What can go wrong and how can I respond to it?
  • What can I do if a remote service is not available?
  • How can I detect and handle invalid requests (when being called) and responses (when calling)?
  • How can I fix bugs and other defects quickly?
Plateau of robustness
• Typical measures
  • Fallback
  • Complete parameter checking
  • Minimize startup time
  • Deployment automation (CI/CD, IaC/IfC, …)
  • Application and business level monitoring
• Focus extended to business domain
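The first two measures on this slide, fallback and complete parameter checking, can be sketched as follows. This is an assumed example, not taken from the deck: the price service, the function names `validate_quantity` and `price_with_fallback`, and the accepted value ranges are all hypothetical.

```python
def validate_quantity(raw):
    """Parameter checking on the way in: reject values outside the expected domain."""
    if isinstance(raw, bool) or not isinstance(raw, int):
        raise ValueError("quantity must be an integer")
    if not 1 <= raw <= 100:
        raise ValueError("quantity must be between 1 and 100")
    return raw


def price_with_fallback(fetch_price, item_id, cached_prices):
    """Return (price, source); use the last known price if the remote call fails.

    fetch_price stands in for a remote call and may raise or return garbage.
    """
    try:
        price = fetch_price(item_id)
        # Check the response as well: the callee may be up but wrong.
        if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
            raise ValueError("invalid response from price service")
        cached_prices[item_id] = price
        return price, "live"
    except Exception:
        if item_id in cached_prices:
            return cached_prices[item_id], "cached"
        raise  # no sensible fallback available, surface the failure
```

Returning the source alongside the value lets the caller decide, at the business level, whether a possibly stale price is acceptable. Which fallback is acceptable is a business decision, which is why the focus extends to the business domain here.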
Plateau of robustness
• Trade-offs
  • More effort needed to reach
  • Affects not only systems, but also processes a bit
  • Change of mindset required (from avoiding failures to embracing failures)
  • Tight ops-dev collaboration required
  • Very high availability achievable
  • Works well with an economies of speed business model
Plateau of robustness
• When robustness is fine
  • Higher availability needs (> 3 Nines)
  • System distributed internally (e.g., microservices)
• When robustness is not sufficient
  • Safety-critical systems
  • Very high availability needs in highly uncertain technical environments
  • High innovation speed required in highly uncertain business environments
Plateau of robustness
• Impact radius
  • Technology, business domain, touching processes
• Blind spot
  • The limits of perception
Surprises are inevitable
Known knowns: Things we know and are aware of. We usually take these topics into account.
Unknown knowns: Things we implicitly know but are not aware of. We may take these topics into account.
Known unknowns: Things we do not know and are aware that we do not know. We may be aware we ignored these topics.
Unknown unknowns: Things we do not know and are not aware that we do not know. We definitely miss these topics.
Surprises cannot be handled at the technical system level alone
Socio-technical system: The IT systems and the encompassing organization creating, running and changing them. Suitable to respond to adverse events and situations including surprises (resilience).
Technical system: The IT system landscape. Suitable to respond to expected adverse events and situations (robustness).
High-plateau of basic resilience
• Core driver
  • Expect the unexpected
• Leading questions
  • How can I maximize the odds of detecting and responding quickly to an unexpected error before it becomes a failure?
  • Which resources does my IT organization need to be able to respond quickly and successfully to adverse surprises?
  • How can I organize best to be able to respond quickly and successfully to adverse surprises?
  • How do I balance resilience and efficiency?
High-plateau of basic resilience
• Typical measures
  • Self-organized teams
  • Fire drills & chaos engineering
  • Slack in the system
  • Observability
  • Organic computing and residuality theory may support
• Focus extended to whole socio-technical system
High-plateau of basic resilience
• Trade-offs
  • High effort needed to reach
  • Affects the whole socio-technical system
  • Usually needs to reshape collaboration at system boundaries
  • Allows for reliable very high availability even in the face of unexpected adverse situations
  • Enables very high innovation speed without compromising dependability even in highly uncertain environments
High-plateau of basic resilience
• When to use
  • Safety-critical systems
  • Very high availability in highly uncertain technical environments
  • High innovation speed in highly uncertain business environments
High-plateau of basic resilience
• Impact radius
  • Technology, business domain, processes, organization
  • Focus on withstanding and quick recovery
• Blind spot
  • Standing still
We need to leverage our learnings to continuously improve
Resilience response types
• Withstand: Resist adversities (covered by basic resilience)
• Recover: Quickly recover (covered by basic resilience)
• Adapt: Learn and improve (covered by advanced resilience, anti-fragility)
• Transform: Radically change (covered by advanced resilience, anti-fragility)
Peak of advanced resilience
• Core driver
  • Adversity as opportunity to improve
• Leading questions
  • How do I need to adapt at all levels to improve my ability to handle adverse situations successfully?
  • Is adaptation enough or do I need a more radical change to reduce my vulnerability to adverse situations?
  • How can I establish a continually learning and improving organization?
  • How do I need to shift and change my organizational boundaries to become less vulnerable to adverse situations?
Peak of advanced resilience
• Typical measures
  • Culture of continuous learning and improvement
  • System thinking (improving the system, not only the parts)
  • Leapfrogging
  • Perception shift from nuisance to opportunity
• Impact radius
  • Technology, business domain, processes, organization
  • Focus on all resilience response types (as needed)
Peak of advanced resilience
• Trade-offs
  • Effort required comparable to basic resilience
  • Requires different mindset regarding adverse situations
  • Adverse situations make you stronger
  • Will eventually affect the whole company (won't stop at the boundaries of the IT organization)
• When to use
  • Prepare for a successful endless game in an increasingly uncertain world ("VUCA")
Yay! We made it to the peak!
Does it always have to be the peak?
Stops visited: core driver, blind spot, when (not) to go for it
Valley of feature-completeness
• Core driver: Maximize business feature throughput
• Blind spot: Availability is SEP
• When (not) to go for it: Okay for isolated systems; not advisable for distributed, online communicating system landscapes (which is the norm)
Plateau of stability
• Core driver: Avoid failure
• Blind spot: 100% available trap
• When (not) to go for it: Not too high availability demands (< 3 Nines); planned downtimes possible; system not distributed internally
Plateau of robustness
• Core driver: Maximize availability
• Blind spot: Limits of perception
• When (not) to go for it: High availability demands (> 3 Nines); system distributed internally (e.g., microservices)
High-plateau of basic resilience
• Core driver: Expect the unexpected
• Blind spot: Standing still
• When (not) to go for it: Safety-critical systems; high availability in unpredictable technical environments; uncertain business environments
Peak of advanced resilience
• Core driver: Adversity as opportunity to improve
• Blind spot: -
• When (not) to go for it: Successful endless game in an increasingly uncertain world
Remember: Business and IT have become inseparable
Depending on your task and your needs, you do not always need to aim for the peak
However, in the long run only those will thrive and survive in an increasingly VUCA world who aim for the peak
Remember: Business and IT have become inseparable
How can I check where I am?
[Table: self-assessment matrix. Columns: Feature-complete, Stability, Robustness, Basic resilience, Advanced resilience. The cell layout did not survive extraction; the rows and their values, in ascending order of maturity, are:]
• Collaboration: Ops alone, Dev | Ops, Dev + Ops
• Availability: Maximize MTTF, Minimize MTTR
• Failures: Crash, Overload, All failure types
• Responses: Withstand, Recover, Adapt, Transform
• Systems: Technical, Socio-technical
• Surprises: Known, Unknown
• Impact: Technology, Business logic, Processes, Organization
Wrap-up
Wrap-up
• Resilience is not what you think it is
• The difficult path up Mount Resilience
• How far to go and when it is okay to stop
• Understanding where you are
How far will you climb up?
Uwe Friedrichsen Works @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/