PWL NY: Simple Testing Can Prevent Most Critical Failures
Caitie McCaffrey
June 14, 2016
Transcript
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Papers We Love New York - June 2016
Caitie McCaffrey
@caitie
Distributed Systems Engineer
CaitieM.com
Analyzed Failures in Real World Systems
Complexity of Failures
“A majority (77%) of failures require more than one input event to manifest, but most of the failures (90%) require no more than 3”
Complexity of Failures
“The specific order of events is important in 88% of the failures that require multiple events”
Complexity of Failures
“3 Nodes or less can reproduce 98% of Failures”
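Because most failures need at most 3 input events and the order of those events usually matters, trying every ordering of a handful of events against a small model is cheap. The sketch below is a toy illustration only: the `Node` class, its three events, and its deliberately planted bug are hypothetical and do not come from the systems studied in the paper.

```python
from itertools import permutations


class Node:
    """Toy single-node model with an in-memory buffer and a disk."""

    def __init__(self):
        self.buffer = []
        self.disk = []
        self.acked = []

    def write(self):
        self.buffer.append("v")
        self.acked.append("v")  # planted bug: acknowledged before it is durable

    def flush(self):
        self.disk.extend(self.buffer)
        self.buffer.clear()

    def crash(self):
        self.buffer.clear()  # in-memory data is lost on restart


def failing_orders(events=("write", "flush", "crash")):
    """Return every ordering of the events that loses an acknowledged write."""
    bad = []
    for order in permutations(events):
        node = Node()
        for event in order:
            getattr(node, event)()
        # Invariant: every acknowledged write is still in the buffer or on disk.
        if len(node.acked) > len(node.disk) + len(node.buffer):
            bad.append(order)
    return bad
```

With three events there are only six orderings to check, and in this toy model two of them violate the invariant.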
Unit Tests
“A majority of production failures (77%) can be reproduced by a unit test”
Top Down Fault Injection & State Space Exploration is Expensive
Logging
• 76% of the failures print explicit failure-related error messages
• For 84% of the failures, all of the triggering events are logged
• Logs are noisy: each failure prints 824 log messages (median)
Catastrophic Failures
Error Handling
• 92% of failures were the result of incorrect handling of non-fatal errors
• 58% of faults could have been detected via simple testing
• 35% of failures were caused by bad practices in error handling code
Bad Practices
• Error handling code is simply empty or only contains a log statement
• Error handler aborts cluster on an overly general exception
• Error handler contains comments like FIXME or TODO
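To make the three patterns concrete, here is a minimal Python sketch of each, followed by one possible alternative. The study looked at Java systems; the function names and the `flush` callable here are hypothetical and only stand in for any operation that can raise a non-fatal error.

```python
import logging
import sys

log = logging.getLogger("example")


def save_swallowing(flush):
    """Bad practice 1: the handler only logs, so the caller never learns it failed."""
    try:
        flush()
    except OSError as err:
        log.warning("flush failed: %s", err)
    return True  # reports success even when flush() raised


def save_overcatching(flush):
    """Bad practice 2: an overly general catch turns any error into an abort."""
    try:
        flush()
    except Exception:
        sys.exit(1)


def save_unfinished(flush):
    """Bad practice 3: the handler is a placeholder that was never completed."""
    try:
        flush()
    except OSError:
        pass  # TODO: handle this


def save_checked(flush, retries=2):
    """One possible alternative: retry a narrow error type, then report failure."""
    for attempt in range(retries + 1):
        try:
            flush()
            return True
        except OSError as err:
            log.warning("flush attempt %d failed: %s", attempt + 1, err)
    return False
```

The alternative is only one option; the point is that the error is neither hidden from the caller nor escalated into a process-wide abort.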
Aspirator
Performs static analysis of Java bytecode to detect:
• error handler is empty
• error handler over-catches exceptions and aborts
• error handler contains phrases like “TODO” or “FIXME”
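The same three checks can be approximated for other languages. The sketch below is not Aspirator and does not analyze Java bytecode; it is a rough analog that uses Python's standard `ast` module on Python source. The function names and the list of calls treated as "aborts" are assumptions made for illustration.

```python
import ast

MARKERS = ("TODO", "FIXME")
ABORT_NAMES = {"exit", "_exit", "abort"}  # assumed set of abort-like calls


def _is_abort(node):
    """True for calls such as sys.exit(...), os._exit(...), or exit(...)."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
    return name in ABORT_NAMES


def _is_noop(stmt):
    """True for `pass` and for bare constants such as `...`."""
    return isinstance(stmt, ast.Pass) or (
        isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)
    )


def check_handlers(source):
    """Return (line, finding) pairs for suspicious except blocks in Python source."""
    lines = source.splitlines()
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Check 1: the handler does nothing.
        if all(_is_noop(stmt) for stmt in node.body):
            findings.append((node.lineno, "empty handler"))
        # Check 2: a bare or very general catch whose body aborts the process.
        general = node.type is None or (
            isinstance(node.type, ast.Name)
            and node.type.id in {"Exception", "BaseException"}
        )
        if general and any(_is_abort(n) for s in node.body for n in ast.walk(s)):
            findings.append((node.lineno, "general catch aborts"))
        # Check 3: TODO/FIXME in the handler's source lines (comments included).
        text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        if any(marker in text for marker in MARKERS):
            findings.append((node.lineno, "TODO/FIXME in handler"))
    return sorted(findings, key=lambda finding: finding[0])
```

Calling `check_handlers` on a module's source text returns the line of each flagged `except` clause. A comment that sits on its own line after the handler's last statement falls outside the span the parser reports, so this sketch will miss it.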
Aspirator Results
• 500 New Bugs & Bad Practices
• 115 False Positives
• 171 bugs reported
• 143 bugs confirmed or fixed
Developer Reactions
“I fail to see the reason to handle every exception”
- developer
“It is often much harder to reason about the correctness of a system’s abnormal path than its normal execution path”
Moving Forward
• Use a tool like Aspirator that is capable of identifying trivial bugs
• Enforce code reviews of error handling code
• High code coverage on error handling code
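Coverage of error handling code usually means forcing the error to happen in a test. The sketch below shows one way to do that with a test double that injects the fault. Everything in it (`ReplicaUnavailable`, `replicate`, `FlakyReplica`) is hypothetical and chosen for illustration, with three replicas as a nod to the finding that 3 nodes or fewer reproduce most failures.

```python
import unittest


class ReplicaUnavailable(Exception):
    """Hypothetical non-fatal error raised when a replica cannot be reached."""


def replicate(record, replicas):
    """Send the record to each replica; return the names of those that failed."""
    failed = []
    for replica in replicas:
        try:
            replica.send(record)
        except ReplicaUnavailable:
            failed.append(replica.name)  # handled: recorded so the caller can retry
    return failed


class FlakyReplica:
    """Test double that raises the fault on demand."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.received = []

    def send(self, record):
        if not self.healthy:
            raise ReplicaUnavailable(self.name)
        self.received.append(record)


class ReplicateErrorPathTest(unittest.TestCase):
    def test_unavailable_replica_is_reported_and_others_still_written(self):
        a = FlakyReplica("a")
        b = FlakyReplica("b", healthy=False)
        c = FlakyReplica("c")
        failed = replicate("x=1", [a, b, c])
        self.assertEqual(failed, ["b"])
        self.assertEqual(a.received, ["x=1"])
        self.assertEqual(c.received, ["x=1"])
```

The test exercises the `except` branch directly, so a handler that swallowed the error or aborted the loop would fail the assertions.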
Questions @caitie