Below you will find pages that use the taxonomy term “Operational Excellence”.
Posts
Operational Excellence: Limit Blast Radius
Trying to prevent all failures is a foolish endeavor. That doesn’t mean you shouldn’t prevent and reduce failures as much as you can, but limiting the impact of failures has turned out to be the most effective strategy I’ve found so far for improving Operational Excellence. In this article, we’ll build a taxonomy of failure types and explore strategies to limit the impact of each failure group on your software system.
Posts
Operational Excellence: Learning from failure
Failure is inevitable. But every failure is an opportunity to learn and to harden your system against more catastrophic failures down the road. Blameless post-mortems are a very good tool for this and have already been widely adopted in the industry. Of course, the detail and diligence of the process vary a lot, so I’ve taken a stab at putting together everything I know about what makes a post-mortem great.