Common approaches to post-incident reviews are often short-sighted in their focus and rarely bring about any real improvements to our overall systems.
This talk will provide insight into new ways teams are analyzing incidents after the fact to continuously improve system uptime.
Many organizations have found great value in retrospective analysis following incidents that impact the reliability and availability of a service. In these analyses, commonly known as post-incident reviews or postmortems, companies routinely examine what went wrong. This talk will point out the true value of a post-incident review and show how to perform one so that it surfaces as many improvements as possible to an organization’s people, process, and technology.
Let’s develop a deeper understanding of failure in complex systems and explore the key metrics used to consistently improve the availability and reliability of systems. Jason will point out common flaws in the way many organizations approach retrospective analysis of outages and service disruptions, and uncover areas often overlooked during a retrospective (such as what engineers were thinking when they made their decisions).
Drawing on the new O’Reilly Media book “Post-Incident Reviews: Learning From Failure for Improved Incident Response”, the talk will give the audience a broader understanding of the purpose of post-incident reviews and of how to get started on a new path towards continuously improving the uptime of systems and services. Jason will also provide a template for audience members to take back to their teams as a starting point for a new approach.
Audience challenges & takeaways: