How we do postmortems and incident review at Xfers

Victor Liew
Published in Xfers Engineering
Dec 24, 2019

Xfers has been embarking on various infrastructure projects to meet the increasing demands and load placed on our systems. As we scale and our architecture grows more complex, the risk of a system outage caused by bugs or inadequate engineering processes (a teething issue that startups often face) becomes more and more apparent.

In this article, I want to share how we approach postmortems at Xfers. Postmortems are one of the most straightforward software engineering processes a startup can implement to avoid repeating the same mistakes.

To this end, I want to highlight the importance of a blameless postmortem process, as explained by John Allspaw of Etsy:

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at that time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of the timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.

The primary objective of a postmortem is to gain clarity on how the system functions, or on where our engineering processes lapsed in ways we had not anticipated, and to prevent similar incidents in future. A useful incident review tells the story of what happened, in chronological order. For that reason, we write the incident report no later than five days after the incident, while the event is still fresh in mind.

Here’s a simple template that we use internally on our wiki:

  1. Executive summary of the incident
  2. Analysis of the root cause that triggered the event (along with any other discoveries and realisations)
  3. Impact of the event
  4. Measures taken to address the root cause and the consequences of the event
  5. Follow-up action tickets created, and topics noted for discussion
  6. How often a similar incident has occurred over the last 3 years
  7. Severity, measured by the number of customers affected
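As a rough illustration, the template above could be modelled as a small record so that every report carries the same fields. This is only a sketch in Python: the class name `IncidentReview`, its field names, and the helpers `report_due` and `missing_sections` are hypothetical and are not part of our internal wiki tooling. The five-day window is the guideline mentioned earlier.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List

# Assumption: the five-day reporting window from the guideline above.
REPORT_WINDOW_DAYS = 5


@dataclass
class IncidentReview:
    """Hypothetical record mirroring the seven template sections."""

    occurred_on: date
    executive_summary: str
    root_cause_analysis: str
    impact: str
    measures_taken: str
    follow_up_tickets: List[str] = field(default_factory=list)
    similar_incidents_last_3_years: int = 0
    customers_affected: int = 0  # severity is measured by this number

    def report_due(self) -> date:
        """Latest date by which the report should be written."""
        return self.occurred_on + timedelta(days=REPORT_WINDOW_DAYS)

    def missing_sections(self) -> List[str]:
        """Names of the free-text sections that are still empty."""
        sections = {
            "executive_summary": self.executive_summary,
            "root_cause_analysis": self.root_cause_analysis,
            "impact": self.impact,
            "measures_taken": self.measures_taken,
        }
        return [name for name, text in sections.items() if not text.strip()]
```

A check like `missing_sections()` could be run before a report is filed, so that an incomplete review is caught while the details are still fresh.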

To keep things simple, everyone follows this template once an incident has occurred.

Each incident is then filed neatly in tabular form. Here’s an example of what it looks like internally:

The owner of an incident review is usually determined in the following order:

  1. The person who caused the event writes the incident review report.
  2. If the cause is unknown, it is up to the team to write the incident review. The team lead or engineering manager is responsible for ensuring that it gets written.
  3. Lastly, the CTO ensures that every incident report has an owner. If there is no owner, the CTO owns the process above.
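The fallback order above can be sketched as a short function. This is a minimal illustration rather than anything we run internally: the function name `resolve_owner` and its parameters are hypothetical, and it assumes the CTO is always known.

```python
from typing import Optional


def resolve_owner(
    person_responsible: Optional[str],
    team_lead: Optional[str],
    cto: str,
) -> str:
    """Pick the incident review owner using the fallback order above."""
    # 1. The person whose action triggered the event writes the report.
    if person_responsible:
        return person_responsible
    # 2. Cause unknown: the team writes it, and the team lead or
    #    engineering manager is accountable for getting it written.
    if team_lead:
        return team_lead
    # 3. No owner found: the CTO owns the report.
    return cto
```

Making the last step unconditional reflects the intent of the list: an incident review should never be left without an owner.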

Why do AARs (after-action reviews)?

In my opinion, the real value of postmortems comes from building a positive culture of ownership, responsibility, urgency and frequent improvement.

Other side benefits include:

  1. A clear communication process when an incident happens: teams from the various departments get one official meeting in which to ask questions about the incident and give their input, instead of groups of people setting up individual meetings with one another and wasting time.
  2. A repository of past learnings that helps inform architectural and product decisions. Teams from other departments can also stay up to date from this central repository, confident that the AAR process will be carried out and notes will be taken.
