How to Run a Better Postmortem
There's a meeting every engineering team runs after something breaks. It has a name. It has a template. It has a facilitator who says "let's focus on the timeline" and an engineer who says "I think we need better monitoring."
Everyone nods. Action items are assigned. The doc goes into a folder nobody opens. Six months later, a structurally identical incident happens. The same meeting runs again.
This is the postmortem ritual. It follows a script: timeline, contributing factors, action items, close. The format looks rigorous. The output is performative. The action items are safe. "Add monitoring." "Improve documentation." "Create a runbook." Nobody's career is threatened by a runbook.
I've sat through hundreds of these. The pattern is always the same. The room converges on the cause that's easiest to fix, assigns it to whoever's least likely to push back, and calls it done. The ritual creates the illusion of learning without the discomfort of actual learning.
Why "blameless" became "toothless"
Blameless postmortems were a genuine improvement. The old model -- find the person who made the mistake, punish them -- was destructive. It drove incidents underground. People hid errors instead of reporting them. Removing blame was necessary.
But something got lost in translation.
"Blameless" was supposed to mean "safe to be honest about what happened." In practice, it became "nobody is accountable for the system that produced this." The postmortem stops where discomfort starts. The process failure gets noted. The incentive misalignment doesn't. The organizational structure that made the incident inevitable -- nobody mentions it because that's not "blameless," that's "career-limiting."
Blameless doesn't mean toothless. It means the person who made the mistake can speak freely about why the system set them up to make it. It means asking "what about our process made this outcome likely?" instead of "who do we need to retrain?"
That's a harder conversation than most postmortems are willing to have. It's also the only conversation that produces change.
The comfort problem
Every postmortem has a moment. Someone says something close to the real cause. The room gets quiet. Then a senior person says "let's focus on what we can control" or "let's keep this actionable." The conversation redirects to tooling.
That redirection is the postmortem failing. Not because the facilitator is bad at their job. Because the postmortem's implicit contract is to produce closure, not truth. And closure means stopping before the conversation gets uncomfortable.
Real causes implicate process, incentives, and organizational structure. The deployment failed because the team was pressured to ship before they were ready. The outage happened because reliability work isn't valued in promotion reviews. The data loss occurred because the person who flagged the risk was overruled by someone who wouldn't face consequences if things went wrong.
Those are uncomfortable truths. So the postmortem settles for "we need better tests" -- not because it's wrong, but because it's safe. And safe action items produce safe outcomes, which is to say: nothing changes.
Good coaches and structured excavation tools exist precisely for this problem -- they keep asking the next question after your comfortable first answer. If your postmortem keeps landing on tooling fixes for systemic problems, start your own excavation and see where the real thread leads.
What a postmortem should actually do
A postmortem that works does four things most postmortems skip:
Surface assumptions before analyzing. The way you frame the incident determines what you'll find. "The deploy failed" points at the deploy process. "A change reached production that shouldn't have" points at the entire pipeline of decisions upstream. Frame wrong, and you'll find the wrong cause with perfect confidence.
Branch, don't tunnel. Real incidents have multiple contributing causes. Following one thread and ignoring the others is how you end up with a root cause that explains 30% of what happened. Recurring problems almost always have multiple branches. Follow them.
Challenge before declaring. Before you write "root cause" on anything, try to disprove it. What evidence would make this wrong? If you can't answer that question, you don't have a root cause -- you have a hypothesis you stopped testing. This is the step that separates analysis from storytelling.
Reject "human error" as a finding. If your root cause is "someone made a mistake," you haven't found a root cause. You've found where the analysis gave up. The question is always: why did the system allow this mistake to have consequences?
A better postmortem structure
Here's a five-step structure that forces the postmortem past the comfort barrier. It's not complicated. It's just harder to do than the standard template because it doesn't let you stop early.
STATE. Write one sentence describing what happened. Get everyone in the room to agree on it. This sounds trivial. It isn't. If your SRE says "the database went down" and your product manager says "users couldn't access their data for three hours," those are different framings that lead to different root causes. One points at infrastructure. The other points at user impact and recovery. Agree on the statement before you analyze anything.
SURFACE. Name the assumptions in your framing. "The database went down" assumes the database is the problem unit. What if the problem unit is "the system had a single point of failure that nobody was incentivized to fix"? What assumptions are you carrying about who was responsible, what should have been caught, and why it wasn't? As we explored in You Don't Have a Problem, the stated problem is almost never the real one.
DRILL. This is the "why" step, but structured. Instead of open-ended "why did this happen?" -- which lets people gravitate to comfortable answers -- present specific options:
Which is closer to the truth?
A) The team didn't know this was a risk.
B) The team knew but didn't have time to address it.
C) The team flagged it and was told to ship anyway.
Each option leads somewhere different. Option A is a knowledge problem. Option B is a prioritization problem. Option C is a power-structure problem. Open-ended "why" lets the room drift to A when the truth is C.
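One way to keep this step honest is to write the forced-choice options down as data before the meeting, so the facilitator reads specific statements instead of improvising an open-ended "why." A minimal sketch in Python -- the class and function names here are hypothetical, not from any real postmortem tool:

    # Sketch: the DRILL step as forced-choice data. Each option maps to
    # the class of problem the room is admitting to by picking it.
    # All names are illustrative, not from any real postmortem tool.
    from dataclasses import dataclass

    @dataclass
    class DrillOption:
        label: str          # "A", "B", "C"
        statement: str      # the option as read to the room
        problem_class: str  # what picking it actually implies

    OPTIONS = [
        DrillOption("A", "The team didn't know this was a risk",
                    "knowledge problem"),
        DrillOption("B", "The team knew but didn't have time to address it",
                    "prioritization problem"),
        DrillOption("C", "The team flagged it and was told to ship anyway",
                    "power-structure problem"),
    ]

    def classify(label: str) -> str:
        """Return the problem class behind the option the room picked."""
        for opt in OPTIONS:
            if opt.label == label:
                return opt.problem_class
        raise ValueError(f"no option {label!r}")

    # If the room picks C, "add monitoring" cannot be the action item.
    print(classify("C"))  # -> power-structure problem

The point isn't the code. It's that committing the options to writing in advance stops the room from drifting back to A.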
CHALLENGE. Take whatever root cause you've reached and try to break it. "We think the root cause is that we didn't have adequate monitoring." Challenge: "If we'd had perfect monitoring, would we have had the authority and incentive to act on the alert at 2am on a Friday before a launch?" If the answer is no, monitoring wasn't your root cause. It was a convenient place to stop.
FORK. Generate real options for intervention -- not from the surface incident, but from the root cause you've validated. If the root cause is "reliability work isn't valued in the promotion process," your options aren't "add monitoring." They're "restructure how reliability contributions are evaluated," "create a dedicated reliability team with its own success metrics," or "make incident prevention a required component of senior engineer expectations." These are harder. They're also the only options that prevent the next incident.
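If you want the structure to resist early closure, make each step a required field rather than an agenda item. Here's a rough sketch of that skeleton, assuming your postmortem docs live somewhere a script can touch -- the field names are illustrative, adapt them to your own system:

    # Sketch: the five-step structure as a record that refuses to close
    # until every step has real content. Field names are illustrative.
    from dataclasses import dataclass, field, fields

    @dataclass
    class Postmortem:
        state: str = ""      # one sentence everyone agreed on
        surface: list = field(default_factory=list)  # named assumptions
        drill: str = ""      # which forced-choice option the room picked
        challenge: str = ""  # what evidence would disprove the root cause
        fork: list = field(default_factory=list)     # interventions at the root cause

        def missing_steps(self) -> list:
            """Steps still empty. The postmortem closes only when this is []."""
            return [f.name for f in fields(self) if not getattr(self, f.name)]

    pm = Postmortem(state="A change reached production that shouldn't have")
    print(pm.missing_steps())  # -> ['surface', 'drill', 'challenge', 'fork']

A postmortem that can't close with an empty CHALLENGE field is a postmortem that can't stop at the comfortable answer.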
The test
After every postmortem, ask one question:
"If we do everything on this action list, will we see a structurally similar incident in six months?"
Be honest. If the answer is yes, the postmortem found a symptom, not a cause. Go back. Drill deeper. Find the thing that's uncomfortable to name.
The action items from a good postmortem feel uncomfortable. They implicate process and incentive structure, not just tooling. "Change how we evaluate reliability work in performance reviews" is uncomfortable. "Add an alert" is not. The uncomfortable action item is almost always the one that actually prevents the next incident.
If every action item on your list could be completed by a single engineer in a sprint, you haven't gone deep enough. Real root causes require organizational change, not just code changes.
Here's another way to calibrate: look at your last five postmortem action item lists. If they all look the same -- monitoring, alerts, documentation, tests -- your postmortems aren't finding root causes. They're generating the same safe outputs regardless of the input. That's not analysis. That's a template producing a template.
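You can even run that calibration mechanically: treat each past action list as a set of items and measure how much the lists overlap. A rough sketch -- the sample data and the 50% threshold are invented for illustration; point it at your own postmortem archive:

    # Sketch: flag when recent postmortems keep producing the same action
    # items. Sample data and the 50% threshold are made up; replace with
    # action lists parsed from your own postmortem docs.
    from itertools import combinations

    past_action_lists = [
        {"add monitoring", "improve documentation", "create a runbook"},
        {"add monitoring", "add alerts", "improve documentation"},
        {"create a runbook", "add alerts", "add monitoring"},
    ]

    def jaccard(a: set, b: set) -> float:
        """Overlap between two sets: 1.0 means identical, 0.0 disjoint."""
        return len(a & b) / len(a | b)

    pairs = list(combinations(past_action_lists, 2))
    avg_overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

    if avg_overlap >= 0.5:
        print(f"average overlap {avg_overlap:.0%}: the template is "
              "producing a template -- drill deeper")

It's a crude measure. But if three unrelated incidents produced interchangeable action lists, the analysis never reached the incidents.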
Try it yourself
The gallery has real excavation sessions where engineering leaders worked through incidents like this -- past the comfortable answer, to the actual cause. See the method in action, then start your own excavation.