Problem.Cockpit

Root Cause Analysis for Engineering Teams -- Beyond 5 Whys

root-cause-analysis, five-whys, engineering-process, first-principles

5 Whys is the default root cause analysis tool. It's simple, it's fast, and for straightforward problems, it works. But if you're leading an engineering team dealing with complex, systemic issues, you've probably noticed that 5 Whys breaks down exactly when you need it most.

I've used 5 Whys on hundreds of problems. Here's where it fails -- and what to do instead.

Where 5 Whys breaks

It's linear. Real problems branch. "Why did the deployment fail?" might have three parallel causes: a flaky test, a missing config, and a communication gap about who owns the deploy process. 5 Whys picks one branch and follows it. The other two remain hidden.
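To make the branching point concrete, here is a minimal Python sketch. The `Cause` class, the helper functions, and the tree contents are hypothetical, built only from the deployment example above. It illustrates how a single-path walk leaves sibling causes unvisited; it is not a tool the process prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One node in a hypothetical cause tree."""
    label: str
    children: list["Cause"] = field(default_factory=list)


def linear_walk(node: Cause) -> list[str]:
    """Follow only the first child at each level, as a linear 5 Whys does."""
    path = [node.label]
    while node.children:
        node = node.children[0]
        path.append(node.label)
    return path


def all_leaves(node: Cause) -> list[str]:
    """Collect every leaf cause by following all branches."""
    if not node.children:
        return [node.label]
    leaves = []
    for child in node.children:
        leaves.extend(all_leaves(child))
    return leaves


# Illustrative tree taken from the deployment example in the text.
tree = Cause("deployment failed", [
    Cause("flaky test"),
    Cause("missing config"),
    Cause("unclear deploy ownership"),
])

visited = set(linear_walk(tree))
hidden = [leaf for leaf in all_leaves(tree) if leaf not in visited]
print(hidden)  # the causes a single-path analysis never looked at
```

Which branch counts as "first" is arbitrary here, and that is the point: the linear walk reports one cause with no record that the others exist.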

It stops at "human error." Ask why enough times and you'll eventually land on "someone made a mistake." That's not a root cause. That's where the analysis gave up. The question isn't who made the mistake -- it's why the system allowed the mistake to have consequences.

It doesn't surface assumptions. Every "why" question carries embedded assumptions about what matters. If your first "why" is "Why did the deployment fail?" you've already assumed the deployment process is where the problem lives. What if the real issue is upstream -- in how requirements are communicated, or how the team decides what's ready to ship?

It doesn't challenge its conclusions. 5 Whys ends when you feel like you've reached the root cause. But feeling isn't knowing. Without a challenge step, confirmation bias drives the entire analysis. You find what you expected to find.

What's missing

The gap between 5 Whys and effective root cause analysis comes down to four things:

  1. Assumption surfacing -- Before you drill, examine the frame. What are you assuming about the problem?
  2. Branching -- Follow multiple threads, not just one.
  3. Pattern recognition -- Look across symptoms. Are they connected?
  4. Self-challenge -- Try to disprove your own conclusion before acting on it.

These aren't radical ideas. They're just the pieces that 5 Whys leaves out.

A seven-stage process

I built a structured excavation process that addresses each of these gaps. Here's how it maps to what you already know from 5 Whys -- and where it goes further.

STATE (5 Whys has this implicitly)

Articulate the problem as you currently understand it. This sounds obvious, but most teams skip it. They jump from "something's wrong" directly to "why?" Without a clear problem statement, the analysis drifts.

SURFACE (5 Whys doesn't do this)

Name the assumptions embedded in your problem statement. "Our deployments are unreliable" assumes deployments are the problem unit. What if the real unit is "changes" -- and some changes are unreliable regardless of how they're deployed?

This step catches framing errors before they poison the entire analysis.

DRILL (like asking why, but structured)

This is the 5 Whys equivalent, with a key difference: instead of open-ended "why?", the process presents specific options derived from the problem. Open-ended questions let people dodge uncomfortable answers. Structured options force confrontation.

Instead of "Why is this happening?" you get:

Which is closer to the truth?
  A) The team doesn't have enough time to test properly
  B) The team doesn't trust the existing tests
  C) The team doesn't agree on what "tested" means

Each option leads to a different branch. You follow the one that resonates, but you're aware the others exist.
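One way to keep the unchosen branches visible is to record them alongside the chosen one. The sketch below is a rough illustration in Python; `DrillQuestion` and its `choose` method are hypothetical names, and the options are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass
class DrillQuestion:
    """A structured drill prompt: fixed options instead of an open-ended 'why?'."""
    prompt: str
    options: dict[str, str]  # option key -> statement

    def choose(self, key: str) -> tuple[str, list[str]]:
        """Return the chosen statement plus the branches left unexplored."""
        if key not in self.options:
            raise KeyError(f"unknown option: {key}")
        unexplored = [text for k, text in self.options.items() if k != key]
        return self.options[key], unexplored


question = DrillQuestion(
    prompt="Which is closer to the truth?",
    options={
        "A": "The team doesn't have enough time to test properly",
        "B": "The team doesn't trust the existing tests",
        "C": "The team doesn't agree on what \"tested\" means",
    },
)

# Picking "B" is arbitrary; the other two stay on record for later.
chosen, unexplored = question.choose("B")
print(chosen)
print(len(unexplored), "branches noted for later")
```

Returning the unexplored options with every choice means the analysis can come back to them if the chosen branch fails the later challenge.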

PATTERN (5 Whys doesn't do this)

After drilling, step back. Look at the symptoms you started with. Do they connect? The deployment failures, the sprint overruns, the on-call burnout -- are they branches of the same root?

Pattern recognition is what turns individual problem-solving into systemic understanding. 5 Whys solves one problem at a time. Pattern recognition solves the system.
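As a rough sketch of what "do they connect?" can mean in practice, the Python below counts which causes show up under more than one symptom. The `drill_results` data is invented for illustration (the symptom names come from the text, the causes are assumptions), and `shared_causes` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical drill results: each symptom mapped to the causes its drill reached.
drill_results = {
    "deployment failures": ["tests not trusted", "shipping rewarded over reliability"],
    "sprint overruns": ["late rework", "shipping rewarded over reliability"],
    "on-call burnout": ["recurring incidents", "shipping rewarded over reliability"],
}


def shared_causes(results: dict[str, list[str]], min_symptoms: int = 2) -> list[str]:
    """Return causes that appear under at least `min_symptoms` different symptoms."""
    # set(causes) ensures a cause is counted at most once per symptom.
    counts = Counter(cause for causes in results.values() for cause in set(causes))
    return [cause for cause, n in counts.items() if n >= min_symptoms]


print(shared_causes(drill_results))
```

A cause that recurs across symptoms is only a candidate for a shared root; it still has to survive the challenge stage.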

CHALLENGE (5 Whys doesn't do this)

Before you declare a root cause, actively try to disprove it. "We think the root cause is X. What evidence would prove us wrong? Does that evidence exist?"

In my experience, this single step eliminates more false root causes than any other. Most teams skip it because finding the root cause feels like progress, and challenging it feels like going backward. It's not. It's the difference between guessing and knowing.
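The two questions in the challenge ("what evidence would prove us wrong?" and "does that evidence exist?") can be written down as a small record. This is a minimal Python sketch; the `Hypothesis` class, its status labels, and the sample evidence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hypothesis:
    """A candidate root cause plus the evidence that would prove it wrong."""
    claim: str
    # disconfirming observation -> was it observed? (None means not checked yet)
    disconfirmers: dict[str, Optional[bool]] = field(default_factory=dict)

    def status(self) -> str:
        if not self.disconfirmers:
            return "unchallenged"  # nobody asked what would prove it wrong
        if any(seen is True for seen in self.disconfirmers.values()):
            return "refuted"  # disconfirming evidence exists
        if any(seen is None for seen in self.disconfirmers.values()):
            return "unchecked"  # the question was asked but not answered
        return "survived challenge"


# Hypothetical example data.
candidate = Hypothesis(
    claim="Requirements are unclear",
    disconfirmers={
        "People outside the team read the requirements the same way": True,
        "Misreadings persist after requirements were rewritten": None,
    },
)
print(candidate.status())
```

The useful distinction is between "unchallenged" and "survived challenge": a root cause nobody tried to break carries much less weight than one that was tested and held.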

PHYSICS (the irreducible truth)

This is where 5 Whys would stop -- if it got here. The physics is the irreducible truth beneath the problem. It's the thing that, if it changed, would make all the symptoms disappear.

Physics are often uncomfortable. "The team doesn't trust each other's code" is a physics. "The incentive structure rewards shipping over reliability" is a physics. "The technical debt exists because nobody's career is advanced by paying it down" is a physics.

FORK (5 Whys stops; this continues)

5 Whys finds a cause (hopefully) and stops. But finding the cause isn't the same as deciding what to do about it. The FORK stage generates actionable paths forward -- not from the surface problem, but from the physics.

If the physics is "the incentive structure rewards shipping over reliability," the fork options might include changing how reliability work is recognized, creating a dedicated reliability role, or restructuring on-call ownership. None of these would emerge from "deployments keep failing."
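For teams that want the seven stages as a checklist, here is one possible encoding in Python. The `Stage` enum and `checklist` helper are hypothetical, and the one-line descriptions are condensed from the stage sections above. The "beyond 5 Whys" marker follows the section headings, which flag SURFACE, PATTERN, CHALLENGE, and FORK as steps 5 Whys does not perform.

```python
from enum import Enum


class Stage(Enum):
    """The seven stages, in order, with a one-line summary of each."""
    STATE = "Articulate the problem as currently understood"
    SURFACE = "Name the assumptions embedded in the problem statement"
    DRILL = "Choose between specific options instead of an open-ended why"
    PATTERN = "Check whether the starting symptoms connect"
    CHALLENGE = "Try to disprove the proposed root cause"
    PHYSICS = "State the irreducible truth beneath the problem"
    FORK = "Generate actionable paths forward from the physics"


# Stages the headings above mark as missing from 5 Whys.
NOT_IN_FIVE_WHYS = {Stage.SURFACE, Stage.PATTERN, Stage.CHALLENGE, Stage.FORK}


def checklist() -> list[str]:
    """Render the stages as numbered checklist lines."""
    lines = []
    for number, stage in enumerate(Stage, start=1):
        marker = " (beyond 5 Whys)" if stage in NOT_IN_FIVE_WHYS else ""
        lines.append(f"{number}. {stage.name}{marker}: {stage.value}")
    return lines


print("\n".join(checklist()))
```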

The challenge step changes everything

If you take one thing from this article, make it the challenge step. Before you act on any root cause finding, spend five minutes trying to break it.

"We think the root cause is that requirements are unclear."

Challenge: "What if the requirements are perfectly clear, but the team interprets them differently because they have different context about the user?"

Result: The root cause isn't unclear requirements. It's unshared context.

That five-minute challenge just redirected your entire solution. Without it, you'd have invested in better requirements documents. With it, you invest in shared context -- a fundamentally different (and more effective) intervention.

The best root cause analysis challenges its own findings before declaring them. Everything else is confirmation bias with extra steps.


Try it yourself

The gallery has real excavation sessions where CTOs worked through problems like this. See the method in action, then start your own excavation.
