How to Find the Root Cause of Recurring Engineering Problems
Your retro surfaces the same issue every sprint. Deployments break. On-call is a mess. The team "fixes" it, writes an action item, and moves on. Two sprints later, same issue, same retro, same action item. You're stuck in a loop.
I've been there. And I've learned that the loop exists because you're treating symptoms, not root causes. The stated problem is never the real problem.
The symptom trap
Here's what the loop looks like from the inside:
- Sprint 4: "Deployments keep breaking." Action item: add more tests.
- Sprint 6: "Deployments keep breaking." Action item: add deployment checklist.
- Sprint 8: "Deployments keep breaking." Action item: blame the new hire.
Each "fix" addresses what happened, not why it happened. More tests don't help if nobody trusts the test suite. A checklist doesn't help if the underlying process is broken. And blaming individuals is just a way to stop asking uncomfortable questions.
What's actually going on
The real problem lives beneath the surface. I call the bottom layer "physics" -- the irreducible truth that, once you see it, explains everything above it.
Take the deployment example. Watch what happens when you actually drill:
Deployments keep breaking. Why? Because changes ship without adequate testing. Why? Because the team doesn't trust the test suite. Why don't they trust it? Because the tests were written to pass CI, not to catch regressions. Why were they written that way? Because the original deadline pressure rewarded green builds, not coverage quality.
The physics: the team's incentive structure rewards the appearance of quality over actual quality. That's the irreducible truth. No amount of "add more tests" fixes an incentive problem.
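If it helps to see the drill written down, here is a minimal sketch of that chain as data. The questions and answers are copied from the deployment example above; the names `chain`, `candidate_root`, and `render` are hypothetical and only there for illustration. It assumes the deepest answer you have reached is the current candidate for the root cause, not a proven one.

```python
# Each tuple is (question, answer), taken from the deployment example.
chain = [
    ("Why do deployments keep breaking?",
     "Changes ship without adequate testing."),
    ("Why?",
     "The team doesn't trust the test suite."),
    ("Why don't they trust it?",
     "The tests were written to pass CI, not to catch regressions."),
    ("Why were they written that way?",
     "Deadline pressure rewarded green builds, not coverage quality."),
]


def candidate_root(chain):
    """Return the deepest answer reached so far (a candidate, not a verdict)."""
    if not chain:
        raise ValueError("drill at least one level before naming a root cause")
    return chain[-1][1]


def render(chain):
    """Indent each level so the depth of the drill is visible at a glance."""
    lines = []
    for depth, (question, answer) in enumerate(chain):
        lines.append("  " * depth + f"{question} -> {answer}")
    return "\n".join(lines)


print(render(chain))
print("Candidate root:", candidate_root(chain))
```

Writing the chain out like this makes it obvious when a retro stopped at level one: the "add more tests" action item only answers the first question.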
A structured way to get there
I built a process for this. It has seven stages, and it's designed to prevent you from stopping too early (which is where 5 Whys tends to end up) or going in circles (which is what most retros do).
- STATE -- Articulate what you think the problem is. Most people skip this and jump straight to solutions.
- SURFACE -- Name the assumptions baked into your framing. "Deployments keep breaking" assumes the deployment process is the problem. Is it?
- DRILL -- Ask why, but offer specific candidate answers derived from the problem instead of leaving the question open-ended. An open-ended "why" lets you dodge the uncomfortable answers.
- PATTERN -- What connects the symptoms? The deployment failures, the on-call burnout, the slow feature delivery -- are they branches of the same root?
- CHALLENGE -- Actively try to disprove your conclusion. If you can't break it, it's probably real.
- PHYSICS -- Name the irreducible truth. This is the thing that, if it changed, would make all the symptoms disappear.
- FORK -- What can you actually do about it? Not what's ideal. What's actionable given your constraints.
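The seven stages above can be sketched as a small checklist that refuses to let you skip ahead. This is one possible way to model it, not the tool itself: the `Stage` enum mirrors the stage names in the list, while the `Excavation` class and its methods are hypothetical names I'm assuming for illustration.

```python
from enum import Enum


class Stage(Enum):
    """The seven stages, in the order they are meant to be worked through."""
    STATE = 1
    SURFACE = 2
    DRILL = 3
    PATTERN = 4
    CHALLENGE = 5
    PHYSICS = 6
    FORK = 7


class Excavation:
    """Hypothetical session record that enforces the stage order."""

    def __init__(self):
        self.notes = {}

    def next_stage(self):
        """Return the first stage with no note yet, or None when all are done."""
        for stage in Stage:
            if stage not in self.notes:
                return stage
        return None

    def record(self, stage, note):
        """Store a note for a stage, rejecting skipped stages and empty notes."""
        expected = self.next_stage()
        if expected is None:
            raise ValueError("all seven stages are already recorded")
        if stage is not expected:
            raise ValueError(
                f"cannot record {stage.name} yet; {expected.name} comes first"
            )
        if not note.strip():
            raise ValueError(f"{stage.name} needs a written answer")
        self.notes[stage] = note

    def is_complete(self):
        return self.next_stage() is None


session = Excavation()
session.record(Stage.STATE, "Deployments keep breaking.")
session.record(Stage.SURFACE, "Assumes the deployment process is the problem.")
print("Next stage:", session.next_stage().name)
```

The ordering check is the point of the sketch: jumping straight to PHYSICS or FORK raises an error, which is the code equivalent of not letting the retro skip to action items.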
Why root causes are uncomfortable
Here's the part nobody tells you: root causes are almost never technical bugs. They're about people, processes, or incentives.
- "The test suite is bad" is a symptom. "The team was incentivized to ship fast rather than ship correctly" is a root cause.
- "Requirements are unclear" is a symptom. "The product owner doesn't have access to users" is a root cause.
- "We keep having outages" is a symptom. "Nobody owns reliability because it's not in anyone's performance review" is a root cause.
The reason recurring problems recur is that the real cause is uncomfortable enough that the team unconsciously avoids naming it. The retro becomes a ritual of symptom management.
Breaking the loop
The way out is simple but not easy: stop treating the symptom. Go deeper. Name the uncomfortable thing. Then decide what to do about it.
This doesn't require a consultant or a framework. It requires honesty and a willingness to sit with the discomfort of finding out that the problem is structural, not technical.
Every recurring engineering problem I've excavated -- every single one -- had a root cause that was obvious in retrospect and invisible in the moment. The deployment problem was an incentive problem. The velocity problem was a context problem. The quality problem was a trust problem.
The stated problem is never the real problem. The real problem is the one nobody wants to name.
Once you see the physics, you can't unsee it. And that's when things actually change.
Try it yourself
The gallery has real excavation sessions where CTOs worked through problems like this. See the method in action, then start your own excavation.