# How to Debug Production Issues Like an Operator

Production debugging is more art than science. Here's what I've learned from years of supporting live systems.

The Operator's Advantage

As an operations person, I learned to debug systems without access to code. You can't just grep a log file when you're on a customer support call. You have to think differently.

Rule 1: Reproduce First

The cardinal sin of debugging is trying to fix something you don't understand. Before you change anything, you must reproduce the issue.

This seems obvious, but people skip it constantly. They see a bug report and immediately start hypothesizing about causes. Wrong. First, reproduce. Actually see the broken state.

Rule 2: Narrow the Scope

Once you've reproduced it, narrow the scope. Is it:

- All users or specific users?
- All times or specific times?
- All regions or specific regions?
- A new issue or has it always happened?

The answers tell you where to look.
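If you can export the failure reports, even into a spreadsheet, the scoping questions above can be answered mechanically. Here is a minimal sketch in Python, assuming each report is a dict with `user`, `region`, and `hour` fields. Those field names, the sample data, and the 70% threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical failure reports; field names and values are illustrative only.
reports = [
    {"user": "u1", "region": "eu-west", "hour": 2},
    {"user": "u2", "region": "eu-west", "hour": 2},
    {"user": "u3", "region": "eu-west", "hour": 3},
    {"user": "u4", "region": "us-east", "hour": 14},
]


def concentration(reports, field):
    """Return (most common value, its share of all reports) for one dimension."""
    counts = Counter(r[field] for r in reports)
    value, count = counts.most_common(1)[0]
    return value, count / len(reports)


def narrow_scope(reports, fields, threshold=0.7):
    """Return the dimensions where a single value accounts for most failures."""
    findings = {}
    for field in fields:
        value, share = concentration(reports, field)
        if share >= threshold:
            findings[field] = (value, share)
    return findings


print(narrow_scope(reports, ["user", "region", "hour"]))
# {'region': ('eu-west', 0.75)}
```

Here the failures are spread across users and hours but concentrated in one region, so the region is where I would look first.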

Rule 3: Follow the Data

In my experience, most issues are data issues. A customer is confused because they saw data that contradicted their expectations. A system failed because data was in an inconsistent state.

Follow the data. Where did it come from? Where did it go? Where did it diverge?
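One way to make "where did it diverge?" concrete is to write down the value you see at each hop and find the first hop where it stops matching what you expected. A rough sketch, assuming you have collected the snapshots by hand; the stage names and values below are hypothetical.

```python
def first_divergence(snapshots, expected):
    """Return the name of the first stage whose value differs from expected, or None."""
    for stage, value in snapshots:
        if value != expected:
            return stage
    return None


# Hypothetical snapshots of one order total as it moves through a system.
snapshots = [
    ("api_request", 1999),
    ("orders_db", 1999),
    ("billing_export", 19.99),    # unit silently changed from cents to dollars
    ("customer_invoice", 19.99),
]

print(first_divergence(snapshots, expected=1999))
# billing_export
```

Everything upstream of the reported stage can be ruled out, which is usually most of the system.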

Rule 4: Assume Your Understanding Is Wrong

This is hard. But almost every bug I've found traced back to my misunderstanding something fundamental about how the system actually works.

Don't trust your assumptions. Test them. Verify them. Try to prove them wrong.
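It can help to write your assumptions down as explicit checks and run them against the live state, instead of carrying them in your head. A small sketch of that idea; the `config` dict and the three assumptions are invented for illustration and stand in for whatever you believe about your own system.

```python
def check_assumptions(assumptions):
    """Run each named check and return the names of the assumptions that failed."""
    failed = []
    for name, check in assumptions:
        try:
            ok = bool(check())
        except Exception:
            # A check that cannot even run is also a disproved assumption.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical settings as actually observed in production.
config = {"retry_limit": 3, "timeout_seconds": 30, "cache_enabled": False}

assumptions = [
    ("retries are enabled", lambda: config["retry_limit"] > 0),
    ("cache is on", lambda: config["cache_enabled"]),
    ("timeout is under 10s", lambda: config["timeout_seconds"] < 10),
]

print(check_assumptions(assumptions))
# ['cache is on', 'timeout is under 10s']
```

Each failed name is a place where your mental model and the system disagree, and that is a good place to start digging.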

Rule 5: Check the Boring Things First

The glamorous bugs are rare. Most issues are:

- Wrong configuration
- Permissions issue
- Race condition under load
- Off-by-one error
- Timezone handling

Check the boring things first.
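Timezone handling is a good example of how boring a real bug can look. The sketch below shows the same instant landing on two different calendar dates depending on the zone used to read it. The event time is made up, and the fixed UTC-8 offset is a simplification that ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Fixed UTC-8 offset for illustration; a real system would use a named zone.
PACIFIC_STANDARD = timezone(timedelta(hours=-8))

# Hypothetical event recorded by a server in UTC.
event_utc = datetime(2024, 3, 2, 2, 0, tzinfo=UTC)

# The same instant as a customer eight hours behind UTC would see it.
event_local = event_utc.astimezone(PACIFIC_STANDARD)

print(event_utc.date())    # 2024-03-02
print(event_local.date())  # 2024-03-01
```

A daily report keyed on the UTC date and a customer reading their local date will disagree about which day this event belongs to, and neither system is technically broken.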