On-call and the board that saved my sanity

Alex, senior software engineer and author of Alex's Whiteboard blog

On-call is a particular kind of cognitive load. You're not fully focused on any one thing. You're watching several things at once, holding the state of a system in your head, trying to keep track of what's degraded, what's been mitigated, and what you're still unsure about. That's hard to do in an app, because you're constantly switching between it, the terminal, and the dashboards.

My on-call routine now involves the board. When I pick up the rotation, I write a quick summary of the current state: what's known, what's being watched, and any incidents from the last shift I need to be aware of. It takes five minutes and it means I don't have to hold all of it in working memory. I can look at the board.

During an incident specifically, the board becomes the single source of truth. Timeline on the left. Current hypothesis in the middle. Things we've already tried on the right. When you're deep into a hard problem with three other people, having a shared visible space is the difference between working together and talking past each other.

I've done enough incidents now to know that the ones that go badly almost always have a coordination problem. People investigate the same thing independently. Hypotheses get retested. Someone rolls something back before confirming that's actually the right call. The board doesn't fix the underlying system, but it does fix the coordination, and coordination is usually where you lose the time.

At the end of an incident, before I erase the board, I always take a photo. That photo becomes the basis for the post-mortem. The timeline is already there. The dead ends are already documented. It saves an hour of trying to reconstruct what happened from memory and logs.