On-call & escalations

If you truly care about your customers, uptime isn’t optional - it’s a core product feature. Even with the best test coverage, code reviews, and QA practices in place, things will still break. That’s reality. The real question is: how fast can you detect it, respond to it, and prevent it from happening again?

This is where operational maturity begins - with smart monitoring, fast alerts, and a culture of accountability. Use canary tests and continuous synthetic monitoring to catch issues before your users do. Instrument everything. Log wisely. Set up alerts that are precise enough to avoid noise, but sensitive enough to flag real risk. If a core feature breaks, you should know before anyone else - not from a support ticket or a tweet.
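One common way to keep alerts precise without making them deaf is to page only after a synthetic check has failed several times in a row. The sketch below is a minimal illustration of that idea, not a prescribed implementation: the class name, the threshold of 3, and the sample results are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureAlert:
    """Fires once a synthetic check has failed `threshold` times in a row."""

    threshold: int = 3      # assumed value; tune per check
    failures: int = 0       # current run of consecutive failures
    firing: bool = False    # whether the alert is currently active

    def observe(self, check_passed: bool) -> bool:
        """Record one probe result; return True only when the alert newly fires."""
        if check_passed:
            # Any success resets the streak and clears the alert.
            self.failures = 0
            self.firing = False
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.firing:
            self.firing = True
            return True
        return False


# Hypothetical probe results: one blip, then a sustained outage.
alert = ConsecutiveFailureAlert(threshold=3)
results = [True, False, True, False, False, False, False]
pages = [alert.observe(ok) for ok in results]
```

The single blip never pages anyone, while the sustained failure pages exactly once. A lower threshold catches problems sooner at the cost of more noise, so the right value depends on how often the check runs and how critical the flow is.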

Then comes the human side: on-call. And here’s the truth - ownership works best when the engineers who build a system are also responsible for running it. “You build it, you run it” isn’t a punishment; it’s empowerment. It drives better design, more thoughtful deployments, and a deeper sense of accountability.

But let’s be honest - on-call is hard. So build systems that reduce pain. Rotate fairly. Balance the load. Minimize after-hours wake-ups through smarter design and better automation. If you have teams across time zones, consider a “follow-the-sun” rotation, where each region covers its own daytime, to limit disruptions. And always respect your people’s time.
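Fair rotation is easy to check when the schedule is generated rather than hand-edited. Here is a rough sketch of a weekly round-robin, assuming one primary per week. The function name, the team names, and the start date are hypothetical, and real schedules also have to handle holidays, swaps, and secondaries.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, weeks):
    """Assign one engineer per week, round-robin, beginning on `start`."""
    order = cycle(engineers)
    return [(start + timedelta(weeks=i), next(order)) for i in range(weeks)]


team = ["ana", "bo", "chen"]  # hypothetical names
schedule = build_rotation(team, date(2024, 1, 1), weeks=6)

# Count shifts per person to confirm the load is balanced.
load = Counter(person for _, person in schedule)
```

Counting shifts per person, as `load` does here, gives a quick way to spot an uneven schedule before it turns into resentment.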

Define what’s truly urgent: which core flows, which customers, and which performance thresholds deserve immediate escalation. Have clear playbooks. Provide your engineers with sharp tools, great observability, and instant access to recent deploys and logs. Speed matters - not just to stop the bleeding, but to earn trust.
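Writing the definition of "urgent" down as code, or as config, forces the team to agree on it before 3 a.m. The sketch below shows one possible shape for such a rule. Every flow name, customer name, and threshold in it is a made-up placeholder, and a real policy would likely have more than two severity levels.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PAGE = "page"      # escalate immediately
    TICKET = "ticket"  # handle during working hours


# Hypothetical values; every team has to define its own.
CORE_FLOWS = {"checkout", "login"}
PRIORITY_CUSTOMERS = {"acme"}
ERROR_RATE_LIMIT = 0.05       # fraction of failed requests
P99_LATENCY_LIMIT_MS = 2000


@dataclass
class Signal:
    flow: str
    customer: str
    error_rate: float         # 0.0 to 1.0
    p99_latency_ms: float


def classify(signal: Signal) -> Severity:
    """Page only when a core flow or priority customer is clearly degraded."""
    matters = signal.flow in CORE_FLOWS or signal.customer in PRIORITY_CUSTOMERS
    degraded = (
        signal.error_rate >= ERROR_RATE_LIMIT
        or signal.p99_latency_ms >= P99_LATENCY_LIMIT_MS
    )
    return Severity.PAGE if matters and degraded else Severity.TICKET
```

For example, `classify(Signal("checkout", "smallco", 0.10, 300))` pages because a core flow is failing, while the same error rate on a non-core flow for a non-priority customer becomes a ticket.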

After the fire’s out, the real work begins: learning. A good post-mortem is never about blame - it’s about clarity. What happened, when, and why, and how do we make sure it doesn’t happen again? Include a crisp summary, a detailed timeline, the root cause, learnings, and real action items. Link those action items to tickets. Track them. Close the loop.
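If post-mortems live in a structured form, "close the loop" becomes something you can query instead of something you hope for. The following is only a sketch of what such a record might look like. The field names mirror the list above, while the class names, the ticket ID, and the incident details are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    ticket: Optional[str] = None  # tracker reference; None means not linked yet
    done: bool = False


@dataclass
class PostMortem:
    summary: str
    timeline: List[str]
    root_cause: str
    learnings: List[str]
    action_items: List[ActionItem] = field(default_factory=list)

    def unlinked(self) -> List[ActionItem]:
        """Action items that still have no ticket attached."""
        return [a for a in self.action_items if a.ticket is None]

    def is_closed(self) -> bool:
        """True when every action item has a ticket and is marked done."""
        return all(a.ticket is not None and a.done for a in self.action_items)


# Hypothetical incident, for illustration only.
pm = PostMortem(
    summary="Checkout errors for 20 minutes after a deploy.",
    timeline=["14:02 deploy", "14:05 alert fired", "14:22 rollback complete"],
    root_cause="Missing database index on a new query.",
    learnings=["Load-test new queries before release."],
    action_items=[
        ActionItem("Add the missing index", ticket="OPS-101", done=True),
        ActionItem("Add a query load test to CI"),
    ],
)
```

Here `pm.unlinked()` returns the second action item and `pm.is_closed()` is false, which is exactly the kind of loose end a weekly review should surface.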

Great teams don’t just fix incidents - they evolve from them.