Subject Matter Amateurs On-Call

Who should be on call in a modern incident response lifecycle?


My colleague Ian Westcott and I frequently share how our incident response process works with other departments and organizations that want to replicate our success. We run over twenty custom applications in production, serving billions of requests, but have an incident response process that is relatively dull and uneventful — as it should be!

Many factors make our on-call process only slightly unpleasant to participate in (as opposed to how awful it typically can be), but for this post, I will focus on one frequent question. I get asked, “Who should be on call?”, which is followed by “Shouldn’t it be the subject matter expert (SME) for each application?”

The answer is no, absolutely not.

Abusing Subject Matter Experts

Let’s think about this for a moment. If an organization doesn’t have a mature on-call process, SMEs have no choice but to be perpetually on call. Alerts and notifications, both large and small (but mostly small), inundate these valuable team members, which can lead to incident blindness. Organizations that put their SMEs on the first line of response burn people out and have retention problems; then, not surprisingly, they have knowledge gaps when people leave.

Sharing the Burden

The on-call burden is the responsibility of the entire organization. Engineers, non-engineers, and yes, even SMEs should sometimes be on call in an equitable rotation. Being on-call doesn’t mean that you, the-person-on-call, must also be the one to resolve every issue. The person on-call is the incident acknowledger and investigator, but the person on-call doesn’t need to be the resolver.
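As a rough illustration of an equitable rotation, here is a minimal Python sketch that hands out week-long shifts round-robin. The roster names, the one-week shift length, and the build_rotation helper are all assumptions made for the example, not a description of our actual scheduling tool.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(people, start, weeks):
    """Assign one person per week, round-robin, so shifts are spread evenly."""
    roster = cycle(people)
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        schedule.append((shift_start, next(roster)))
    return schedule


# Hypothetical roster mixing engineers, non-engineers, and an SME.
people = ["engineer_a", "product_manager", "sme_payments", "designer"]
schedule = build_rotation(people, date(2024, 1, 1), weeks=8)

for shift_start, person in schedule:
    print(shift_start.isoformat(), person)
```

With four people and eight weeks, each person lands on call exactly twice, SME included.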

Runbooks and First-Tier Support

Following the steps outlined in a Runbook[1] (created by SMEs) is the primary responsibility of the on-call person acting as first-tier support. With a good Runbook in place, on-call personnel don’t need much application-specific knowledge. First-tier support[2] can attempt some basic troubleshooting, identify false positives, notify stakeholders, and escalate an issue to an SME when necessary.
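The first-tier flow described above can be sketched in a few lines of Python. Everything here is hypothetical and only meant to show the shape of the process: the Runbook and Outcome classes, the handle_alert function, and the step and contact names are invented for the example and are not taken from our tooling, AWS, or PagerDuty.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    # Hypothetical structure: steps are written by the SME ahead of time.
    alert_name: str
    steps: list              # callables that return True if the alert is resolved
    escalation_contact: str  # the SME to page when the steps run out


@dataclass
class Outcome:
    status: str  # "false_positive", "resolved", or "escalated"
    notes: list = field(default_factory=list)


def handle_alert(alert, runbook, is_false_positive, notify):
    """Acknowledge, investigate, and escalate only when the Runbook is exhausted."""
    notes = [f"acknowledged {alert}"]
    if is_false_positive(alert):
        return Outcome("false_positive", notes)
    notify(f"investigating {alert}")
    for step in runbook.steps:
        notes.append(f"ran {step.__name__}")
        if step():
            return Outcome("resolved", notes)
    notify(f"escalating {alert} to {runbook.escalation_contact}")
    return Outcome("escalated", notes)


# Hypothetical Runbook steps; both fail here, so the alert is escalated.
def restart_worker():
    return False


def clear_queue():
    return False


sent = []
runbook = Runbook("queue_backlog", [restart_worker, clear_queue], "sme_queueing")
outcome = handle_alert("queue_backlog", runbook, lambda alert: False, sent.append)
print(outcome.status, sent)
```

The person on call never needs to understand the queueing system in this sketch. They acknowledge, run the documented steps, keep stakeholders informed, and hand off to the SME only after the Runbook has nothing left to try.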

Making Incident Response Sustainable

Being on call is always going to suck, but it is more sustainable if it sucks a little for everyone instead of sucking a lot for just a few people. When more people are involved, Runbooks get the care and attention they need to become useful guides for addressing any incident. With the pain spread around, there is more motivation within the organization to create, track, and complete action items to prevent future incidents.

For more information, take a look at the links below, which describe the AWS Well-Architected Framework and some best practices from PagerDuty.


Citations

  1. AWS Well-Architected Framework, Concepts, Runbooks: https://wa.aws.amazon.com/wat.concept.runbook.en.html
  2. Laban, J. (2018). PagerDuty, On Call Best Practices: https://www.pagerduty.com/blog/on-call-best-practices-page-your-manager/

About the Author

Ryan Mahoney is the director of technology for the customer-facing technology department of the state of Massachusetts’ public transportation agency. He has spent the past two decades leading engineering teams as a founder, director, manager, and tech lead, working with brilliant engineers who make positive impacts with their work.
