Rules for a healthy on-call rotation

Here are some rules for a healthy on-call rotation, based on personal experience and benefiting from the experiences of others.

If you don't like the word “rules” then consider them to be guidelines or recommendations instead.

Some of these are practical things that you can implement immediately. Others are cultural changes that might take a while to embed in your organisation.

Note: The practice of responding to, managing, and communicating around incidents is a whole other topic, and not covered here.

Most of them are, in some way, intended to help manage information, and make sure it is clearly communicated within the on-call team and to the rest of the organisation.

It's likely incomplete; I'll add to it as I consider / remember more things.

Assumptions

There are a few assumptions in what follows:

You will need a certain level of organisational maturity in order to achieve all of these.

tl;dr

Put together an on-call team like this:

At least two sites, close to 12h apart

Do: Split the on-call responsibility amongst at least two teams, close to 12 hours apart. Or 3 teams close to 8 hours apart.

Why: Don't put people on-call 24x7 and expect them to be good at it. Someone getting paged at 3am is not in a fit state to mitigate whatever problem they've encountered, and one or more nights of badly interrupted sleep are terrible for productivity and health the next day.

Another way of thinking about this is “Don't set SLOs you are not staffed to meet”.

My experience is mostly with companies with a presence in California and Europe, and I've seen a rotation split 7am-7pm CET / 10am-10pm Pacific work pretty well (with the caveat of the few weeks each year where daylight saving changes throw everything off by an hour).
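The daylight saving caveat is easy to check with a little arithmetic. Here's a minimal sketch, assuming fixed UTC offsets (+1 for CET, +2 for CEST, -8 for PST, -7 for PDT) rather than a real time zone database. The function names are mine, purely for illustration.

```python
def shift_in_utc(local_start_hour, length_hours, utc_offset_hours):
    """Return (start, end) of a shift as hours on a UTC clock; end may exceed 24."""
    start = (local_start_hour - utc_offset_hours) % 24
    return start, start + length_hours


def coverage_gaps(europe_offset, pacific_offset):
    """Return (evening, morning) handover gaps in hours.

    0 is a clean handover, a positive number is time with nobody on-call,
    and a negative number is time where both sites are on-call.
    Assumes the 7am-7pm European and 10am-10pm Pacific shifts from the text.
    """
    europe_start, europe_end = shift_in_utc(7, 12, europe_offset)
    pacific_start, pacific_end = shift_in_utc(10, 12, pacific_offset)
    evening = pacific_start - europe_end          # Europe hands to Pacific
    morning = (europe_start + 24) - pacific_end   # Pacific hands back to Europe
    return evening, morning


print(coverage_gaps(1, -8))  # both sites on standard time -> (0, 0)
print(coverage_gaps(1, -7))  # US has changed, Europe has not -> (-1, 1)
print(coverage_gaps(2, -7))  # both sites on daylight saving -> (0, 0)
```

In the mismatched weeks one handover overlaps by an hour and the other leaves an hour uncovered, so one site needs to shift its hours temporarily.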

6-9 people per site

Do: The on-call rotation for the service in each site should be 6-9 people.

Why: 6 people as the minimum per site implies each person in the rotation in the site is going to be on-call at least once every 6 weeks.

Realistically, with team members being out sick, or on vacation, or shifts needing to be swapped around, every 5 weeks is more likely.

On-call any more frequently than that and it's extremely difficult for individuals to make progress on their project work — they're being interrupted too frequently.

You can go up to 9 (perhaps even 12) people in the rotation per site, but any more can be problematic if the rate of change in your production infrastructure is high. You risk someone going on-call without enough knowledge/context about the production infrastructure to be effective.

Primary and secondary on-call responsibilities

Do: Each week have a primary and secondary on-call responsibility.

The primary's daily responsibilities are (in rough order of priority):

The secondary exists to be the backstop / safety net for the primary if they are unavailable or unable to respond. They are also the first person to step up if the primary is dealing with a higher priority task from the previous list and a lower priority task can be dealt with.

So they should periodically — once or twice per day — set aside a few minutes to review any issues that have come in with the primary to ensure they are up to date with the state of the service, but shouldn't need to do more in a typical week.

Why: The primary on-call in each site is the person who gets paged for the service first.

The primary might be unavailable because of known issues (a commute where on-call response is impractical, child care responsibilities) or unplanned ones (technology failure).

In the case of planned unavailability it is the primary engineer's responsibility to co-ordinate with the secondary ahead of time.

In a well-running shift the secondary never receives an unexpected page.

The secondary at week N will be the primary in week N+1, allowing them to carry over the state they've absorbed from their week as secondary into the primary week.
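One way to get the "secondary this week, primary next week" property is a simple round-robin. This is only a sketch: the `weekly_schedule` function and the engineer names are hypothetical, and a real schedule also has to cope with vacations and swaps.

```python
def weekly_schedule(engineers, weeks):
    """Return a list of (week, primary, secondary) tuples.

    The secondary in week N is always the primary in week N+1.
    """
    n = len(engineers)
    return [
        (week, engineers[week % n], engineers[(week + 1) % n])
        for week in range(weeks)
    ]


for week, primary, secondary in weekly_schedule(["ana", "ben", "cho", "dev", "eli", "fay"], 4):
    print(f"week {week}: primary={primary} secondary={secondary}")
```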

Example: In one on-call team I had a ~ 40 minute commute to the office where it would be difficult to respond to a page (10 minutes on a bus, 10 minutes on a train, a 15 minute walk, and some waiting time). Whenever I was primary I would co-ordinate with the secondary to make sure we were not commuting at the same time, and would positively acknowledge to them before the start and end of each commute.

Have at least three escalation levels, maybe more

Do: The primary on-call for the service gets paged first (level one).

If the primary doesn't acknowledge the page within some time, the secondary gets paged (level two).

If the secondary doesn't acknowledge the page within some time, everyone in the on-call team in that site gets paged (level three).

Optional:

If the team in that site does not acknowledge the page, the previous primary on-call in the other site gets paged.

Note: It's the previous primary, not the next primary, because if you're unlucky this is happening on the day of a shift change, and the next primary doesn't necessarily have a lot of context about the state of production.

Then the previous secondary in the other site.

Then the on-call team in the other site.

Note: This is extremely belt and braces. In 15+ years of working in rotations like this I can count the number of times a page was missed by both the primary and secondary (and therefore fell through to the team in one site) on the fingers of one hand.
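The first three levels can be written down as data. A rough sketch follows, not any particular paging product's configuration format. The `Level` class, the function names, and the 5 minute wait are all assumptions on my part, since "some time" is for you to decide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    targets: tuple
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


def build_policy(primary, secondary, site_team, wait_minutes=5):
    """Build the three-level policy; wait_minutes=5 is an illustrative default."""
    return [
        Level("level one", (primary,), wait_minutes),
        Level("level two", (secondary,), wait_minutes),
        Level("level three", tuple(site_team), wait_minutes),
    ]


def paged_so_far(policy, minutes_unacknowledged):
    """Return the levels paged once a page has gone unacknowledged this long."""
    paged, elapsed = [], 0
    for level in policy:
        if minutes_unacknowledged < elapsed:
            break
        paged.append(level)
        elapsed += level.wait_minutes
    return paged


policy = build_policy("ana", "ben", ["ana", "ben", "cho", "dev", "eli", "fay"])
print([level.name for level in paged_so_far(policy, 7)])
```

The optional cross-site levels would just be more entries appended to the same list.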

On-call team for a service is composed of engineers for that service

Do: The on-call team is composed of engineers, and ideally drawn from the regular software engineer population.

Why: If you're writing code that might page someone, you need to be prepared to be the person that's getting paged.

This doesn't mean the team is wholly composed of product software engineers. Perhaps a 2/3rd to 1/3rd split, with product engineers doing a 6 to 9 month “tour” with the team.

Note: “service” here might cover just a single replicated server, or it might cover a whole fleet of different servers working together to provide a service to your users. It really depends on your product and its architecture.

The team doesn't exist to be on-call

Do: Recognise that the team does not exist to be on-call.

Why: I've used the phrase “on-call team” repeatedly throughout this text. It's a convenient shorthand, but it risks obscuring a fundamental truth.

This is not a team that exists to be on-call.

This is a team that exists as one of many teams working together to try and ensure the service is meeting its SLOs.

Being on-call is a tool the team uses to help it achieve the goal, but that's it. Being on-call is not the goal of this team. Helping to prevent incidents, and swiftly mitigating the ones that do occur, is.

Compensate on-call appropriately

Note: There are legal issues here. For example, I believe that in Switzerland, you need (a) special approval from the govt. before you can require an employee to work on a Sunday, and (b) employees must be able to take their accrued time off from on-call work within a certain period of time. I am not a lawyer, this is not legal advice.

On-call work should be compensated appropriately, and with no regard to whether or not any of the team were actually paged. Having to be on-call with a short response time is disruptive, even if you don't get paged.

My recommended baseline is as follows:

Suppose a 12 hour on-call shift and a 9 hour normal working day. Compensate the 3 extra on-call hours during the week days with equivalent time off at 50%, and the 12 extra on-call hours during the weekend days with equivalent time off at 100%.

I.e., someone doing one week of on-call, 12 hours per day, accrues 5 days x (3 hours @ 50%) + 2 days x (12 hours @ 100%) = 31.5 hours, or 3.5 working days.
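If your shift length or working day differ, the same sum is easy to redo. A small sketch of the calculation above; the function name and parameter names are mine, and the defaults are the figures from the baseline.

```python
def accrued_time_off(shift_hours=12, workday_hours=9,
                     weekday_rate=0.5, weekend_rate=1.0):
    """Return (hours, working_days) of time off accrued for one primary week."""
    weekday_extra = shift_hours - workday_hours   # on-call hours beyond the normal day
    hours = 5 * weekday_extra * weekday_rate + 2 * shift_hours * weekend_rate
    return hours, hours / workday_hours


print(accrued_time_off())  # (31.5, 3.5)
```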

This time is to allow people to recover from an on-call week, and catch up on activities that they were otherwise unable to do. I recommend that you strongly encourage employees to take this time relatively shortly after their on-call week ends (not necessarily in one large block), and expire any un-taken time on a rolling 9 month basis.

This is for time as the primary only. Time as the secondary is not ordinarily compensated, as it is not ordinarily that onerous.

Note: Special circumstances can occur. If there is an incident requiring many people to work over a weekend then treat it specially, and figure out how to fairly compensate everyone involved.

Don't mix and match on-call and project work

Do: When someone is primary on-call their regular project work does not exist. Their job is to be the primary on-call.

Why: Humans suck at multi-tasking. It can be tempting to try and be on-call and make progress on primary project work at the same time.

It doesn't work, and can be a significant source of stress. So don't do it.

That doesn't mean a particularly quiet on-call week sees the primary on-call sit around twiddling their thumbs wondering what to do. There's always something to do that can be safely interrupted:

and so on.

Note: On interviewing — you can, if necessary, be primary on-call and still carry out interviews. If you do then first clear it with whoever is secondary on-call and make sure they can cover for you for a period some time before the interview (however long you need to prepare) and after (however long you need to provide feedback). If this is not feasible then it is simpler to ensure the recruiters know who is on-call when, and that they are not scheduled for interviews while on-call.

Corollary: Budget for time “lost” to on-call work when project planning

Do: Make sure any project plans / expectations for how long a project will take properly account for the at-least-two-weeks anyone in the on-call rotation loses each quarter due to being on-call.

Why: Assuming a 6-person on-call rotation per site, each person is going to be primary on-call at least twice in a quarter, maybe three times. That's a lot of time where they won't be working on their other projects.

It's important this is accounted for when estimating how long work will take.

This can be particularly pernicious when a project requires multiple people from the on-call team in order to complete it, and the project repeatedly stalls for a week because the next person on the critical path is now the primary on-call.
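To put numbers on that when planning, here's a back-of-the-envelope sketch. It assumes a 13 week quarter and shifts shared evenly, and it ignores sickness, vacation, and swaps, so treat the result as optimistic. The function names are illustrative only.

```python
import math

WEEKS_PER_QUARTER = 13


def primary_weeks_per_quarter(rotation_size):
    """Return (fewest, most) whole weeks one person is primary in a quarter."""
    fewest = WEEKS_PER_QUARTER // rotation_size
    most = math.ceil(WEEKS_PER_QUARTER / rotation_size)
    return fewest, most


def project_weeks_available(rotation_size):
    """Weeks left for project work, planning for the worse case."""
    _, most = primary_weeks_per_quarter(rotation_size)
    return WEEKS_PER_QUARTER - most


print(primary_weeks_per_quarter(6))  # (2, 3)
print(project_weeks_available(6))    # 10
```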

Have a written handover at the end of each shift

Do: At the end of each shift hand over responsibility to the next primary, ideally with a short log of anything that happened on the shift, reminders about any upcoming changes, and so on.

Why: It is important the next primary on-call has sufficient knowledge about the state of the system.

I like to use something that creates a permanent record. A shared document is a good starting point.

In previous teams we wrote a small web app that prompted the user for pertinent information, as well as querying other systems to pre-populate the handoff report with e.g., “Here are the tickets that were opened during this shift”.

The primary on the other site should positively acknowledge receiving the handover.
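As a sketch of what such a record might hold (this is not the web app mentioned above, and every field and method name here is hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class Handover:
    outgoing: str
    incoming: str
    shift_log: list = field(default_factory=list)         # what happened on the shift
    upcoming_changes: list = field(default_factory=list)  # reminders for the next primary
    tickets_opened: list = field(default_factory=list)    # could be pre-populated from a ticket system
    acknowledged: bool = False

    def acknowledge(self, who):
        """Record the positive acknowledgement; only the incoming primary may give it."""
        if who != self.incoming:
            raise ValueError(f"{who} is not the incoming primary")
        self.acknowledged = True

    def render(self):
        """Return the handover as plain text suitable for a shared document."""
        lines = [f"Handover: {self.outgoing} -> {self.incoming}"]
        for title, items in (("Shift log", self.shift_log),
                             ("Upcoming changes", self.upcoming_changes),
                             ("Tickets opened", self.tickets_opened)):
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        lines.append(f"Acknowledged: {'yes' if self.acknowledged else 'no'}")
        return "\n".join(lines)


report = Handover("ana", "ben", shift_log=["Disk alert on one replica, mitigated"])
report.acknowledge("ben")
print(report.render())
```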

Have a weekly review / handoff meeting

Once a week the primary/secondary engineers are going to change.

As close to this time as possible you should have a handoff meeting, to ensure the incoming primary/secondary on-call engineers are aware of the state of production, any ongoing problems, and so on.

This is separate to any weekly “team meeting” that might exist (e.g., to discuss project work, incident trends, etc).

Invite / require other attendees. For example, if you have a platforms team, or a networking team, attendance of one person from that team can be very helpful to provide additional context to discussions about incidents involving those teams, and they can talk through any upcoming work the on-call team needs to know about.

Typical agenda:

The team owns the on-call schedule

Do: Allow the team to modify the on-call schedule as necessary, swapping shifts, adjusting cover, and so on.

Why: I've seen an anti-pattern where a team's manager (or lead) decides any changes to the on-call schedule should be vetted or approved by them.

Do not do this. It adds no value to the process.

As long as there is a mechanism that accurately records who is primary/secondary at a particular time (so the alerts can be delivered correctly, and compensation can be calculated), let the team modify the schedule as necessary.

For example, if the primary discovers there's an afternoon where they are unable to be primary, it's up to them to work with the rest of the team to arrange cover — typically the secondary would step up, and someone else would take over the role as secondary. The team should be perfectly capable of doing this without a manager needing to approve any changes.
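The "mechanism that accurately records who is primary/secondary" can be very simple. One possible sketch is an append-only list where later entries override earlier ones, so the original schedule and every swap stay on record. The `Assignment` class, the `on_call_at` function, and the names and dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Assignment:
    start: datetime
    end: datetime
    primary: str
    secondary: str


def on_call_at(assignments, when):
    """Return the most recently added assignment covering `when`, or None."""
    for assignment in reversed(assignments):
        if assignment.start <= when < assignment.end:
            return assignment
    return None


schedule = [
    # The planned week.
    Assignment(datetime(2024, 1, 8, 7), datetime(2024, 1, 15, 7), "ana", "ben"),
]
# Ana can't cover Wednesday afternoon: Ben steps up and Cho becomes secondary.
schedule.append(
    Assignment(datetime(2024, 1, 10, 12), datetime(2024, 1, 10, 19), "ben", "cho")
)

print(on_call_at(schedule, datetime(2024, 1, 10, 13)))
```

Nothing is ever deleted, so the same list can feed both alert routing and the compensation calculation.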