How to Survive Your First On-Call Shift as a Software Engineer

The first time my phone buzzed with a PagerDuty alert, I was three months into my first software engineering job, alone in a sublet apartment, and absolutely convinced I had broken production. My first on-call shift as a software engineer taught me that the panic is normal, the runbooks are sacred, and almost every page can be handled by a calm person with a checklist. This guide is the one I wish someone had handed me before that first rotation: what to set up before you go on call, how to triage an alert at 3 AM without making things worse, and what to do the next morning so the same page does not wake you up next week. If you are about to start your first rotation, save this and breathe.

. . .

Key Takeaways

  • Before you go on call, install PagerDuty or Opsgenie on your phone, test it with a fake alert, and confirm you can VPN in from your laptop in under five minutes.
  • Read every runbook for services you own, and if a service does not have one, write a stub before your shift starts.
  • When a page fires, the order is: acknowledge, assess scope, mitigate, then investigate. Do not skip steps two and three to jump to four.
  • It is always okay to escalate. The senior engineer on backup expects to be paged and will not be annoyed.
  • The next morning, write a short postmortem note, even if you fixed it in five minutes. Future you will thank present you.

. . .

The Day Before: Setup That Actually Matters

The shift starts before the shift starts.

Laptop, coffee, and notebook on a desk before an on-call shift

I spent the afternoon before my first rotation in a coffee shop in Cambridge, opening every dashboard I might need and bookmarking them in a Chrome folder called ONCALL. Looking back, that small piece of setup probably saved me twice over the next week.

Here is what I now do the day before any rotation, no matter how senior the team gets.

Install the paging app on your phone, whether that is PagerDuty, Opsgenie, or whatever your company uses. Let that one app bypass Do Not Disturb. Send yourself a test page and confirm it actually wakes you up at full volume. I know one engineer who slept through her first page because her phone was on silent in the kitchen. Do not be that engineer.
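
If you would rather script the test than dig through the app settings, here is a minimal sketch that fires a test page through PagerDuty's Events API v2 (Opsgenie has a similar alerts API). The routing key is a placeholder: use the integration key of a test service that pages only you.

    # Send yourself a test page through the PagerDuty Events API v2.
    # ROUTING_KEY is a placeholder -- use the integration key of a test
    # service that pages only you, never the real production service.
    import requests

    ROUTING_KEY = "your-32-char-integration-key"

    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": "Test page: confirming my phone actually rings",
                "source": "oncall-prep-script",
                "severity": "info",
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    print(response.json())  # expect a status of "success"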

Test your VPN. Then test it again from your laptop in bed, because sometimes corporate VPN clients behave differently on home or hotel wifi than they do on the office network. If your VPN needs a hardware token or a 2FA app, charge those devices and put them on your bedside table.
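
One low-tech way to make that check repeatable: try to reach a host that only resolves on the corporate network. This is a sketch, and the hostname is made up; swap in a real internal host from your own setup.

    # If this connection succeeds, the VPN tunnel is up; if DNS fails or the
    # connection times out, you are not really on the corporate network.
    # "wiki.internal.example.com" is a placeholder for a real internal host.
    import socket

    try:
        socket.create_connection(("wiki.internal.example.com", 443), timeout=5)
        print("VPN looks good")
    except OSError as exc:
        print(f"VPN check failed: {exc}")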

Open every dashboard (Grafana, Datadog, Honeycomb, CloudWatch, whatever you use) and confirm you can actually log in. Save the URLs in a single browser folder. When a page hits at 3 AM, you do not want to be hunting for the right Looker workspace.
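
If you want to go one step further than bookmarks, a tiny script can confirm every dashboard at least answers before your shift starts. The URLs below are placeholders for whatever lives in your own ONCALL folder.

    # Confirm each bookmarked dashboard resolves and responds. A 200 or a
    # redirect to a login page both count; the point is catching dead URLs
    # and DNS problems now, not at 3 AM.
    import requests

    DASHBOARDS = [
        "https://grafana.example.com/d/service-overview",  # placeholder
        "https://app.datadoghq.com/dashboard/abc-123",     # placeholder
    ]

    for url in DASHBOARDS:
        try:
            r = requests.get(url, timeout=10)
            print(f"{r.status_code}  {url}")
        except requests.RequestException as exc:
            print(f"FAIL  {url}  ({exc})")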

Read the runbook for every service you own. If a runbook is missing or feels thin, this is the perfect moment to add a paragraph. Even a sentence that says "if this alert fires, check the Redis connection pool first" is gold to the next person on call.
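
If you need a starting point, a stub can be as small as this. Every name in it is invented; the shape is what matters.

    Alert: payment-api-high-error-rate
    Meaning: 5xx rate on payment-api above threshold for five minutes.
    First checks: Redis connection pool, recent deploys, upstream gateway status.
    Mitigation: roll back the latest deploy; restart workers if Redis is saturated.
    Escalate to: the backup engineer on the escalation policy.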

Finally, find out who your backup is and how to reach them. Most pager systems have an escalation policy with a senior engineer on standby. Know their name, know their Slack handle, and know that you are absolutely allowed to wake them up if you need to.

. . .

When the Page Fires: The Four Steps I Wish I Had Known

Acknowledge, assess, mitigate, investigate. In that order.

The first time my pager went off, I made every mistake possible. I jumped straight into debugging without checking how many users were affected. I tried three fixes at once. I forgot to silence the alert, so it kept paging me every five minutes while I worked. By the time the senior engineer on backup checked in twenty minutes later, I was a sweaty mess and the issue had resolved on its own.

Now I follow a four step rhythm, and it has saved me every time since.

Step one, acknowledge. Tap the button that tells the system you are awake and on it. This stops the page from escalating to your backup, and it starts the clock on response time. It does not commit you to a fix yet. It just means a human is now looking.

Step two, assess scope. Before touching anything, ask three questions. How many users are affected? Is this getting worse, getting better, or steady? Is anything else broken downstream? Open the relevant dashboards. Look at error rates over the last hour, not the last five minutes. A spike that is already trending down is very different from a flat line at a high error rate.
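
What "look at the last hour" means in practice depends on your stack. As one hedged example, assuming CloudWatch in front of an ALB, this pulls an hour of 5xx counts in five minute buckets so you see the trend rather than the single spike that paged you. The load balancer name is a placeholder.

    # Fetch the last hour of 5xx counts in five-minute buckets so the trend
    # is visible. Namespace, metric, and dimensions are for an AWS ALB; swap
    # in whatever your service actually emits.
    from datetime import datetime, timedelta, timezone
    import boto3

    cw = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)

    stats = cw.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName="HTTPCode_Target_5XX_Count",
        Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}],  # placeholder
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Sum"],
    )

    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"])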

Step three, mitigate. Mitigation is not the same as a fix. If a deploy thirty minutes ago broke things, the mitigation is a rollback, not a code patch. If a single instance is misbehaving, the mitigation is to restart it or remove it from the load balancer. Stabilize first, then debug.
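
As a concrete sketch of the second case, assuming an AWS target group, this takes one bad instance out of rotation so traffic stops hitting it while you debug. The ARN and instance id are placeholders.

    # Mitigation, not a fix: deregister the misbehaving instance from the
    # target group so the load balancer stops sending it traffic.
    import boto3

    elbv2 = boto3.client("elbv2")
    elbv2.deregister_targets(
        TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/payments/abc",  # placeholder
        Targets=[{"Id": "i-0123456789abcdef0"}],  # the bad instance
    )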

Step four, investigate. Now you can read logs, trace requests, and figure out what actually happened. This is the fun part. Save it for last.

. . .

The Most Useful Sentence in On Call: I Need a Second Pair of Eyes

Escalation is a skill, not a failure.

Two software engineers pair debugging on a laptop screen

For my first three rotations I treated escalating to my backup as some kind of personal defeat, like I had failed a test. Then a staff engineer named Priya told me something that changed how I think about on call.

She said, "When you escalate, you are not asking me to fix your problem. You are inviting me to learn the system with you. That is literally my job."

If you do not know what an alert means, escalate. If you have tried two things and neither worked, escalate. If the dashboard shows something you have never seen before, escalate. If you are starting to feel panicky, escalate.

A good senior engineer would much rather be paged at 2 AM for a real incident than wake up to a five hour outage you tried to solve alone. Many teams have an explicit on call culture document that says exactly this. If yours does not, ask a senior engineer about it during onboarding. The answer is almost always "page me anytime."

Slack is also your friend during business hours. Drop a quick message in your team channel before you go heads down, something like "alert firing for payment service, looking into it." Two things happen: someone might have context you do not, and the rest of the team knows where you are if you go quiet.

. . .

What to Do Between Pages

Boredom is a feature, not a bug.

On a healthy team, most on call shifts are quiet. You might go a full week with no real pages. That is the goal. It does not mean you are not doing anything.

I use the quieter parts of on call to do what we call alert hygiene. I look at every alert that fired in the last month, even the ones that auto resolved. For each one I ask, was this actionable? Could a human have done anything? Did the runbook actually help? If the answer to any of those is no, I file a ticket to fix it.
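
If your pager is PagerDuty, its REST API makes this monthly review easy to script. A sketch, assuming an API key with read access; it groups the last thirty days of incidents by title so you can ask those questions once per alert instead of once per page.

    # Count the last 30 days of pages by alert title. Grabs the first 100
    # incidents only; paginate with the offset parameter for a full month.
    from collections import Counter
    from datetime import datetime, timedelta, timezone
    import requests

    API_KEY = "your-pagerduty-api-key"  # placeholder
    since = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()

    resp = requests.get(
        "https://api.pagerduty.com/incidents",
        headers={
            "Authorization": f"Token token={API_KEY}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        params={"since": since, "limit": 100},
        timeout=10,
    )
    resp.raise_for_status()

    counts = Counter(i["title"] for i in resp.json()["incidents"])
    for title, n in counts.most_common():
        print(f"{n:3d}  {title}")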

I also pick up small on call only tasks, things like adding logging to a confusing code path, writing or updating a runbook, or improving a flaky alert threshold. These are exactly the kind of changes that make next quarter's on call shifts easier, and they are perfect to do when you cannot start a big project because you might get paged in fifteen minutes.

If your team uses something like incident.io or FireHydrant, take a quiet hour to walk through the incident tooling. Practice opening an incident, declaring severity, and writing a status update. The first time you do these for real, you do not want it to also be the first time you have seen the interface.

. . .

The Morning After: Write the Note You Wish Someone Had Written

Five minutes now saves five hours later.

After every page, even the ones that resolved themselves before you logged in, write a short note. Not a formal postmortem, just a paragraph or two in a doc your team can find. Mine has four sections.

What alerted. The exact alert name, the time, what triggered it.

What I checked. The dashboards I looked at, the queries I ran, the metrics that helped me.

What I did. Including the things that did not work. Especially the things that did not work.

What should change. A runbook tweak, an alert threshold to adjust, a piece of missing documentation, a question for the team.

After three months of these notes, my team had a small library of real, lived incident lore that was infinitely more useful than the official runbooks alone. Junior engineers reading these notes learn how senior engineers actually think under pressure. Senior engineers reading these notes catch patterns nobody else would see.

If your team does proper postmortems for bigger incidents, your morning note becomes the rough draft. You will be very grateful you wrote it.

. . .

What I Tell Every New Engineer Before Their First Rotation

You will be okay.

The first on call shift is a rite of passage. The anxiety is real. The fear that you will make something worse is also real, because sometimes you will, in small ways. That is fine. Production systems are resilient, your teammates are kind, and almost every incident is recoverable.

The best on call engineers I know are not the ones who fix things fastest. They are the ones who stay calm, follow the playbook, and write things down. That is it. Calm, methodical, well documented. You can do that.

Set up your environment the day before. Acknowledge fast, mitigate before you debug, escalate without shame. Write the note. Pet a cat if you have one, or pace around your apartment if you do not.

You are going to be a great on call engineer.

. . .

FAQ

How long is a typical on call rotation for software engineers?

Most teams run weekly rotations, Monday to Monday, with a primary engineer and a backup. Some teams do shorter rotations, like four days, especially if pages are heavy. Ask your team during onboarding so you can plan your sleep and social calendar around it.

What do I do if I get paged for something I do not own?

Acknowledge the page so it stops escalating, do a quick check to see whether the alert is a false alarm or simply misrouted, then reach out to the owning team via Slack or your incident tool. If you cannot find the owning team within ten minutes, escalate to your backup. Misrouted pages are common and not your fault.

Should I work on regular tickets while I am on call?

Pick small, self contained tasks you can drop in five minutes. Avoid anything that requires deep focus or long stretches of coding. Many engineers use on call weeks for code review, documentation, runbook updates, and alert hygiene.

What if I cannot fix the issue at all?

That is what escalation is for. Page your backup, explain what you have tried, and stay on the call to learn from how they handle it. The point of on call is not to be a hero. It is to be a calm first responder.

Do I really need to write a note for every alert?

Yes, even the quick ones. The fifteen second alert that auto resolved at 3 AM might be a symptom of something bigger that hits at 3 PM next week. Future you wants past you to have written it down.

. . .

Keep Reading

If you liked this, you might enjoy Lessons from Debugging as a Junior Software Engineer, where I unpack a debugging story that almost made me quit, or How to Survive Your First Code Review as a Junior Developer, a companion piece for those of us still figuring this out.

What was your first on call story? Drop it in the comments. I read every one.

Follow me here for more honest writing on tech life, travel, and figuring it out as we go.

Areej Asif

CS grad and skincare obsessive who travels often. I write about tech, travel, cooking, and the messy art of growing up.
