On call

Not even a doctor


Somehow, despite not being physicians, technology professionals ended up carrying digital pagers and keeping an on-call schedule.

Keeping the fine details vague for obvious reasons.

This was a fun 24 hours.

Shift starts.

Quiet day, nothing outside the ordinary.

Then as the evening is winding down, there’s an alert. One of our monitors failed. Pull up the dashboards, don’t see anything that looks off.

Figure it’s nothing major and make a note on my whiteboard to bring it up at standup in the morning.

Next morning, get to my desk and see someone post in the team-chat that the monitor is still failing.

A couple of minutes later I’m added to a group chat with account managers because their customers lost access.

Quickly hit up everyone in the wirearchy that’s related to how our app handles permissions.

Get bounced through a couple of people to ultimately find out the permissions team made a breaking change.

Obvious question: why weren’t we informed?

Turns out they were in constant communication with the teams they determined would be impacted. Unfortunately, ours was missed because we weren’t following the standard integration pattern.

I met with the dev who implemented the change to figure out what we could do on our end, because we didn’t want to immediately escalate to a skip-level who could force them to roll back. Took a while to build those bridges, not worth burning them down right away.

Turns out what we were doing was legacy as of last night. If we wanted to keep using the service, we’d have to get the user’s full permissions data, and then compute what access they have. Not necessarily a bad idea.
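For a rough idea of what that meant, here’s a minimal Python sketch. Every name in it (the shape of the permissions data, `has_access`, the permission ids) is made up for illustration and is not the real service’s API.

```python
# Hypothetical sketch: the record shape and ids below are invented for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class PermissionGrant:
    permission_id: str
    active: bool


def has_access(grants: list[PermissionGrant], required_id: str) -> bool:
    """Compute access locally from the user's full list of grants."""
    return any(g.permission_id == required_id and g.active for g in grants)


# The caller fetches every grant for the user, then decides locally.
grants = [
    PermissionGrant("reports.view", active=True),
    PermissionGrant("reports.export", active=False),
]
print(has_access(grants, "reports.view"))    # True
print(has_access(grants, "reports.export"))  # False
```

The point is where the decision lives: the service hands over raw data and the app owns the logic, which is exactly the part we had been glad to give up.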

But not happy with it because the entire reason we migrated over to the service was to stop doing just that from a different datasource. On paper though, the solution was sound because we’d used it before.

I took this back to my tech lead because this was one of those companies that didn’t like altering stories mid-sprint, and this was turning into something that would need time to properly test after implementing. Which is kinda hard to do when there are already customers impacted.

They’re not thrilled about it either. So we bring in product and business. Explained it in simple terms. Everything in the app is working as normal, but there’s a downstream API call to a service that used to tell us whether the user has the right permissions or not. And in even simpler terms, the permission id we were checking for is now legacy.

Then explained the potential fix, effectively going back to the clunky business logic we used to have.

I forget who, but someone asked: if the new process is a better long-term experience for customers, and everyone but our subset has already onboarded to it, then why don’t we as well? Before, customers had to onboard to each individual permission they wanted. Now there was a base permission that all the individual ones were derived from.

So we update the config to use the new id. My favorite kind of fix.
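Roughly what that kind of change looks like, as a hedged Python sketch. The ids, the config key, `check_permission`, and the fake service are all hypothetical stand-ins; the real service and its responses aren’t shown here.

```python
# Hypothetical sketch: ids, config key, and the fake service are invented.
LEGACY_PERMISSION_ID = "feature-x.access"  # what we used to check (made-up id)
BASE_PERMISSION_ID = "base.access"         # the new base permission (made-up id)

# The entire fix: point the config at the new id.
CONFIG = {"required_permission_id": BASE_PERMISSION_ID}


def check_permission(service_lookup, user_id: str, config: dict) -> bool:
    """Ask the permissions service whether the user holds the configured id."""
    return service_lookup(user_id, config["required_permission_id"])


def fake_service(user_id: str, permission_id: str) -> bool:
    """Stand-in for the downstream service, which now only knows the base id."""
    onboarded = {"alice": {BASE_PERMISSION_ID}}
    return permission_id in onboarded.get(user_id, set())


legacy_config = {"required_permission_id": LEGACY_PERMISSION_ID}
print(check_permission(fake_service, "alice", CONFIG))         # True
print(check_permission(fake_service, "alice", legacy_config))  # False
```

No logic changes, just a different value behind the same lookup.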

Merge the PR. Cut a production release. Should be in the clear.

Nope. Hit some failures going through the pipeline because of course there has to be friction.

But ultimately it deployed successfully.

Had plans that evening, cancelled, but crisis averted.

Morning comes around and the account managers are informed. They then inform their customers about the changes and walk them through onboarding.

Had to talk to some managers about what happened as I made my way through the first cup of coffee.

No overly dramatic escalations. Avoidable for sure, measuring twice and all that. But ultimately it boiled down to an honest gap in communication.