Behavioral
medium
mid
Tell me about a time when a critical production bug broke the UI — how did you handle it?
Same as the critical-production-bug question: STAR with an incident spine — mitigate first (rollback/flag/hotfix), communicate early, diagnose with error tracking and the recent diff, fix forward, then add a regression test, monitoring, and a blameless postmortem so it can't recur.
5 min read · ~8 min to think through
This is the standard production-incident behavioral. Answer it in STAR, structured around the incident-response lifecycle.
The lifecycle to hang your story on
- Mitigate first — rollback, feature-flag off, or fast hotfix. Restore users before debugging (see the kill-switch sketch after this list).
- Communicate — tell stakeholders/support the impact and ETA right away.
- Diagnose with data — error tracking, logs, "what changed in the last deploy."
- Fix forward — proper, reviewed fix once stable.
- Prevent recurrence — regression test, alert/monitoring, blameless postmortem.
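To make the mitigation step concrete, here is a minimal kill-switch sketch in TypeScript, assuming a hypothetical `flags` client and flag name; the point is that flipping the flag off restores users immediately, before any debugging starts.

```tsx
import React from 'react';

// Assumed flag-client shape; substitute your real provider
// (LaunchDarkly, Unleash, a config service, etc.).
interface FlagClient {
  isEnabled(flag: string): boolean;
}

declare const flags: FlagClient;

// The risky new panel renders only while the flag is on. Turning the flag
// off in the flag service restores the known-good UI instantly, no redeploy.
export function DashboardPanel(props: {
  legacy: React.ReactNode;
  next: React.ReactNode;
}) {
  return <>{flags.isEnabled('new-dashboard-panel') ? props.next : props.legacy}</>;
}
```

This is why a flag flip is often faster than a rollback: it reverts one surface area without touching the rest of the release.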
STAR example
Situation: A release made a date library throw on certain locales, crashing the dashboard for international users. There was no error boundary, so the whole page went white.
Task: I owned the dashboard; I needed it back fast and protected against this class of failure.
Action: I correlated the Sentry spike to the deploy and rolled back in ~10 minutes. I updated the incident channel with impact (locale-specific, ~8% of users) and status. I reproduced the crash by switching my locale, found the unguarded date call, fixed the formatting, and wrapped major dashboard regions in error boundaries with fallback UI so one bad component can't blank the page.
Result: Users were impacted for ~14 minutes. The real fix shipped the next day with a regression test covering multiple locales, plus a per-region error boundary rollout across the app. Postmortem action item: error boundaries became a code-review checklist item.
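A minimal sketch of the two code-level fixes from this story: a per-region error boundary and a guarded date formatter. Component and function names are illustrative, not from a real codebase; the error-boundary API (getDerivedStateFromError / componentDidCatch) is standard React.

```tsx
import React from 'react';

// 1) Per-region error boundary: if one dashboard widget throws during render,
//    only that region shows the fallback; the rest of the page stays up.
//    (React error boundaries must be class components.)
class RegionErrorBoundary extends React.Component<
  { fallback: React.ReactNode; children: React.ReactNode },
  { hasError: boolean }
> {
  state = { hasError: false };

  static getDerivedStateFromError() {
    return { hasError: true };
  }

  componentDidCatch(error: Error) {
    // Report to your error tracker (e.g. Sentry) so the alert fires
    // before users file tickets.
    console.error('Dashboard region crashed:', error);
  }

  render() {
    return this.state.hasError ? this.props.fallback : this.props.children;
  }
}

// 2) Guarded date formatting: a locale quirk or bad value should never
//    throw during render.
function formatDate(value: string, locale: string): string {
  try {
    return new Intl.DateTimeFormat(locale, { dateStyle: 'medium' }).format(
      new Date(value),
    );
  } catch {
    // Fall back to the raw value rather than crashing the component tree.
    return value;
  }
}
```

In the dashboard, each region would be wrapped as `<RegionErrorBoundary fallback={...}>...</RegionErrorBoundary>`, and the locale regression test simply asserts that `formatDate` never throws for the locales that caused the incident.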
What earns points
- Restore first, debug second.
- Proactive communication during the incident.
- Data-driven root cause (it's almost always "the last deploy").
- A systemic fix (error boundaries, monitoring), not just the one-line bug fix.
- Blameless tone throughout.
Senior framing
The differentiator is showing you handled an incident, not just a bug: triage → mitigate → communicate → diagnose → prevent. And that you left the system more resilient than you found it — error boundaries so a single component failure degrades gracefully instead of taking down the page, plus an alert so you hear about it before users do.
Follow-up questions
- Why are error boundaries important here, and what are their limits?
- How do you decide rollback vs. fix-forward?
- What goes into a good postmortem?
Common mistakes
- Debugging live instead of mitigating first.
- No communication step in the story.
- Ending at the bug fix with no prevention.
- Blaming the engineer who shipped it.
Edge cases
- Error boundaries don't catch errors in event handlers, async code, or SSR — mention this nuance (see the sketch after this list).
- If a rollback wasn't possible, explain the careful fix-forward path.
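A small sketch of that first limitation, with hypothetical helper names: a throw inside an event handler or async callback never reaches an error boundary, so it needs its own try/catch and reporting.

```ts
// Error boundaries only catch errors thrown during render and lifecycle
// methods. Errors in event handlers and async code must be handled
// explicitly. All names below are illustrative stand-ins, not a real API.
declare function exportDashboardCsv(): Promise<void>;
declare function reportError(error: unknown): void;
declare function showToast(message: string): void;

export async function handleExport(): Promise<void> {
  try {
    await exportDashboardCsv();
  } catch (error) {
    reportError(error); // still reaches error tracking
    showToast('Export failed. Please try again.');
  }
}
```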
Real-world examples
- White-screen-of-death from an unguarded render error, locale/timezone crashes, third-party script failures.