Behavioral: a critical production bug broke the UI — how did you handle it?
Tell it in STAR with an incident-response spine: stop the bleeding first (rollback / feature flag / hotfix), communicate early and often to stakeholders, find root cause with data (logs, error tracking, the diff), then fix forward and add a regression test + monitoring so it can't recur. Show calm, ownership, and a blameless postmortem mindset.
Production-incident questions test whether you stay calm, methodical, and accountable under pressure — and whether you fix the system, not just the symptom.
The incident-response spine
- Stop the bleeding first. The priority is restoring service for users, not finding who's to blame. Roll back the deploy, flip the feature flag off, or push a targeted hotfix — whichever is fastest (a flag-based kill switch is sketched after this list).
- Communicate immediately. Tell stakeholders/support what's broken, who's affected, and the ETA. Silence is worse than bad news.
- Diagnose with data. Error tracking (Sentry), logs, the recent diff, "what changed in the last deploy." Reproduce it. (Release tagging, sketched below, makes the deploy correlation immediate.)
- Fix forward once stable — proper fix, reviewed, deployed.
- Prevent recurrence. Regression test, monitoring/alert, and a blameless postmortem.
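For the feature-flag path, mitigation can be as small as a render-time switch. A minimal kill-switch sketch in TypeScript/React; the `flags` client and the component names are hypothetical stand-ins for whatever flag service (LaunchDarkly, Unleash, a config endpoint) and checkout implementations a real codebase has:

```tsx
import React from "react";
import { flags } from "./feature-flags";           // hypothetical flag client
import { CheckoutV2 } from "./CheckoutV2";         // the new (suspect) code path
import { CheckoutLegacy } from "./CheckoutLegacy"; // the known-good code path

export function CheckoutPage(): React.ReactElement {
  // Flipping "checkout-v2" off in the flag dashboard reroutes every user to
  // the legacy path within seconds: mitigation with no deploy and no rollback.
  return flags.isEnabled("checkout-v2") ? <CheckoutV2 /> : <CheckoutLegacy />;
}
```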
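Release tagging is what lets you correlate an error spike with a specific deploy at a glance rather than by detective work. A sketch with `@sentry/browser`; the DSN and version string are placeholders:

```ts
import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
  release: "checkout@1.42.0", // stamp every error event with the deployed version
  environment: "production",
});
// In the Sentry UI, errors now group by release, so a spike that starts at
// checkout@1.42.0 answers "what changed in the last deploy" immediately.
```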
STAR example
Situation: After a Friday deploy, the checkout page went blank for ~15% of users — a third-party script started throwing inside a component with no error boundary.
Task: I was on call; my job was to restore checkout fast and prevent it from happening again.
Action: I checked Sentry, saw the error spike correlated exactly with the deploy, and rolled back within ~8 minutes — checkout was restored. I posted in the incident channel with impact and status. Then I reproduced the bug locally, found the unguarded third-party call, wrapped that area in an error boundary with a fallback (sketched below), made the script load non-blocking, and added a Sentry alert on checkout-page error rate.
Result: Total user impact was ~12 minutes. We shipped the real fix Monday with a regression test (example below). In the postmortem we made "error boundary around third-party code" and "no Friday-afternoon deploys to checkout" team standards.
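The containment piece of the fix is a standard React error boundary. A minimal sketch (types and fallback copy are illustrative); boundaries must be class components, as React provides no hook equivalent:

```tsx
import React from "react";

type Props = { fallback: React.ReactNode; children: React.ReactNode };
type State = { hasError: boolean };

export class ErrorBoundary extends React.Component<Props, State> {
  state: State = { hasError: false };

  // After a child throws during render, re-render with the fallback instead.
  static getDerivedStateFromError(): State {
    return { hasError: true };
  }

  componentDidCatch(error: Error, info: React.ErrorInfo): void {
    // Report the failure rather than letting React unmount the whole tree
    // to a blank page; swap in Sentry.captureException(error) if wired up.
    console.error("Contained render error:", error, info.componentStack);
  }

  render(): React.ReactNode {
    return this.state.hasError ? this.props.fallback : this.props.children;
  }
}
```

Wrapping only the third-party widget (`<ErrorBoundary fallback={...}><ThirdPartyWidget /></ErrorBoundary>`) means a script failure degrades one panel instead of blanking the route, and loading that script with `async`/`defer` keeps it off the critical render path.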
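The regression test then locks the behavior in: render a child that throws the way the third-party script did, and assert the fallback appears. A sketch with Jest and React Testing Library, reusing the `ErrorBoundary` above:

```tsx
import React from "react";
import { render, screen } from "@testing-library/react";
import { ErrorBoundary } from "./ErrorBoundary";

// Stand-in for the flaky third-party component from the incident.
function Exploding(): never {
  throw new Error("third-party script failure");
}

test("checkout shows a fallback, not a blank page, when a child throws", () => {
  // Silence React's expected console.error for this intentional throw.
  jest.spyOn(console, "error").mockImplementation(() => {});

  render(
    <ErrorBoundary fallback={<p>Payments are temporarily unavailable.</p>}>
      <Exploding />
    </ErrorBoundary>
  );

  // getByText throws if the fallback did not render.
  expect(screen.getByText("Payments are temporarily unavailable.")).toBeTruthy();
});
```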
What interviewers listen for
- Mitigate before diagnose — you don't debug a live outage while users suffer.
- Communication is a first-class action, not an afterthought.
- You used data (error tracking, the diff) to find root cause fast.
- Blameless — "the system let this happen," not "X person broke it."
- The story ends with systemic prevention, not just "I fixed the bug."
Senior framing
Juniors describe debugging. Seniors describe incident management: triage, mitigate, communicate, diagnose, prevent. The strongest answers show you turned one incident into lasting guardrails (error boundaries, alerts, deploy policy) — that's converting a bad day into permanent resilience.
Follow-up questions
- When do you roll back vs. fix forward?
- What is a blameless postmortem and why does it matter?
- How do error boundaries help contain UI failures?
- What monitoring would you add after this incident?
Common mistakes
- Trying to debug the root cause while the site is still down instead of rolling back.
- Not communicating with stakeholders during the incident.
- Stopping at "I fixed the bug" with no regression test or monitoring.
- Blaming a person instead of the process/system.
Edge cases
- Rollback isn't possible (a DB migration shipped with it) — you must fix forward carefully.
- Bug is in a third-party dependency — mitigate with a feature flag / error boundary while waiting on upstream.
Real-world examples
- Bad deploy blanks a page, a CDN/asset fails to load, a third-party script breaks, an API contract change crashes the UI.