Behavioral: a critical production bug broke the UI — how did you handle it?
Tell it in STAR with an incident-response spine: stop the bleeding first (rollback / feature flag / hotfix), communicate early and often to stakeholders, find root cause with data (logs, error tracking, the diff), then fix forward and add a regression test + monitoring so it can't recur. Show calm, ownership, and a blameless postmortem mindset.
Production-incident questions test whether you stay calm, methodical, and accountable under pressure — and whether you fix the system, not just the symptom.
The incident-response spine
- Stop the bleeding first. The priority is restoring service for users, not finding who's to blame. Roll back the deploy, flip the feature flag off, or push a targeted hotfix — whichever is fastest (a flag-based kill switch is sketched after this list).
- Communicate immediately. Tell stakeholders/support what's broken, who's affected, and the ETA. Silence is worse than bad news.
- Diagnose with data. Error tracking (Sentry), logs, the recent diff, "what changed in the last deploy." Reproduce it. (Release tagging, sketched below, makes the deploy correlation immediate.)
- Fix forward once stable — proper fix, reviewed, deployed.
- Prevent recurrence. Regression test, monitoring/alert, and a blameless postmortem.
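For the feature-flag path, mitigation can be as small as a render-time switch. A minimal kill-switch sketch in TypeScript/React; the `flags` client and the component names are hypothetical stand-ins for whatever flag service (LaunchDarkly, Unleash, a config endpoint) and checkout implementations a real codebase has:

```tsx
import React from "react";
import { flags } from "./feature-flags";           // hypothetical flag client
import { CheckoutV2 } from "./CheckoutV2";         // the new (suspect) code path
import { CheckoutLegacy } from "./CheckoutLegacy"; // the known-good code path

export function CheckoutPage(): React.ReactElement {
  // Flipping "checkout-v2" off in the flag dashboard reroutes every user to
  // the legacy path within seconds: mitigation with no deploy and no rollback.
  return flags.isEnabled("checkout-v2") ? <CheckoutV2 /> : <CheckoutLegacy />;
}
```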
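Release tagging is what lets you correlate an error spike with a specific deploy at a glance rather than by detective work. A sketch with `@sentry/browser`; the DSN and version string are placeholders:

```ts
import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
  release: "checkout@1.42.0", // stamp every error event with the deployed version
  environment: "production",
});
// In the Sentry UI, errors now group by release, so a spike that starts at
// checkout@1.42.0 answers "what changed in the last deploy" immediately.
```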
STAR example
Situation: After a Friday deploy, the checkout page went blank for ~15% of users — a third-party script started throwing inside a component with no error boundary.
Task: I was on call; my job was to restore checkout fast and prevent it from happening again.
Action: I checked Sentry, saw the error spike correlated exactly with the deploy, and rolled back within ~8 minutes — checkout was restored. I posted in the incident channel with impact and status. Then I reproduced the bug locally, found the unguarded third-party call, wrapped that area in an error boundary with a fallback (sketched below), made the script load non-blocking, and added a Sentry alert on checkout-page error rate.
Result: Total user impact was ~12 minutes. We shipped the real fix Monday with a regression test (example below). In the postmortem we made "error boundary around third-party code" and "no Friday-afternoon deploys to checkout" team standards.
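The containment piece of the fix is a standard React error boundary. A minimal sketch (types and fallback copy are illustrative); boundaries must be class components, as React provides no hook equivalent:

```tsx
import React from "react";

type Props = { fallback: React.ReactNode; children: React.ReactNode };
type State = { hasError: boolean };

export class ErrorBoundary extends React.Component<Props, State> {
  state: State = { hasError: false };

  // After a child throws during render, re-render with the fallback instead.
  static getDerivedStateFromError(): State {
    return { hasError: true };
  }

  componentDidCatch(error: Error, info: React.ErrorInfo): void {
    // Report the failure rather than letting React unmount the whole tree
    // to a blank page; swap in Sentry.captureException(error) if wired up.
    console.error("Contained render error:", error, info.componentStack);
  }

  render(): React.ReactNode {
    return this.state.hasError ? this.props.fallback : this.props.children;
  }
}
```

Wrapping only the third-party widget (`<ErrorBoundary fallback={...}><ThirdPartyWidget /></ErrorBoundary>`) means a script failure degrades one panel instead of blanking the route, and loading that script with `async`/`defer` keeps it off the critical render path.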
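The regression test then locks the behavior in: render a child that throws the way the third-party script did, and assert the fallback appears. A sketch with Jest and React Testing Library, reusing the `ErrorBoundary` above:

```tsx
import React from "react";
import { render, screen } from "@testing-library/react";
import { ErrorBoundary } from "./ErrorBoundary";

// Stand-in for the flaky third-party component from the incident.
function Exploding(): never {
  throw new Error("third-party script failure");
}

test("checkout shows a fallback, not a blank page, when a child throws", () => {
  // Silence React's expected console.error for this intentional throw.
  jest.spyOn(console, "error").mockImplementation(() => {});

  render(
    <ErrorBoundary fallback={<p>Payments are temporarily unavailable.</p>}>
      <Exploding />
    </ErrorBoundary>
  );

  // getByText throws if the fallback did not render.
  expect(screen.getByText("Payments are temporarily unavailable.")).toBeTruthy();
});
```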
What interviewers listen for
- Mitigate before diagnose — you don't debug a live outage while users suffer.
- Communication is a first-class action, not an afterthought.
- You used data (error tracking, the diff) to find root cause fast.
- Blameless — "the system let this happen," not "X person broke it."
- The story ends with systemic prevention, not just "I fixed the bug."
Senior framing
Juniors describe debugging. Seniors describe incident management: triage, mitigate, communicate, diagnose, prevent. The strongest answers show you turned one incident into lasting guardrails (error boundaries, alerts, deploy policy) — that's converting a bad day into permanent resilience.
Follow-up questions
- When do you roll back vs. fix forward?
- What is a blameless postmortem and why does it matter?
- How do error boundaries help contain UI failures?
- What monitoring would you add after this incident?
Common mistakes
- Trying to debug the root cause while the site is still down instead of rolling back.
- Not communicating with stakeholders during the incident.
- Stopping at "I fixed the bug" with no regression test or monitoring.
- Blaming a person instead of the process/system.
Edge cases
- Rollback isn't possible (a DB migration shipped with it) — you must fix forward carefully.
- Bug is in a third-party dependency — mitigate with a feature flag / error boundary while waiting on upstream.
Real-world examples
- Bad deploy blanks a page, a CDN/asset fails to load, a third-party script breaks, an API contract change crashes the UI.