If checkout goes down during a high-traffic sale, what do your first 10 minutes look like?
Incident-response question. First: mitigate, don't debug — acknowledge, assess blast radius, communicate. Roll back / flip the feature flag to restore service FIRST, then root-cause. Show calm prioritization: stop revenue loss, keep stakeholders informed, only diagnose once users are unblocked.
This is an incident-response question. The interviewer wants to see calm prioritization under pressure — and the core instinct: mitigate first, debug later.
The first 10 minutes — in order
1. Acknowledge & assess (first ~1–2 min).
- Confirm it's real and scope the blast radius: all users or some? All checkout or one step? Which regions/browsers? What's the actual revenue/user impact?
- Check dashboards, error rates, recent deploys. The single most useful question: "what changed?" — a deploy, a config change, a third-party (payment provider) outage.
2. Communicate immediately.
- Open an incident channel, page the right people, declare an incident. Don't debug silently.
- Notify stakeholders (support, PM, leadership) — during a sale, the business needs to know now. Support needs a line for customers.
3. Mitigate — restore service, do NOT root-cause yet. This is the heart of the answer. During a high-traffic sale, every minute is lost revenue — getting checkout working beats understanding why it broke.
- Recent deploy? → roll back. Fastest, safest.
- A specific feature/change? → flip the feature flag off.
- Infra/scaling? → scale up, shed load, enable a fallback path.
- Third-party (payment gateway) down? → fail over to a backup provider, or queue/degrade gracefully.
Pick the fastest lever that restores service even if you don't yet fully understand the cause (a code sketch of these levers follows the list).
4. Verify recovery.
- Confirm via monitoring + a real test transaction that checkout works again (a synthetic-probe sketch follows below). Update the incident channel and stakeholders: "mitigated, service restored, investigating root cause."
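A minimal sketch of the step-3 levers, assuming a hypothetical flag store and a backup payment gateway. The interfaces, flag name, and function names are illustrative, not any specific vendor's SDK.

```typescript
// Two mitigation levers in one checkout path: a kill switch (feature flag)
// and failover to a backup payment provider. All names are hypothetical.
interface PaymentGateway {
  charge(orderId: string, amountCents: number): Promise<{ ok: boolean }>;
}

interface FlagStore {
  isEnabled(flag: string): Promise<boolean>;
}

async function chargeOrder(
  flags: FlagStore,
  primary: PaymentGateway,
  backup: PaymentGateway,
  orderId: string,
  amountCents: number,
): Promise<{ ok: boolean; via: "primary" | "backup" }> {
  // Lever 1: kill switch. During the incident an operator flips
  // "checkout.primary-gateway" off in the flag store; no deploy needed,
  // so recovery is near-instant.
  const usePrimary = await flags.isEnabled("checkout.primary-gateway");

  if (usePrimary) {
    try {
      const result = await primary.charge(orderId, amountCents);
      if (result.ok) return { ok: true, via: "primary" };
    } catch {
      // Primary gateway errored; fall through to the backup provider.
    }
  }

  // Lever 2: fail over to the backup provider when the primary is down
  // or has been flagged off by the incident commander.
  const result = await backup.charge(orderId, amountCents);
  return { ok: result.ok, via: "backup" };
}
```

The point isn't this exact shape; it's that the lever exists before the sale starts, so mitigation is a config flip rather than a code change.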
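And a sketch of the step-4 recovery check: a synthetic test transaction against a hypothetical /api/checkout endpoint, using a test SKU and a standard test card number so nothing real is charged. The endpoint and payload are assumptions about your stack.

```typescript
// Synthetic checkout probe: a 2xx response with a confirmation id counts as
// "checkout works again"; anything else keeps the incident open.
async function verifyCheckoutRecovered(baseUrl: string): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/checkout`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      sku: "TEST-SKU-0000",      // synthetic item reserved for probes
      card: "4111111111111111",  // well-known test card number, never settled
      quantity: 1,
    }),
  });
  if (!res.ok) return false;
  const body = (await res.json()) as { confirmationId?: string };
  return Boolean(body.confirmationId);
}
```

Run it a few times, ideally from more than one region, before posting "mitigated, service restored" in the incident channel.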
Only AFTER the bleeding stops
Root-cause analysis, the proper fix, and a blameless postmortem — what happened, why, and what guardrail prevents a repeat (better alerting, canary deploys, load testing, a circuit breaker on the payment provider).
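For the circuit-breaker guardrail mentioned above, a minimal hand-rolled sketch. Thresholds and naming are illustrative; in practice you'd reach for an existing resilience library rather than writing your own.

```typescript
// Circuit breaker around the payment provider: after repeated failures it
// "opens" and routes to a fallback (backup provider, queueing) instead of
// hammering a dead dependency, then probes again after a cool-down.
type BreakerState = "closed" | "open" | "half-open";

class PaymentCircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,       // trip after 5 consecutive errors
    private readonly resetAfterMs = 30_000, // probe the provider again after 30s
  ) {}

  async call<T>(request: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    // While open, skip the provider entirely and use the fallback path.
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetAfterMs) return fallback();
      this.state = "half-open"; // let one request through as a probe
    }
    try {
      const result = await request();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.maxFailures || this.state === "half-open") {
        this.state = "open";
        this.openedAt = Date.now();
      }
      return fallback();
    }
  }
}
```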
What signals seniority
- Mitigate before diagnose — the instinct to roll back/flag-off first. Junior instinct is to start reading code.
- Communication is part of incident response, not an afterthought.
- Calm, structured prioritization — revenue impact drives urgency.
- "What changed?" as the fastest diagnostic shortcut.
- Treating the postmortem as where the real value is.
The framing
"Mitigate first, debug later. First minute or two: confirm it's real, scope the blast radius, and ask 'what changed' — a deploy, a config, a payment-provider outage. Immediately declare an incident and communicate — support and leadership need to know during a sale. Then I restore service with the fastest lever: roll back the deploy, flip the feature flag, fail over the payment provider — without fully root-causing yet, because every minute is lost revenue. Verify recovery with a real transaction, update everyone. Only once users are unblocked do I do root-cause analysis and a blameless postmortem to add the guardrail."
Follow-up questions
- Why mitigate before finding the root cause?
- What's the fastest way to restore service if a deploy caused it?
- Who do you communicate with during the incident, and when?
- What happens after service is restored?
Common mistakes
- Diving into the code to debug before restoring service.
- Not communicating — debugging silently while revenue bleeds.
- Not checking 'what changed' (recent deploys/config) first.
- Skipping the postmortem once the fire is out.
- Panicking instead of working a structured checklist.
Performance considerations
- The relevant metric is mean-time-to-recovery (MTTR). Fast rollback paths, feature flags, payment-provider failover, and good alerting all reduce MTTR — that's the systemic prep an incident reveals.
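As a toy illustration of the alerting side of that prep, a sliding-window error-rate check that could page or trigger an automated rollback. The window size, threshold, and what "trigger rollback" actually does are all assumptions about your setup.

```typescript
// Returns true when the checkout error rate over the last window crosses the
// threshold, i.e. when a human should be paged or a rollback kicked off.
function shouldTriggerRollback(
  recentRequests: { ok: boolean; timestampMs: number }[],
  windowMs = 60_000,          // look at the last 60 seconds
  errorRateThreshold = 0.05,  // 5% errors over the window
): boolean {
  const cutoff = Date.now() - windowMs;
  const windowed = recentRequests.filter((r) => r.timestampMs >= cutoff);
  if (windowed.length === 0) return false;
  const errors = windowed.filter((r) => !r.ok).length;
  return errors / windowed.length >= errorRateThreshold;
}
```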
Edge cases
- Root cause is a third-party (payment gateway) outage you don't control.
- Rollback isn't possible (a data migration shipped with the deploy), so mitigation has to be a flag-off or a fix-forward instead.
- Only a subset of users affected — still an incident.
- Multiple changes deployed together — hard to isolate.
Real-world examples
- Rolling back a deploy within minutes when error rates spike post-release.
- Failing over to a backup payment provider when the primary gateway has an outage during a sale.