A critical UI feature is failing during peak traffic—how do you mitigate the issue?
Incident response: stabilize first (rollback, feature-flag off, scale, shed load), communicate, then diagnose with metrics/logs. Peak-only failure points to load-dependent causes — backend saturation, rate limits, race conditions, memory exhaustion. Add resilience afterwards: graceful degradation, retries, caching, load testing.
A critical feature failing under peak load is an incident. The order matters: stabilize → communicate → diagnose → harden. Don't start with root-cause analysis while users are broken.
1. Stabilize — stop the bleeding (minutes)
Restore service first; understand it later.
- Roll back the most recent deploy if the failure correlates with one.
- Feature-flag the feature off (or to a degraded mode) — the rest of the app stays up. This is exactly why critical features should be behind kill switches (see the sketch after this list).
- Scale up if it's a capacity issue — more instances, raise limits, scale the DB.
- Shed load / degrade gracefully — serve cached/stale data, queue non-critical work, disable expensive operations, show a clear "experiencing high load" state instead of errors.
- Mitigate the obvious — bump a rate limit, add a circuit breaker, enable a CDN/edge cache for the hot path.
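To make the kill-switch bullet concrete, here is a minimal TypeScript sketch. The flag endpoint, flag name, and fallback strings are hypothetical, not a specific feature-flag SDK; the point is the three-state switch with a soft failure mode.

```typescript
// Hypothetical flag states: full feature, degraded fallback, or hard off.
type FlagState = "on" | "degraded" | "off";

async function getFlagState(flag: string): Promise<FlagState> {
  try {
    // Placeholder endpoint; in practice this is your flag service or SDK.
    const res = await fetch(`/internal/flags/${encodeURIComponent(flag)}`);
    if (!res.ok) return "degraded";
    return (await res.json()).state as FlagState;
  } catch {
    return "degraded"; // flag service unreachable: fail soft, not hard off
  }
}

async function renderRecommendations(): Promise<string> {
  switch (await getFlagState("recommendations")) {
    case "on": // normal path: live backend call
      return (await fetch("/api/recommendations")).text();
    case "degraded": // serve the last known-good payload, no backend load
      return localStorage.getItem("recs:last-good") ?? "Popular items";
    case "off": // a clear message beats a spinner that ends in an error
      return "Recommendations are temporarily unavailable.";
  }
}
```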
2. Communicate — in parallel
- Open an incident, assign an owner/commander.
- Update a status page and stakeholders; tell support.
- Keep a running incident log (timestamps, actions) — invaluable for the postmortem.
3. Diagnose — once it's stable
Peak-only failure means the cause is load-dependent. Use metrics/logs/traces:
- Backend saturation — DB connection pool exhausted, slow queries under load, CPU/memory maxed, downstream service timing out.
- Rate limits — your own API or a third party throttling at volume.
- Race conditions — only manifest under concurrency.
- Resource exhaustion — memory leaks, connection/socket limits, cache stampede (everything misses cache at once and hammers the origin).
- Frontend — too many requests fired on page load, no request deduplication, retry storms amplifying the problem, an N+1 request pattern.
- Correlate the failure start with deploys, traffic graphs, and dependency dashboards.
4. Fix the root cause (after the incident)
- Address what diagnosis found — query optimization, connection pooling, caching, fixing the race.
- Verify under load testing before calling it done (see the sketch below).
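One way to do that verification is a k6 script (k6 load-test scripts are plain JavaScript, which this sketch also is). The URL, stage targets, and thresholds are placeholders to replace with your real peak numbers.

```typescript
// Minimal k6 load test: ramp to peak-like concurrency, hold, ramp down.
import http from "k6/http";
import { check } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 500 }, // ramp to peak-like virtual users
    { duration: "5m", target: 500 }, // hold at peak
    { duration: "1m", target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_failed: ["rate<0.01"],   // fail the run if >1% errors
    http_req_duration: ["p(95)<500"], // fail the run if p95 exceeds 500ms
  },
};

export default function () {
  const res = http.get("https://staging.example.com/api/feature");
  check(res, { "status is 200": (r) => r.status === 200 });
}
```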
5. Harden so it doesn't recur
- Graceful degradation designed in — the feature should fail soft, not hard.
- Resilience patterns — retries with backoff+jitter, circuit breakers, timeouts, request dedup/coalescing, caching with stale-while-revalidate (a retry sketch follows this list).
- Load testing in CI/staging at realistic peak.
- Capacity planning & autoscaling; alerts that fire before full failure (saturation, error-rate, latency SLOs).
- Kill switches on all critical features.
- Blameless postmortem — what happened, why, action items.
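A minimal sketch of the retry-with-backoff-and-jitter pattern from the list above; attempt counts, caps, and retryable-status rules are illustrative.

```typescript
// Retry with exponential backoff and "full jitter": each client waits a
// random delay in [0, base * 2^attempt], so failed clients don't retry in
// lockstep and amplify the overload. Retries network errors, 429, and 5xx.
async function fetchWithRetry(
  url: string,
  maxAttempts = 4,
  baseDelayMs = 200,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetch(url);
      if (res.status < 500 && res.status !== 429) return res; // done or non-retryable
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // network failure: retryable
    }
    if (attempt < maxAttempts - 1) {
      const cap = Math.min(baseDelayMs * 2 ** attempt, 5_000); // cap the backoff
      await new Promise((r) => setTimeout(r, Math.random() * cap));
    }
  }
  throw lastError;
}
```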
The framing
"First priority is users, not curiosity — stabilize via rollback or kill switch, scale, and degrade gracefully, while communicating. Peak-only failure points to a load-dependent cause: backend saturation, rate limits, races, or cache stampede — I'd confirm with metrics and traces. Then fix the root cause, verify with load testing, and harden with degradation, circuit breakers, and pre-failure alerting."
Follow-up questions
- Why stabilize before diagnosing?
- What load-dependent causes would you suspect for a peak-only failure?
- What is a cache stampede and how do you prevent it?
- What resilience patterns prevent this class of incident?
Common mistakes
- Debugging root cause while users are still broken instead of stabilizing first.
- No kill switch, so the only option is a risky hotfix under pressure.
- Frontend retry storms amplifying a backend overload.
- No load testing, so peak behavior is never validated.
- Skipping the blameless postmortem.
Performance considerations
- Peak failures are fundamentally performance/capacity events. Caching, request dedup, connection pooling, and circuit breakers reduce load amplification. Retry-with-jitter prevents synchronized retry storms. Load testing surfaces the ceiling before users do. A minimal circuit-breaker sketch follows.
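This sketch, assuming illustrative thresholds and no half-open probing state, shows the core of a client-side circuit breaker: after repeated failures, calls fail fast instead of piling onto a struggling backend.

```typescript
// After `threshold` consecutive failures, the circuit opens and calls
// fail immediately for `cooldownMs`, shedding load from the backend.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;
  constructor(private threshold = 5, private cooldownMs = 10_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) {
        this.openUntil = Date.now() + this.cooldownMs; // trip the breaker
        this.failures = 0;
      }
      throw err;
    }
  }
}

// Usage: while open, callers fail in microseconds instead of holding
// connections against an overloaded backend.
const breaker = new CircuitBreaker();
const loadFeature = () => breaker.call(() => fetch("/api/feature"));
```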
Edge cases
- Failure caused by a third-party dependency, not your code.
- Cache stampede after a cache flush or deploy.
- Rollback not possible (DB migration already applied).
- Cascading failure across multiple services.
Real-world examples
- Feature-flagging a failing checkout step into a degraded mode while diagnosing.
- A cache stampede after deploy fixed with request coalescing + staggered TTLs (sketched below).
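A sketch of what that coalescing fix might look like, assuming a simple in-memory map of in-flight requests; the key format, TTL values, and `getUser` helper are illustrative.

```typescript
// Concurrent callers for the same key share one in-flight promise, so a
// cold or freshly flushed cache produces one origin request instead of N.
const inFlight = new Map<string, Promise<unknown>>();

function coalesce<T>(key: string, load: () => Promise<T>): Promise<T> {
  const pending = inFlight.get(key);
  if (pending) return pending as Promise<T>;
  const p = load().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// Staggered TTLs: add jitter so entries written at the same time don't
// all expire at the same instant and stampede the origin together.
function staggeredTtl(baseMs: number, jitterRatio = 0.2): number {
  return baseMs + Math.random() * baseMs * jitterRatio;
}

// Usage: a hundred simultaneous renders trigger a single fetch.
const getUser = (id: string) =>
  coalesce(`user:${id}`, () =>
    fetch(`/api/users/${id}`).then((r) => r.json()),
  );
```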