How would you implement a robust frontend monitoring and logging system?
Capture errors (window handlers + boundaries), performance (Core Web Vitals via RUM), and structured logs/breadcrumbs; enrich with context (user, route, release, session); sample and rate-limit; route to a backend (Sentry/Datadog); add session replay and alerting on SLOs. Mind privacy.
A robust frontend monitoring system answers three questions in production: Is it broken? Is it slow? What were users doing when it happened? That means errors, performance, and behavioral context — collected, enriched, and routed somewhere actionable.
1. Error tracking
- Global handlers: `window.onerror`/`addEventListener('error')` for uncaught errors and resource failures; `unhandledrejection` for promise rejections.
- React error boundaries → report in `componentDidCatch` with the component stack.
- Manual capture in `try/catch` around risky event handlers and async code (boundaries don't catch those).
- Source maps uploaded to the backend so minified stacks are readable.
- Deduplicate and group identical errors; track error rate, not just count.
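A minimal sketch of the grouping idea: fingerprint errors by message plus top stack frames, and cap how many duplicates of each group get forwarded. Function names like `report` are illustrative, not any SDK's API; the browser wiring is shown in comments since it only runs in a page.

```typescript
// Group identical errors by a fingerprint of message + top stack frames,
// so one recurring bug maps to one issue instead of thousands of events.
function fingerprint(e: { message: string; stack?: string }): string {
  const topFrames = (e.stack ?? "").split("\n").slice(0, 3).join("|");
  return `${e.message}::${topFrames}`;
}

const seenCounts = new Map<string, number>();

// Returns true if this occurrence should be sent; later duplicates are
// only counted locally (a real SDK would still report the running count).
function report(e: { message: string; stack?: string }): boolean {
  const key = fingerprint(e);
  const n = (seenCounts.get(key) ?? 0) + 1;
  seenCounts.set(key, n);
  return n <= 3;
}

// Browser wiring (sketch only):
// window.addEventListener("error", (ev) => { if (report(ev.error ?? { message: ev.message })) send(ev); }, true);
// window.addEventListener("unhandledrejection", (ev) => { if (report({ message: String(ev.reason) })) send(ev); });
```

Tracking the count per fingerprint also gives you error *rate* per group, not just raw event volume.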
2. Performance (RUM — real user monitoring)
- Core Web Vitals — LCP, INP, CLS via the `web-vitals` library, reported from real sessions.
- Custom marks/measures (`performance.mark`), API latency, route-transition timing, long tasks.
- Report p75/p95, segmented by route, device, region — averages hide the pain.
3. Structured logging & breadcrumbs
- Breadcrumbs — a rolling trail of recent actions (clicks, navigations, network calls, state changes) attached to each error so you can see how the user got there.
- Structured logs — JSON with consistent fields and levels (debug/info/warn/error), not free-text `console.log`.
- Network logging — failed requests, status codes, timing.
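A breadcrumb trail is essentially a bounded ring buffer of recent actions, attached to each error report. A minimal sketch (class and field names are illustrative):

```typescript
type Breadcrumb = { ts: number; type: string; detail: string };

// Rolling trail of the last N user actions; oldest entries are dropped
// so memory stays bounded no matter how long the session runs.
class BreadcrumbTrail {
  private buf: Breadcrumb[] = [];
  constructor(private readonly max = 20) {}

  add(type: string, detail: string): void {
    this.buf.push({ ts: Date.now(), type, detail });
    if (this.buf.length > this.max) this.buf.shift(); // evict oldest
  }

  // Copied on read so the error payload can't be mutated later.
  snapshot(): Breadcrumb[] {
    return [...this.buf];
  }
}
```

On every click, navigation, or network call you `add(...)` a crumb; when an error fires, `snapshot()` goes into the report.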
4. Context enrichment (what makes reports actionable)
Every event carries: release/version, route/URL, user/session id (or anonymized id), browser/device/OS, feature-flag state, viewport, connection type, timestamp. An error without context is nearly useless; "this error, on release 1.4.2, on the checkout route, for users on Safari" is a bug ticket.
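The enrichment step can be as simple as merging one shared context object into every outgoing event. A hedged sketch with assumed field names (in a browser, fields like `browser` and `viewport` would come from `navigator` and `window`; here they are injected so the logic stays testable):

```typescript
type BaseContext = {
  release: string;     // app version, e.g. from a build-time constant
  route: string;       // current route/URL
  sessionId: string;   // anonymized session identifier
  browser: string;     // derived from the user agent
  viewport: string;    // e.g. "390x844"
  connection?: string; // e.g. navigator.connection?.effectiveType
};

// Merge shared context into any event so the backend can filter
// by release, route, and environment. Context keys win on collision.
function enrich<T extends object>(event: T, ctx: BaseContext) {
  return { ...event, ...ctx, timestamp: Date.now() };
}
```

Every capture path (errors, vitals, logs) funnels through the same `enrich` call, so no event leaves the app without its context.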
5. Transport, sampling, reliability
- Batch events; send via `navigator.sendBeacon` (survives page unload) or `fetch` with `keepalive`.
- Sample high-volume data (e.g. 100% of errors, X% of performance/replay) to control cost and load.
- Rate-limit so an error loop doesn't DoS your own ingestion or the user's network.
- Buffer offline, flush on reconnect. Monitoring must never break or slow the app.
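The rate-limiting piece can be sketched as a fixed-window counter (a simplification; real SDKs use more nuanced strategies), with the `sendBeacon`/`keepalive` transport indicated in comments since it is browser-only:

```typescript
// Fixed-window rate limiter: at most `max` events per `windowMs`,
// so an error loop can't flood ingestion or the user's network.
class RateLimiter {
  private count = 0;
  private windowStart = -Infinity;
  constructor(private readonly max: number, private readonly windowMs: number) {}

  allow(now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now; // start a new window
      this.count = 0;
    }
    return ++this.count <= this.max;
  }
}

// Browser transport (sketch): batch events, prefer sendBeacon so the
// payload survives page unload, fall back to fetch with keepalive.
// function flush(batch: unknown[]): void {
//   const body = JSON.stringify(batch);
//   if (!navigator.sendBeacon("/ingest", body)) {
//     fetch("/ingest", { method: "POST", body, keepalive: true });
//   }
// }
```

Events that fail the `allow` check are dropped (or counted locally) rather than retried, which is what keeps the failure mode safe.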
6. Tooling
Don't build the backend — use Sentry, Datadog RUM, LogRocket, New Relic, Grafana Faro. Add session replay (LogRocket/Sentry) to watch what happened. Wrap the vendor SDK in your own thin module so you can swap it.
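The thin-wrapper idea in sketch form: app code depends on a small interface, and only one module knows which vendor SDK sits behind it. The interface shape below is an assumption for illustration, not any vendor's API; the in-memory implementation stands in for a real Sentry/Datadog adapter.

```typescript
// Vendor-agnostic facade: application code calls Monitor, never the
// vendor SDK directly, so swapping Sentry for Datadog touches one module.
interface Monitor {
  captureError(e: Error, extra?: Record<string, unknown>): void;
  addBreadcrumb(type: string, detail: string): void;
}

// Stand-in implementation; a real adapter would delegate to the SDK,
// e.g. Sentry.captureException / Sentry.addBreadcrumb.
class InMemoryMonitor implements Monitor {
  public events: string[] = [];
  captureError(e: Error): void {
    this.events.push(`error:${e.message}`);
  }
  addBreadcrumb(type: string, detail: string): void {
    this.events.push(`crumb:${type}:${detail}`);
  }
}
```

The payoff shows up at migration time: only the adapter module changes, and the facade is also the natural place to hang sampling, rate-limiting, and PII scrubbing.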
7. Alerting & dashboards
- Alert on SLOs: error rate spike, Core Web Vitals regression, a new error type, a crash-rate threshold — routed to Slack/PagerDuty.
- Dashboards for error trends, vitals over releases, top errors.
- Release tracking — tie metrics to deploys so you catch a bad release fast.
8. Privacy
- Scrub PII before sending — mask inputs, redact tokens, anonymize ids in replay.
- Respect consent (GDPR), data residency, and don't capture sensitive fields.
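PII scrubbing can be sketched as regex-based redaction applied to every payload before transport; the patterns below are illustrative and deliberately far from exhaustive:

```typescript
// Redact obvious secrets and PII before events leave the browser.
// Real deployments combine this with field-level allowlists and
// input masking in session replay; regexes alone are not sufficient.
const PATTERNS: Array<[RegExp, string]> = [
  [/Bearer\s+[\w.-]+/g, "Bearer [REDACTED]"],          // auth tokens
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]"],             // email addresses
  [/\b\d{13,19}\b/g, "[CARD?]"],                       // possible card numbers
];

function scrub(text: string): string {
  return PATTERNS.reduce((t, [re, repl]) => t.replace(re, repl), text);
}
```

Run `scrub` over messages, breadcrumb details, and URLs in the transport layer, so nothing bypasses it.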
The framing
"Three pillars — errors, performance (RUM), and behavioral context (breadcrumbs/replay) — every event enriched with release/route/user/device, sampled and rate-limited, sent reliably via sendBeacon, routed to a tool like Sentry with alerting on SLOs and release tracking. And PII scrubbed throughout. The monitoring system itself must be lightweight and never able to break the app."
Follow-up questions
- Why enrich every event with release/route/device context?
- Why use sendBeacon instead of fetch for telemetry?
- How do you keep monitoring from impacting performance or flooding on an error loop?
- How do you handle PII in error reports and session replay?
Common mistakes
- Only catching React errors, missing window/async/resource errors.
- Reporting errors with no context — unactionable.
- No sampling or rate-limiting — cost blowup and error-loop floods.
- Sending PII/tokens to the monitoring backend.
- Not uploading source maps, so stacks are unreadable.
Performance considerations
- Telemetry must be lightweight — batch, sample, send async via sendBeacon, never block the main thread. Rate-limiting prevents an error storm from flooding the network. Session replay is heavy — sample it heavily.
Edge cases
- Errors during page unload (need sendBeacon/keepalive).
- Offline users — buffer and flush on reconnect.
- An error loop generating thousands of events.
- Source map mismatch after a deploy.
Real-world examples
- Sentry capturing errors + breadcrumbs + release tracking; web-vitals feeding a RUM dashboard.
- LogRocket session replay scrubbed of PII, linked from each error.