How would you implement a robust frontend monitoring and logging system?
Capture errors (window handlers + boundaries), performance (Core Web Vitals via RUM), and structured logs/breadcrumbs; enrich with context (user, route, release, session); sample and rate-limit; route to a backend (Sentry/Datadog); add session replay and alerting on SLOs. Mind privacy.
A robust frontend monitoring system answers three questions in production: Is it broken? Is it slow? What were users doing when it happened? That means errors, performance, and behavioral context — collected, enriched, and routed somewhere actionable.
1. Error tracking
- Global handlers: `window.onerror`/`addEventListener('error')` for uncaught errors and resource failures; `unhandledrejection` for promise rejections.
- React error boundaries → report in `componentDidCatch` with the component stack.
- Manual capture in `try/catch` around risky event handlers and async code (boundaries don't catch those).
- Source maps uploaded to the backend so minified stacks are readable.
- Deduplicate and group identical errors; track error rate, not just count.
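A minimal sketch of the grouping idea: fingerprint errors by message plus top stack frames, and cap how many duplicates of each group get forwarded. Function names like `report` are illustrative, not any SDK's API; the browser wiring is shown in comments since it only runs in a page.

```typescript
// Group identical errors by a fingerprint of message + top stack frames,
// so one recurring bug maps to one issue instead of thousands of events.
function fingerprint(e: { message: string; stack?: string }): string {
  const topFrames = (e.stack ?? "").split("\n").slice(0, 3).join("|");
  return `${e.message}::${topFrames}`;
}

const seenCounts = new Map<string, number>();

// Returns true if this occurrence should be sent; later duplicates are
// only counted locally (a real SDK would still report the running count).
function report(e: { message: string; stack?: string }): boolean {
  const key = fingerprint(e);
  const n = (seenCounts.get(key) ?? 0) + 1;
  seenCounts.set(key, n);
  return n <= 3;
}

// Browser wiring (sketch only):
// window.addEventListener("error", (ev) => { if (report(ev.error ?? { message: ev.message })) send(ev); }, true);
// window.addEventListener("unhandledrejection", (ev) => { if (report({ message: String(ev.reason) })) send(ev); });
```

Tracking the count per fingerprint also gives you error *rate* per group, not just raw event volume.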
2. Performance (RUM — real user monitoring)
- Core Web Vitals — LCP, INP, CLS via the `web-vitals` library, reported from real sessions.
- Custom marks/measures (`performance.mark`), API latency, route-transition timing, long tasks.
- Report p75/p95, segmented by route, device, region — averages hide the pain.
3. Structured logging & breadcrumbs
- Breadcrumbs — a rolling trail of recent actions (clicks, navigations, network calls, state changes) attached to each error so you can see how the user got there.
- Structured logs — JSON with consistent fields and levels (debug/info/warn/error), not free-text `console.log`.
- Network logging — failed requests, status codes, timing.
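A breadcrumb trail is essentially a bounded ring buffer of recent actions, attached to each error report. A minimal sketch (class and field names are illustrative):

```typescript
type Breadcrumb = { ts: number; type: string; detail: string };

// Rolling trail of the last N user actions; oldest entries are dropped
// so memory stays bounded no matter how long the session runs.
class BreadcrumbTrail {
  private buf: Breadcrumb[] = [];
  constructor(private readonly max = 20) {}

  add(type: string, detail: string): void {
    this.buf.push({ ts: Date.now(), type, detail });
    if (this.buf.length > this.max) this.buf.shift(); // evict oldest
  }

  // Copied on read so the error payload can't be mutated later.
  snapshot(): Breadcrumb[] {
    return [...this.buf];
  }
}
```

On every click, navigation, or network call you `add(...)` a crumb; when an error fires, `snapshot()` goes into the report.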
4. Context enrichment (what makes reports actionable)
Every event carries: release/version, route/URL, user/session id (or anonymized id), browser/device/OS, feature-flag state, viewport, connection type, timestamp. An error without context is nearly useless; "this error, on release 1.4.2, on the checkout route, for users on Safari" is a bug ticket.
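The enrichment step can be as simple as merging one shared context object into every outgoing event. A hedged sketch with assumed field names (in a browser, fields like `browser` and `viewport` would come from `navigator` and `window`; here they are injected so the logic stays testable):

```typescript
type BaseContext = {
  release: string;     // app version, e.g. from a build-time constant
  route: string;       // current route/URL
  sessionId: string;   // anonymized session identifier
  browser: string;     // derived from the user agent
  viewport: string;    // e.g. "390x844"
  connection?: string; // e.g. navigator.connection?.effectiveType
};

// Merge shared context into any event so the backend can filter
// by release, route, and environment. Context keys win on collision.
function enrich<T extends object>(event: T, ctx: BaseContext) {
  return { ...event, ...ctx, timestamp: Date.now() };
}
```

Every capture path (errors, vitals, logs) funnels through the same `enrich` call, so no event leaves the app without its context.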
5. Transport, sampling, reliability
- Batch events; send via `navigator.sendBeacon` (survives page unload) or `fetch` with `keepalive`.
- Sample high-volume data (e.g. 100% of errors, X% of performance/replay) to control cost and load.
- Rate-limit so an error loop doesn't DoS your own ingestion or the user's network.
- Buffer offline, flush on reconnect. Monitoring must never break or slow the app.
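The rate-limiting piece can be sketched as a fixed-window counter (a simplification; real SDKs use more nuanced strategies), with the `sendBeacon`/`keepalive` transport indicated in comments since it is browser-only:

```typescript
// Fixed-window rate limiter: at most `max` events per `windowMs`,
// so an error loop can't flood ingestion or the user's network.
class RateLimiter {
  private count = 0;
  private windowStart = -Infinity;
  constructor(private readonly max: number, private readonly windowMs: number) {}

  allow(now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now; // start a new window
      this.count = 0;
    }
    return ++this.count <= this.max;
  }
}

// Browser transport (sketch): batch events, prefer sendBeacon so the
// payload survives page unload, fall back to fetch with keepalive.
// function flush(batch: unknown[]): void {
//   const body = JSON.stringify(batch);
//   if (!navigator.sendBeacon("/ingest", body)) {
//     fetch("/ingest", { method: "POST", body, keepalive: true });
//   }
// }
```

Events that fail the `allow` check are dropped (or counted locally) rather than retried, which is what keeps the failure mode safe.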
6. Tooling
Don't build the backend — use Sentry, Datadog RUM, LogRocket, New Relic, Grafana Faro. Add session replay (LogRocket/Sentry) to watch what happened. Wrap the vendor SDK in your own thin module so you can swap it.
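The thin-wrapper idea in sketch form: app code depends on a small interface, and only one module knows which vendor SDK sits behind it. The interface shape below is an assumption for illustration, not any vendor's API; the in-memory implementation stands in for a real Sentry/Datadog adapter.

```typescript
// Vendor-agnostic facade: application code calls Monitor, never the
// vendor SDK directly, so swapping Sentry for Datadog touches one module.
interface Monitor {
  captureError(e: Error, extra?: Record<string, unknown>): void;
  addBreadcrumb(type: string, detail: string): void;
}

// Stand-in implementation; a real adapter would delegate to the SDK,
// e.g. Sentry.captureException / Sentry.addBreadcrumb.
class InMemoryMonitor implements Monitor {
  public events: string[] = [];
  captureError(e: Error): void {
    this.events.push(`error:${e.message}`);
  }
  addBreadcrumb(type: string, detail: string): void {
    this.events.push(`crumb:${type}:${detail}`);
  }
}
```

The payoff shows up at migration time: only the adapter module changes, and the facade is also the natural place to hang sampling, rate-limiting, and PII scrubbing.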
7. Alerting & dashboards
- Alert on SLOs: error rate spike, Core Web Vitals regression, a new error type, a crash-rate threshold — routed to Slack/PagerDuty.
- Dashboards for error trends, vitals over releases, top errors.
- Release tracking — tie metrics to deploys so you catch a bad release fast.
8. Privacy
- Scrub PII before sending — mask inputs, redact tokens, anonymize ids in replay.
- Respect consent (GDPR), data residency, and don't capture sensitive fields.
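PII scrubbing can be sketched as regex-based redaction applied to every payload before transport; the patterns below are illustrative and deliberately far from exhaustive:

```typescript
// Redact obvious secrets and PII before events leave the browser.
// Real deployments combine this with field-level allowlists and
// input masking in session replay; regexes alone are not sufficient.
const PATTERNS: Array<[RegExp, string]> = [
  [/Bearer\s+[\w.-]+/g, "Bearer [REDACTED]"],          // auth tokens
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]"],             // email addresses
  [/\b\d{13,19}\b/g, "[CARD?]"],                       // possible card numbers
];

function scrub(text: string): string {
  return PATTERNS.reduce((t, [re, repl]) => t.replace(re, repl), text);
}
```

Run `scrub` over messages, breadcrumb details, and URLs in the transport layer, so nothing bypasses it.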
The framing
"Three pillars — errors, performance (RUM), and behavioral context (breadcrumbs/replay) — every event enriched with release/route/user/device, sampled and rate-limited, sent reliably via sendBeacon, routed to a tool like Sentry with alerting on SLOs and release tracking. And PII scrubbed throughout. The monitoring system itself must be lightweight and never able to break the app."
Follow-up questions
- Why enrich every event with release/route/device context?
- Why use sendBeacon instead of fetch for telemetry?
- How do you keep monitoring from impacting performance or flooding on an error loop?
- How do you handle PII in error reports and session replay?
Common mistakes
- Only catching React errors, missing window/async/resource errors.
- Reporting errors with no context — unactionable.
- No sampling or rate-limiting — cost blowup and error-loop floods.
- Sending PII/tokens to the monitoring backend.
- Not uploading source maps, so stacks are unreadable.
Performance considerations
- Telemetry must be lightweight — batch, sample, send async via sendBeacon, never block the main thread. Rate-limiting prevents an error storm from flooding the network. Session replay is heavy — sample it heavily.
Edge cases
- Errors during page unload (need sendBeacon/keepalive).
- Offline users — buffer and flush on reconnect.
- An error loop generating thousands of events.
- Source map mismatch after a deploy.
Real-world examples
- Sentry capturing errors + breadcrumbs + release tracking; web-vitals feeding a RUM dashboard.
- LogRocket session replay scrubbed of PII, linked from each error.