Build a real-time order tracker with WebSockets
WebSocket connection authenticated at handshake, subscribed to a per-order channel. Server pushes status events; client merges into local state (TanStack Query cache or a simple reducer). Handle reconnection with backoff + resubscription, request a snapshot on connect to fill missed events, dedupe by event id, and fall back to polling on persistent failure. Consider SSE for one-way streams — simpler infrastructure, automatic reconnect.
A real-time tracker has three system properties:
- Push — server drives updates (not polling).
- Per-order subscription — only the orders the user cares about.
- Resumable — reconnect + catch-up on missed events.
Transport choice
| Option | Pros | Cons |
|---|---|---|
| WebSocket | Bidirectional, low overhead per message, mature | Need sticky sessions / pub-sub backend; reconnect logic by hand |
| SSE (Server-Sent Events) | One-way, automatic reconnect, plain HTTP, easy CDN | One-way only; some proxies buffer |
| Long polling | Works everywhere | Inefficient; latency higher |
| HTTP/3 + WebTransport | Modern, multiplexed, datagrams + streams | Browser support patchy in 2026 |
For an order tracker (server pushes, client mostly listens), SSE is the right default. If you also need client → server messages (live chat with the driver, a "delivered" confirmation from the rider), use WebSocket.
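If SSE is the pick, the client side is nearly free — the browser's built-in `EventSource` reconnects on its own and resends the last seen event id in the `Last-Event-ID` request header, so the server can resume the stream. A minimal sketch; the endpoint path and merge helper are hypothetical:

```ts
// Hypothetical endpoint and merge helper, for illustration only.
const orderId = "abc123";
function applyEvent(event: unknown) { /* merge into local state */ }

const source = new EventSource(`/orders/${orderId}/stream`);
source.onmessage = (e) => {
  applyEvent(JSON.parse(e.data)); // server sets `id:` to the seq number
};
source.onerror = () => {
  // EventSource retries on its own; just surface "reconnecting" in the UI.
};
```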
Architecture
```
Client ── WS ──► Edge gateway ─── pub/sub ───► Order service
                 (sticky to a     (Redis Pub/Sub,   (publishes events
                  pod via cookie)  NATS, Kafka, or    on order changes)
                                   a managed service:
                                   Pusher, Ably,
                                   Liveblocks)
```
The gateway holds the long-lived connection. The order service is stateless and publishes events to a pub/sub layer. The gateway subscribes to the topic for each connected user and forwards events.
Authentication
WebSocket handshake is an HTTP upgrade — pass auth via:
- Cookie (HttpOnly auth cookie) — automatic, but watch CSRF on the upgrade endpoint.
- Token in Sec-WebSocket-Protocol subprotocol header — works, slightly weird.
- Query param — avoid, logged everywhere.
- First-message auth — connect, send `{ type: "auth", token }`; server rejects if invalid.
Re-validate auth on token rotation; don't keep an indefinitely-old session alive.
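A sketch of first-message auth with a deadline, assuming the `ws` package; `verifyToken` is a hypothetical stand-in for your real session or JWT check:

```ts
import { WebSocketServer } from "ws";

// Hypothetical verifier — swap in your real session/JWT validation.
async function verifyToken(token: string): Promise<{ userId: string } | null> {
  return token ? { userId: "user-1" } : null;
}

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  // Unauthenticated sockets get a short deadline, not a free connection.
  const deadline = setTimeout(() => socket.close(4401, "auth timeout"), 5000);

  socket.once("message", async (raw) => {
    const msg = JSON.parse(raw.toString());
    const session = msg.type === "auth" ? await verifyToken(msg.token) : null;
    if (!session) return socket.close(4401, "unauthorized");
    clearTimeout(deadline);
    // ...attach session, start accepting subscribe messages
  });
});
```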
Subscription model
```
client: { type: "subscribe", channel: "order:abc123" }
server: { type: "ok", channel: "order:abc123" }
server: { type: "event", channel: "order:abc123", seq: 42, payload: {...} }
```
Per-channel sequence numbers (`seq`) are the single most important piece. They let you detect missed events on reconnect.
Reconnection
```ts
const WS_URL = "wss://example.com/realtime"; // hypothetical endpoint

type Listener = (payload: unknown) => void;

class RealtimeClient {
  ws: WebSocket | null = null;
  backoff = 1000;
  lastSeq: Map<string, number> = new Map();
  subscriptions = new Set<string>();
  listeners: Map<string, Set<Listener>> = new Map();

  connect() {
    this.ws = new WebSocket(WS_URL);
    this.ws.onopen = () => {
      this.backoff = 1000;
      for (const ch of this.subscriptions) {
        const since = this.lastSeq.get(ch); // resume from last seen event
        this.send({ type: "subscribe", channel: ch, since });
      }
    };
    this.ws.onmessage = (e) => this.handle(JSON.parse(e.data));
    this.ws.onclose = () => {
      this.ws = null;
      // Full jitter: random delay in [0, backoff) avoids thundering herds.
      setTimeout(() => this.connect(), Math.random() * this.backoff);
      this.backoff = Math.min(this.backoff * 2, 30_000);
    };
  }

  subscribe(channel: string, fn: Listener) {
    this.subscriptions.add(channel);
    let set = this.listeners.get(channel);
    if (!set) this.listeners.set(channel, (set = new Set()));
    set.add(fn);
    this.send({ type: "subscribe", channel, since: this.lastSeq.get(channel) });
  }

  send(msg: object) {
    if (this.ws?.readyState === WebSocket.OPEN) this.ws.send(JSON.stringify(msg));
  }

  handle(msg: any) {
    if (msg.type === "event") {
      const last = this.lastSeq.get(msg.channel) ?? 0;
      if (msg.seq <= last) return; // dedupe replayed or duplicate events
      this.lastSeq.set(msg.channel, msg.seq);
      this.emit(msg.channel, msg.payload);
    }
  }

  emit(channel: string, payload: unknown) {
    for (const fn of this.listeners.get(channel) ?? []) fn(payload);
  }
}
```
- Exponential backoff with jitter — don't thunder-herd reconnects after a server outage.
- `since` on resubscribe — the server replays events newer than `since` from its buffer (Redis Streams, Kafka offset, in-memory ring buffer); see the sketch after this list.
- Heartbeat — `ping`/`pong` every 30s. Detects dead connections faster than TCP keepalive (which can take minutes).
- Online/offline events — `window.addEventListener("online", reconnectImmediately)`.
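A sketch of the server-side replay buffer behind `since`, assuming a plain in-memory ring buffer per channel (in production this would be Redis Streams, a Kafka offset, or another durable log):

```ts
// Keeps the last N events per channel so a reconnecting client can catch up.
interface ChannelEvent { seq: number; payload: unknown }

const BUFFER_SIZE = 1000;
const buffers = new Map<string, ChannelEvent[]>();
const nextSeq = new Map<string, number>();

function publish(channel: string, payload: unknown): ChannelEvent {
  const seq = (nextSeq.get(channel) ?? 0) + 1;
  nextSeq.set(channel, seq);
  const buf = buffers.get(channel) ?? [];
  buf.push({ seq, payload });
  if (buf.length > BUFFER_SIZE) buf.shift(); // drop oldest beyond capacity
  buffers.set(channel, buf);
  return { seq, payload };
}

// On `subscribe { since }`: replay everything newer than the client's seq.
// If `since` has already fallen out of the buffer, the client must
// re-snapshot instead of pretending it caught up.
function replay(channel: string, since: number): ChannelEvent[] | "resync" {
  const buf = buffers.get(channel) ?? [];
  if (buf.length > 0 && buf[0].seq > since + 1) return "resync";
  return buf.filter((e) => e.seq > since);
}
```

Returning `"resync"` forces a fresh snapshot rather than silently skipping events.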
Catch-up: snapshot + delta
When reconnecting OR loading a new order page, do snapshot + subscribe in one round-trip:
```
client: GET /orders/abc123     → returns full state at seq=42
client: WS subscribe since=42  → server pushes events seq>42
```
Otherwise, events arriving while the snapshot was in flight may be lost or duplicated.
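Wired into the client above — the `/orders/:id` response carrying its own `seq` is an assumption about the API:

```ts
// Record the snapshot's seq before subscribing: replayed events ≤ seq are
// deduped, and events > seq apply cleanly on top of the snapshot.
async function trackOrder(
  id: string,
  client: RealtimeClient,
  onEvent: Listener,
) {
  const res = await fetch(`/orders/${id}`);
  const snapshot = await res.json(); // { seq: 42, status: ..., ... }
  client.lastSeq.set(`order:${id}`, snapshot.seq);
  client.subscribe(`order:${id}`, onEvent); // server replays seq > since
  return snapshot;
}
```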
State on the client
Two viable patterns:
1. TanStack Query as the cache.
```ts
const queryClient = useQueryClient();
const { data } = useQuery({ queryKey: ["order", id], queryFn: fetchOrder });
useRealtime(`order:${id}`, (event) => {
  // setQueryData takes an updater: fold the event into the cached order.
  queryClient.setQueryData(["order", id], (old) => applyEvent(old, event));
});
```
The cache holds the canonical state; subscriptions mutate it; the UI re-renders.
2. Custom reducer.
```ts
const [state, dispatch] = useReducer(orderReducer, initial);
// Assumes subscribe returns an unsubscribe function, so the effect
// cleans up the channel when `id` changes or the component unmounts.
useEffect(() => realtime.subscribe(`order:${id}`, dispatch), [id]);
```
Fits well when state is complex (multi-step status with derived UI).
Failure modes
- Server hard kill / deploy — clients reconnect to a new pod; pub/sub durability (Kafka, Redis Streams) ensures no events are lost.
- Network blip — backoff handles it; `since` resubscribe catches up.
- Stale tab — heartbeat detects it; reconnect refreshes data.
- Stale token — server closes the connection on token rotation; client refreshes auth, reconnects.
- CDN / corporate proxy strips WebSockets — fallback to long-polling or SSE.
Scale at the gateway
- One long-lived TCP connection per user, often per tab. 100k concurrent users = 100k sockets. A modern Node/Go gateway can do 100k+ per box.
- Sticky sessions — once a user connects to pod X, subsequent reconnects should hit pod X to reuse the same buffer. Use load-balancer sticky cookies; or decouple entirely — keep gateways stateless and let the pub/sub layer hold the events.
- Fan-out — order updates may fan out to thousands of subscribers (a popular live event). Use a hierarchical pub/sub; don't direct-publish from the order service to gateways.
- Backpressure — if a client is slow, the queue on the gateway grows. Drop or coalesce updates rather than OOMing; see the sketch below.
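A minimal coalescing sketch, using the `ws` package's `bufferedAmount` as the slow-client signal; the 64 KB threshold and flush interval are illustrative, not tuned values:

```ts
import { WebSocket } from "ws";

const SLOW_THRESHOLD = 64 * 1024;

class Connection {
  // channel -> latest frame; newer frames replace older ones while slow.
  private pending = new Map<string, string>();

  constructor(private socket: WebSocket) {
    // Cleared when the connection closes (omitted for brevity).
    setInterval(() => this.flush(), 1000);
  }

  push(channel: string, frame: string) {
    if (this.socket.bufferedAmount > SLOW_THRESHOLD) {
      this.pending.set(channel, frame); // coalesce instead of queueing all
      return;
    }
    this.socket.send(frame);
  }

  private flush() {
    if (this.socket.bufferedAmount > SLOW_THRESHOLD) return; // still slow
    for (const [channel, frame] of this.pending) {
      this.socket.send(frame);
      this.pending.delete(channel);
    }
  }
}
```

Coalescing creates `seq` gaps by design; a client that sees a gap falls back to the snapshot path rather than assuming it saw every intermediate state.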
UI
- Connection state indicator — connected, reconnecting, offline.
- Optimistic updates for outgoing actions (e.g., "cancel order") with rollback on server reject.
- Smooth animations between status changes — status moves are visually meaningful.
- Polling fallback — if the WebSocket fails 3× consecutively, fall back to GET /orders/:id every 30s; see the sketch below.
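A sketch of that fallback; `orderId` and `applySnapshot` are hypothetical names for your own state:

```ts
declare const orderId: string;
declare function applySnapshot(order: unknown): void; // merge into local state

let wsFailures = 0;
let pollTimer: ReturnType<typeof setInterval> | null = null;

// Call from the socket's close handler.
function onWsClose() {
  wsFailures += 1;
  if (wsFailures >= 3 && !pollTimer) {
    pollTimer = setInterval(async () => {
      const res = await fetch(`/orders/${orderId}`);
      applySnapshot(await res.json());
    }, 30_000);
  }
}

// Call from the socket's open handler: stop polling once realtime is back.
function onWsOpen() {
  wsFailures = 0;
  if (pollTimer) { clearInterval(pollTimer); pollTimer = null; }
}
```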
Build vs buy
| Need | Build | Buy |
|---|---|---|
| Internal tool, 1k concurrent users | Build (ws + Redis pub/sub) | Overkill |
| Customer-facing, 100k+ concurrent | Hard but possible | Pusher, Ably, Liveblocks, AWS IoT, Convex |
| Collaborative editing | Build is very hard | Yjs + Hocuspocus / Liveblocks |
Buy when realtime isn't your differentiator. Build when operating this infrastructure is part of your team's competence.
Senior framing. The interviewer wants: (1) transport choice with reason, (2) seq + snapshot for resumability, (3) reconnect with backoff + heartbeat, (4) pub/sub for fan-out, (5) graceful degradation. The "we use WebSockets" answer is shallow; the architecture above is senior.
Follow-up questions
- Why is the sequence number on the wire the most important detail?
- When would you pick SSE over WebSockets?
- How do you avoid losing events between snapshot and subscription?
- What's the scaling bottleneck — connections, fan-out, or pub/sub?
Common mistakes
- Not deduping events on reconnect → double-applied state changes.
- Reconnecting without `since` → missed events.
- Auth via query param → token leaks into logs.
- No heartbeat → silent dead connections.
Performance considerations
- Coalesce updates server-side — don't send a tick per pixel of progress.
- Use binary protocols (msgpack, protobuf) for high-rate channels.
- Per-tab connections multiplied across users can blow out gateway file-descriptor limits.
Edge cases
- Corporate proxies that buffer SSE — disable buffering with `X-Accel-Buffering: no`.
- Mobile Safari closes WebSockets in the background — reconnect on visibility change; see the sketch below.
- Browser tab throttled in background — events queue; deliver on focus.
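A sketch of the visibility and online hooks, reusing the `RealtimeClient` from the reconnection section (a production version would also cancel any pending backoff timer to avoid duplicate connections):

```ts
declare const client: RealtimeClient; // from the reconnection section

// Reconnect promptly when the tab becomes visible or the network returns,
// instead of waiting out the backoff timer.
document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "visible" && client.ws === null) {
    client.connect();
  }
});
window.addEventListener("online", () => {
  if (client.ws === null) client.connect();
});
```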
Real-world examples
- Uber and DoorDash live tracking. Stripe Connect updates. Linear's live sync. Slack message streams.