How do you stream AI responses to the UI in real time?
The API returns a streaming response (SSE or chunked fetch). Read the ReadableStream from response.body, decode chunks, parse the token deltas, and append to state as they arrive. Handle partial chunks, abort/cancel, errors mid-stream, auto-scroll, and a 'stop generating' control.
Streaming AI output is about consuming a response that arrives incrementally and rendering it as it comes — instead of waiting for the whole thing.
The transport
LLM APIs stream via Server-Sent Events (SSE) or chunked HTTP — the response body is a stream of small chunks, each carrying a token delta. On the client you read it with the fetch + ReadableStream API:
async function streamChat(messages, setMessage, signal) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
    signal, // an AbortController's signal, for cancellation
  });
  if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // { stream: true } buffers partial multi-byte UTF-8 sequences across chunks
    buffer += decoder.decode(value, { stream: true });
    // SSE: events are separated by \n\n; data lines start with "data: "
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep the last (possibly partial) line for the next read
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = line.slice(6);
      if (data === "[DONE]") return;
      const delta = JSON.parse(data).choices?.[0]?.delta?.content ?? "";
      if (delta) setMessage((m) => m + delta); // append the token to state
    }
  }
}
The details interviewers grade
- Partial chunks — a network chunk does not align to event/token boundaries. You must buffer and only parse complete lines/events, carrying the leftover partial into the next read. Forgetting this corrupts the output.
- TextDecoder with { stream: true } — so multi-byte UTF-8 characters split across chunks decode correctly.
- Cancellation — wire an AbortController to a "Stop generating" button; abort the fetch to halt the stream (see the sketch after this list).
- Errors mid-stream — the stream can fail after it has started. Catch the error, show what arrived plus an error message, and allow retry.
- Rendering performance — appending on every token causes a render per token. For fast streams, batch updates (e.g. requestAnimationFrame or a small buffer) so you don't thrash. Memoize already-rendered markdown.
- UX — auto-scroll to follow the output (but stop if the user scrolls up), a typing cursor, disabled input while streaming.
- Markdown — render incrementally; handle incomplete markdown/code fences gracefully.
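Putting cancellation and mid-stream error handling together, here is a minimal React sketch. It assumes the streamChat helper above; the hook name and its return shape are illustrative, not any library's API.

// Minimal sketch: an AbortController wired to a Stop button, plus
// mid-stream error handling. Assumes React and streamChat() from above.
import { useRef, useState } from "react";

export function useChatStream() {
  const [message, setMessage] = useState("");
  const [error, setError] = useState(null);
  const controllerRef = useRef(null);

  async function start(messages) {
    const controller = new AbortController();
    controllerRef.current = controller;
    setMessage("");
    setError(null);
    try {
      await streamChat(messages, setMessage, controller.signal);
    } catch (err) {
      // An abort is the user pressing Stop, not a failure
      if (err.name === "AbortError") return;
      // Keep the partial text already rendered; surface the error, allow retry
      setError("Generation failed partway through.");
    }
  }

  const stop = () => controllerRef.current?.abort(); // "Stop generating" onClick

  return { message, error, start, stop };
}

Note that aborting rejects the in-flight fetch with an AbortError, which the hook deliberately swallows: the user stopping generation is not a failure, and whatever text already arrived stays on screen.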
Why stream at all?
Time-to-first-token is far shorter than time-to-full-response — streaming makes the app feel responsive, lets users start reading while the rest generates, and lets them stop early. It's a perceived-performance win.
The framing
"The API sends a streaming response — SSE or chunked — so on the client I read response.body as a ReadableStream, decode each chunk with a streaming TextDecoder, and append token deltas to state as they arrive. The non-obvious parts: network chunks don't align to token boundaries, so I buffer and only parse complete events; I wire an AbortController to a stop button; I handle mid-stream errors; and I batch renders so a fast token stream doesn't cause a render per token. Plus UX — auto-scroll, typing cursor, disabled input while generating."
Follow-up questions
- Why do you need to buffer partial chunks?
- How do you implement a 'stop generating' button?
- How do you avoid a re-render on every single token?
- What happens if the stream errors halfway through?
Common mistakes
- Assuming each network chunk is a complete token or event.
- Not using TextDecoder's stream option — breaking multi-byte characters.
- Re-rendering on every token with no batching — UI jank.
- No cancellation, so users can't stop a long generation.
- Not handling errors that occur after the stream has started.
Performance considerations
- Appending state per token can cause hundreds of renders — batch with rAF or a small buffer (see the sketch after this list).
- Memoize already-rendered markdown so only the streaming tail re-parses.
- Throttle auto-scroll work.
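One possible shape for the batching: accumulate deltas and flush once per animation frame. The appendDelta helper is illustrative; in the parsing loop above you would call it instead of setMessage directly.

// Sketch of rAF batching: a fast stream causes at most ~60 state
// updates per second instead of one per token.
let pending = "";
let scheduled = false;

function appendDelta(delta, setMessage) {
  pending += delta;
  if (scheduled) return;
  scheduled = true;
  requestAnimationFrame(() => {
    const chunk = pending;
    pending = "";
    scheduled = false;
    setMessage((m) => m + chunk); // one render for the whole frame's tokens
  });
}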
Edge cases
- A token/event split across two network chunks.
- A multi-byte UTF-8 character split across chunks (see the snippet after this list).
- Stream errors or connection drops mid-response.
- User navigates away while streaming — must abort.
- Incomplete markdown/code fence at the current cursor.
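To see why the TextDecoder stream option matters, here is what happens when a two-byte character arrives split across chunks:

// "é" is two UTF-8 bytes (0xC3 0xA9) and can arrive split across chunks.
const chunk1 = new Uint8Array([0xc3]); // first byte of "é"
const chunk2 = new Uint8Array([0xa9]); // second byte

// Decoding each chunk independently corrupts the character:
new TextDecoder().decode(chunk1); // "\uFFFD" (replacement character)

// A single decoder in streaming mode buffers the partial sequence:
const decoder = new TextDecoder();
decoder.decode(chunk1, { stream: true }); // ""
decoder.decode(chunk2, { stream: true }); // "é"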
Real-world examples
- ChatGPT/Claude-style chat UIs rendering tokens as they generate.
- AI code assistants streaming completions into an editor.