How do you integrate an AI API (like OpenAI or Claude) into a frontend app

Never call the AI API directly from the browser — proxy through your own backend so the API key stays secret. The backend handles auth, rate limiting, prompt construction, and streaming; the frontend streams the response, renders incrementally, and handles loading/errors/cancellation.

7 min read·~15 min to think through

The single most important rule: never call the AI provider directly from the browser. Everything else follows from that.

1. Architecture — proxy through your backend

Browser → your backend (BFF) → OpenAI/Claude API

The API key must stay server-side. Putting it in frontend code or env-exposed-to-client = anyone can read it from the bundle/network and run up your bill. This is non-negotiable.
Your backend endpoint also lets you: authenticate your users, enforce your rate limits and quotas, construct/validate prompts, sanitize inputs, log/monitor, cache, and swap providers without a frontend change.

2. Streaming — essential for UX

AI responses are slow (seconds) and generated token-by-token. Don't make the user stare at a spinner.

The provider supports streaming (SSE / chunked responses). Your backend proxies the stream through to the client.
Frontend consumes it — ReadableStream from fetch, an EventSource, or the provider's SDK streaming helpers — and renders tokens incrementally as they arrive.
Show a typing/cursor indicator; let the UI update progressively.

3. Frontend state & UX

States: idle, loading/streaming, success, error — and partial (mid-stream) content.
Cancellation — an AbortController so the user can stop a long generation (and you stop paying for it).
Optimistic display of the user's message; append the assistant's streamed response.
Markdown rendering — model output is usually markdown; render it safely (sanitize — model output is untrusted; treat it like user content for XSS).
Conversation state — message history managed client-side and sent with each request (or referenced by a server-side thread id).

4. Errors, limits, cost

Handle rate limits (429) and provider errors gracefully — backoff, retry where safe, clear messaging. (See AI-specific rate-limit handling.)
Timeouts for stuck generations.
Cost control — token limits, max output length, debounce, prevent spam; enforce per-user quotas on your backend.
Latency — set expectations in the UI; streaming makes it feel faster.

5. AI-specific concerns

Hallucinations — don't present output as authoritative; cite sources where possible; let users verify/edit.
Safety — moderate inputs/outputs if user-facing.
Prompt injection — treat user input as untrusted in prompt construction; don't let it override system instructions.
Nondeterminism — same input, different output; design UI and tests around that.

How to answer

"The key decision: proxy through my own backend, never call the provider from the browser — the API key stays secret, and the backend owns auth, rate limiting, prompt construction, and provider abstraction. I'd stream the response (SSE/chunked) and render tokens incrementally since AI latency is high, with cancellation via AbortController, loading/error/partial states, and safe markdown rendering of the (untrusted) model output. Plus AI-specific handling: 429s, cost/token limits, prompt-injection safety, and not presenting hallucination-prone output as authoritative."

Follow-up questions

•Why can't you call the AI API directly from the browser?
•How do you implement streaming responses end to end?
•Why must you sanitize AI-generated markdown output?
•How do you control cost when integrating an AI API?

Common mistakes

•Calling the provider directly from the frontend, exposing the API key.
•Not streaming — making users wait on a spinner for seconds.
•Rendering model output as raw HTML without sanitization.
•No cancellation, so users can't stop (and stop paying for) a generation.
•Ignoring rate limits, cost controls, and prompt-injection risk.

Performance considerations

•Streaming dramatically improves perceived performance — first token in ~1s vs whole response in ~10s. The backend proxy adds a hop but enables caching and abstraction. Token/length limits control both latency and cost.

Edge cases

•Stream interrupted mid-response (partial content + retry).
•Provider rate limit or outage.
•Very long generations needing timeouts/cancellation.
•Prompt injection via user input.
•Nondeterministic output complicating testing.

Real-world examples

•A chat feature: backend BFF proxying Claude/OpenAI with streaming, frontend rendering tokens into sanitized markdown with a stop button.

Senior engineer discussion

Seniors lead with the security boundary — proxy through a backend, key stays server-side — and treat that backend as the place for auth, rate limiting, prompt construction, and provider abstraction. They make streaming + incremental rendering + cancellation core, sanitize untrusted model output, and raise AI-specific concerns: cost, 429s, prompt injection, hallucinations, nondeterminism.