How do you integrate an AI API (like OpenAI or Claude) into a frontend app
Never call the AI API directly from the browser — proxy through your own backend so the API key stays secret. The backend handles auth, rate limiting, prompt construction, and streaming; the frontend streams the response, renders incrementally, and handles loading/errors/cancellation.
The single most important rule: never call the AI provider directly from the browser. Everything else follows from that.
1. Architecture — proxy through your backend
Browser → your backend (BFF) → OpenAI/Claude API- The API key must stay server-side. Putting it in frontend code or env-exposed-to-client = anyone can read it from the bundle/network and run up your bill. This is non-negotiable.
- Your backend endpoint also lets you: authenticate your users, enforce your rate limits and quotas, construct/validate prompts, sanitize inputs, log/monitor, cache, and swap providers without a frontend change.
2. Streaming — essential for UX
AI responses are slow (seconds) and generated token-by-token. Don't make the user stare at a spinner.
- The provider supports streaming (SSE / chunked responses). Your backend proxies the stream through to the client.
- Frontend consumes it —
ReadableStreamfromfetch, anEventSource, or the provider's SDK streaming helpers — and renders tokens incrementally as they arrive. - Show a typing/cursor indicator; let the UI update progressively.
3. Frontend state & UX
- States: idle, loading/streaming, success, error — and partial (mid-stream) content.
- Cancellation — an
AbortControllerso the user can stop a long generation (and you stop paying for it). - Optimistic display of the user's message; append the assistant's streamed response.
- Markdown rendering — model output is usually markdown; render it safely (sanitize — model output is untrusted; treat it like user content for XSS).
- Conversation state — message history managed client-side and sent with each request (or referenced by a server-side thread id).
4. Errors, limits, cost
- Handle rate limits (429) and provider errors gracefully — backoff, retry where safe, clear messaging. (See AI-specific rate-limit handling.)
- Timeouts for stuck generations.
- Cost control — token limits, max output length, debounce, prevent spam; enforce per-user quotas on your backend.
- Latency — set expectations in the UI; streaming makes it feel faster.
5. AI-specific concerns
- Hallucinations — don't present output as authoritative; cite sources where possible; let users verify/edit.
- Safety — moderate inputs/outputs if user-facing.
- Prompt injection — treat user input as untrusted in prompt construction; don't let it override system instructions.
- Nondeterminism — same input, different output; design UI and tests around that.
How to answer
"The key decision: proxy through my own backend, never call the provider from the browser — the API key stays secret, and the backend owns auth, rate limiting, prompt construction, and provider abstraction. I'd stream the response (SSE/chunked) and render tokens incrementally since AI latency is high, with cancellation via AbortController, loading/error/partial states, and safe markdown rendering of the (untrusted) model output. Plus AI-specific handling: 429s, cost/token limits, prompt-injection safety, and not presenting hallucination-prone output as authoritative."
Follow-up questions
- •Why can't you call the AI API directly from the browser?
- •How do you implement streaming responses end to end?
- •Why must you sanitize AI-generated markdown output?
- •How do you control cost when integrating an AI API?
Common mistakes
- •Calling the provider directly from the frontend, exposing the API key.
- •Not streaming — making users wait on a spinner for seconds.
- •Rendering model output as raw HTML without sanitization.
- •No cancellation, so users can't stop (and stop paying for) a generation.
- •Ignoring rate limits, cost controls, and prompt-injection risk.
Performance considerations
- •Streaming dramatically improves perceived performance — first token in ~1s vs whole response in ~10s. The backend proxy adds a hop but enables caching and abstraction. Token/length limits control both latency and cost.
Edge cases
- •Stream interrupted mid-response (partial content + retry).
- •Provider rate limit or outage.
- •Very long generations needing timeouts/cancellation.
- •Prompt injection via user input.
- •Nondeterministic output complicating testing.
Real-world examples
- •A chat feature: backend BFF proxying Claude/OpenAI with streaming, frontend rendering tokens into sanitized markdown with a stop button.