What I Learned Building Production LLM Applications

April 1, 2026 · 8 min read
AI · LLM · Production · Python

Building with LLMs is deceptively easy to start and fiendishly hard to finish. The gap between a 20-line demo and a production system is larger than almost any other area of software I've worked in.

Here's what I've learned from shipping three real LLM-powered features over the past year.

The Prototype Trap

Every LLM application starts with a script that feels like magic:

import anthropic

# Reads the ANTHROPIC_API_KEY environment variable by default
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarise this document for me"}]
)

# content is a list of blocks; the first block holds the generated text
print(response.content[0].text)

This works. It works beautifully. And then you try to turn it into something real, and everything falls apart.

The hard parts are not the API call. They are:

  • Consistency — getting the same quality of output reliably, not just on your test cases
  • Cost — controlling token usage at scale without degrading quality
  • Observability — knowing why a response was bad when one inevitably is
  • Latency — users expect fast responses, even for complex tasks

The prototype proves the idea works. The engineering work begins after the prototype.

Prompt Versioning Matters More Than You Think

Prompts are code. Version them like code. The worst mistake I made early on was editing prompts in-place without any tracking — within two weeks I had no idea which version of a prompt had caused a quality regression I was investigating.

My current setup: prompts live in a prompts/ directory as .txt files, checked into git. A thin wrapper reads them at startup (sketched after the list below). This means:

  • Git blame tells me who changed what and when
  • I can roll back a prompt change independently of code changes
  • Reviewing prompt changes in PRs becomes a normal part of the workflow
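
A minimal sketch of that wrapper; the load_prompts helper and the "summarise_document" prompt name are illustrative, not my exact code:

from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompts() -> dict[str, str]:
    """Read every .txt file in prompts/ into a name -> text map."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in PROMPT_DIR.glob("*.txt")
    }

# Loaded once at startup, so the running prompts always trace to a git commit
PROMPTS = load_prompts()
summary_prompt = PROMPTS["summarise_document"]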

Rate Limiting Is Not Optional

If you're exposing an LLM feature to users, you will get abuse. Whether it's a curious developer hammering the endpoint or someone trying to drain your API budget, rate limiting is essential from day one.

import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, "1 h"), // 20 requests per hour per key
  analytics: true,
});

// `ip` is whatever you key limits on: client IP, user ID, API key, etc.
const { success } = await ratelimit.limit(ip);
if (!success) {
  return new Response("Too many requests", { status: 429 });
}

Use a distributed store (Upstash Redis, not in-memory) so limits work across serverless instances.

Stream Everything

Users hate staring at a blank screen while a full response is generated. Streaming — even for responses that complete in 2-3 seconds — dramatically improves perceived performance.

The Anthropic SDK makes streaming straightforward; wrap the call in a generator and yield text deltas as they arrive:

def stream_summary():
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    ) as stream:
        # text_stream yields incremental text as the model produces it
        for text in stream.text_stream:
            yield text

On the frontend, consume Server-Sent Events and update the UI token by token. The effect feels like the model is "thinking" in real time, which users find engaging rather than frustrating.
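
Serving that stream as SSE from Python is a thin layer on top. Here is a sketch assuming FastAPI (my choice for illustration, not a requirement), reusing the stream_summary generator from above:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/summarise")
def summarise():
    def event_stream():
        for text in stream_summary():
            # Each SSE frame is "data: <payload>\n\n"; production code
            # should escape newlines inside the payload
            yield f"data: {text}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")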

Closing Thoughts

The models are now good enough that quality is rarely the bottleneck. The bottleneck is infrastructure — making LLM features reliable, observable, and cost-efficient at scale.

Treat your prompts as a first-class engineering artefact, measure everything, and always have a fallback. The rest is iteration.
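
On that last point, a fallback can be as simple as catching API errors and degrading to a cheaper model or a canned message. A minimal sketch; the complete helper and the fallback model name are illustrative:

import anthropic

client = anthropic.Anthropic()

def complete(messages, max_tokens=1024):
    # Try models primary-first; names here are illustrative
    for model in ("claude-sonnet-4-6", "claude-haiku-4-5"):
        try:
            response = client.messages.create(
                model=model, max_tokens=max_tokens, messages=messages
            )
            return response.content[0].text
        except anthropic.APIError:
            continue  # overloaded or rate limited; try the next model
    return "Sorry, this feature is temporarily unavailable."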

Written by Pranay Gupta