Over the past year, we have seen the term AI Engineer used frequently. It's tempting to think it's just a fancier title for someone writing prompts. But if you have built a real LLM-powered system that ships to users, you know how quickly things go from clever to chaotic.
This post is a reflection on what I have learned about the AI Engineer role, not from hype cycles, but from building and shipping LLM-powered systems. It’s about the day-to-day engineering muscle it takes to turn an idea into something resilient, useful, and observable. Along the way, I have come to see this as a new kind of engineering discipline, one that blends software fundamentals with a feel for probabilistic interfaces, product intuition, backend systems, and user trust.
How AI Engineering Differs from ML Engineering
Traditional ML follows a Data → Model → Product pipeline. AI engineering reverses this: Product → Data → Model. You start with the product experience and work backwards.
The focus shifts from model training to system composition. The work is about assembling, prompting, evaluating, and routing models, not fine-tuning them. This is similar to how backend engineers compose microservices rather than building monoliths, but with the added complexity of probabilistic outputs.
It also involves tighter feedback loops: you iterate rapidly based on user interactions, not just batch metrics. This is closer to web development's A/B testing culture than traditional ML's training cycles.
Unlike traditional software engineering, where a function either works or doesn't, AI engineering operates in a permanent gray area. Your system might work perfectly for 95% of cases, degrade gracefully for 4%, and catastrophically fail for 1%, and that 1% might be the cases your users care about most.
This creates three core challenges that define AI Engineering:
Reliability: Users expect deterministic experiences from probabilistic systems.
Economics: Every interaction costs real money, and costs scale unpredictably.
Observability: Traditional monitoring tells you almost nothing about why an LLM failed.
Prompt Engineering
At the start, a good prompt feels like a clever trick. You tweak a few words, get better output, and move on. But once your system hits production, with real users, edge cases, and expectations, prompts stop being experiments. They become product-critical interfaces and need to be treated the same way as code.
And like any code, prompts need to be versioned, tested, observed, and maintained.
Here is what that looks like in practice:
Templating and Reusability: Writing prompts by hand doesn’t scale. Templating systems like Jinja2 can be used to compose reusable blocks of personality and rules into modular prompt templates. This allows for separation of concerns, avoids duplication, and supports scalable experimentation. A small change in tone or context structure shouldn’t require rewriting 20 different variants. This keeps your prompts DRY and modular (see the sketch after this list).
Version Control and Change Tracking: Prompt templates can live in source control, often alongside application code. Every change is tracked with diffs, and changes undergo review like any other pull request. This helps in maintaining changelogs so we know when a prompt changed, why, and what its impact was. This makes it easier to trace regressions from user-reported bugs.
Few-shot and Zero-shot Examples: Few-shot prompting often improves instruction following, but the examples need to be diverse and kept current. They require ongoing maintenance, as they can go stale when product requirements change or new failure modes emerge.
Snapshot Testing: Every prompt version is snapshot-tested against a fixed set of inputs to ensure behavioral stability. This means generating outputs using the new prompt, comparing them against expected outputs (or embeddings of them), and flagging deviations beyond a defined threshold. This catches subtle drifts early.
Prompt Versioning: Multiple versions of a prompt can be managed via feature flags or routing rules. This allows A/B testing, gradual rollout, or fast rollback if a new version causes regressions.
When I started building applications with LLMs, prompts were duplicated in multiple places, and maintenance became an overhead as the system scaled; that’s where the practices above helped.
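To make the templating point concrete, here is a minimal sketch using Jinja2 with an in-memory loader. The template names, persona text, and variables are hypothetical placeholders, not a prescribed structure; in a real codebase the blocks would typically live as files under version control.

```python
from jinja2 import DictLoader, Environment

# Hypothetical prompt blocks kept as small, reusable templates.
TEMPLATES = {
    "persona.j2": "You are a concise support assistant for {{ product_name }}.",
    "rules.j2": "Answer in {{ language }}. If you are unsure, say so instead of guessing.",
    "answer_question.j2": (
        "{% include 'persona.j2' %}\n"
        "{% include 'rules.j2' %}\n\n"
        "Context:\n{{ context }}\n\n"
        "User question: {{ question }}"
    ),
}

env = Environment(loader=DictLoader(TEMPLATES))

def render_prompt(context: str, question: str) -> str:
    """Compose the final prompt from the shared persona and rules blocks."""
    return env.get_template("answer_question.j2").render(
        product_name="Acme Docs",
        language="English",
        context=context,
        question=question,
    )
```

With this structure, a change in tone or rules touches one block instead of every prompt variant.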
Context Length
LLMs have a strict context window, typically 8k, 32k, or 128k tokens depending on the model. Theoretically, that sounds like a lot, but in practice, it runs out fast. Once we start adding system prompts, user history, retrieved documents, task-specific inputs, and metadata, we might be racing the token limit.
With limited tokens, what you include and exclude matters. Much like overfitting and underfitting in traditional machine learning (ML), choosing the right context for a given task is important, as more isn't always better.
Here is what that looks like in practice:
Identifying What Context Matters: Not all tokens are equally important. Understanding which parts of the prompt directly influence the model’s output and which parts are noise is important. This comes down to experimenting: removing one instruction might degrade the answer, while trimming three paragraphs of user history does nothing.
Retrieval Scoring and Ranking: If we are using a system like RAG, we might retrieve more than the context window can fit. Ranking retrieved passages by semantic relevance (for example, cosine similarity) and keeping only the top results helps avoid the token-stuffing problem.
Context Truncation Strategies: If the limit is hit, there are two approaches: hard truncation at the token limit, or truncation along semantic boundaries so a paragraph isn’t cut in the middle.
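Here is a minimal sketch of the ranking-plus-truncation idea, assuming chunks have already been embedded and that roughly four characters equal one token; both the embedding inputs and the token estimate are stand-ins for whatever model and tokenizer you actually use.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_context(query_vec: np.ndarray,
                   chunks: list[tuple[str, np.ndarray]],
                   token_budget: int) -> list[str]:
    """Rank chunks by similarity to the query, then pack whole chunks
    into the token budget so no paragraph gets cut in the middle."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    selected, used = [], 0
    for text, _ in ranked:
        cost = len(text) // 4  # rough token estimate; swap in a real tokenizer
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return selected
```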
RAG Systems
The core idea is to take a user query, retrieve relevant information from a custom knowledge base, and include that context in the prompt.
Here’s what typically goes into building a RAG setup:
Choosing the Right Embedding Model: Embedding quality has a big impact on which documents get retrieved, which in turn drives the AI system’s performance, so choosing the right model matters. It is also crucial to use the same embedding model for the user query and for the underlying documents; otherwise, the similarity scores aren’t meaningful.
Index Management: As the underlying data changes, embeddings and indexes need periodic updates so the LLM keeps receiving the right context.
Retrieval & Chunking: A simple vector search based on cosine similarity is a strong baseline, but results can be refined with re-ranking models or keyword scoring to boost precision. Documents can be chunked by token count or along semantic boundaries before indexing.
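Here is a rough sketch of the chunk-embed-retrieve loop described above. The embed callable is a placeholder for whatever embedding client you use (the key point from earlier: the same one for documents and queries), and the paragraph-merging chunker is a naive stand-in for real semantic chunking.

```python
import numpy as np

def chunk_by_paragraph(document: str, max_chars: int = 1200) -> list[str]:
    """Naive semantic chunking: split on blank lines, then merge paragraphs
    until a chunk approaches max_chars."""
    chunks, current = [], ""
    for para in document.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def build_index(documents: list[str], embed) -> list[tuple[str, np.ndarray]]:
    """Embed every chunk with the same model that will embed queries."""
    chunks = [c for doc in documents for c in chunk_by_paragraph(doc)]
    return list(zip(chunks, embed(chunks)))

def retrieve(query: str, index, embed, top_k: int = 5) -> list[str]:
    # Assumes embeddings are L2-normalised, so dot product equals cosine similarity.
    q = embed([query])[0]
    scored = sorted(index, key=lambda item: float(np.dot(q, item[1])), reverse=True)
    return [text for text, _ in scored[:top_k]]
```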
Evaluation
Evaluating LLM systems is less about asserting the one right answer, since outputs are usually open-ended, and more about understanding behavior. Unlike traditional software, where a test case either passes or fails, LLMs operate in a wide spectrum where responses can be technically correct but unhelpful, or partially wrong yet directionally useful.
Because of this, evaluating an LLM is usually a combination of automated metrics, LLM-as-judge scoring, and human feedback to get a clearer picture of quality and drift over time. Evals for LLM systems are still an evolving space.
Here is what that looks like in practice:
Golden Datasets: These are curated sets of inputs with high-quality expected outputs, often reviewed or written by humans or gathered from users using the product. They act as ground truth baselines for evaluating new models, prompt versions, or any config updates. Updating this set periodically ensures it stays representative of real-world usage.
LLMs as Judges: In many cases, a secondary model can be used to evaluate primary model outputs by asking questions like: “Does this follow the instruction?”, “Is it factually accurate?”, or “How helpful is this response?” LLM-as-judge setups are not perfect, but they offer fast and scalable assessments when paired with clear criteria and examples. Be aware that using one LLM to evaluate another introduces systematic biases (toward longer responses, particular styles, and confident wording) that can invalidate your entire evaluation pipeline.
Semantic Similarity: Metrics like cosine similarity can be used to compare outputs against the high-quality outputs from the golden dataset. These are useful in regression tests, where we want to confirm that behavior hasn’t unintentionally drifted, even if wording has changed.
While the three approaches above sound efficient, they are still evolving, and automating LLM evaluation remains a challenging technical problem.
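As an illustration of the semantic-similarity regression check, here is a minimal sketch. The generate and embed callables and the 0.85 threshold are assumptions; plug in your own model client and tune the cutoff against your golden set.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # assumed cutoff; tune against your own golden set

def regression_check(golden_cases: list[dict], generate, embed) -> list[dict]:
    """Compare new outputs against golden outputs by embedding similarity
    and flag anything that drifts below the threshold."""
    failures = []
    for case in golden_cases:
        new_output = generate(case["input"])
        expected_vec, new_vec = embed(case["expected_output"]), embed(new_output)
        score = float(np.dot(expected_vec, new_vec)
                      / (np.linalg.norm(expected_vec) * np.linalg.norm(new_vec)))
        if score < SIMILARITY_THRESHOLD:
            failures.append({"input": case["input"], "score": score, "output": new_output})
    return failures
```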
Below are two practices that can be adopted from the start:
Human-in-the-loop: You will want humans reviewing model outputs regularly, especially on critical user flows or ambiguous completions. This helps catch nuanced issues like tone, missed edge cases, or unexpected model shifts. It also sometimes means running the same prompt 10-20 times with identical inputs to make sure the results are consistently good.
User-Tagged Feedback Loops: Collecting feedback directly from users, whether through thumbs up/down or feedback comments, is one of the best ways to understand real-world quality. This feedback can be used to enrich eval datasets and train internal judge models to align closer with user expectations.
Qualitative and Quantitative Metrics
Once an LLM system is deployed, measuring quality isn’t as simple as tracking accuracy or latency. Two answers might look very different, yet both might be correct. Or a response could be factually accurate but completely miss the user’s intent.
This is why both quantitative and qualitative metrics matter.
Here’s how this typically looks in practice:
Quantitative Metrics: These are the numbers you can graph and track over time. Common examples include:
Overall Latency: time per inference, useful for tracking user experience.
Time to first token (TTFT): the time a user has to wait before they see the first word/character/token.
Token usage: input/output tokens per call, helpful for understanding cost implications.
Instruction adherence: measured via LLM judge scores or simple heuristics.
User Feedback: a simple signal, such as thumbs up/down, for tracking user sentiment over time.
Qualitative Metrics: This is where human insight comes in. These include understanding whether the model made something up, whether the responses were helpful, and whether they match users’ expectations.
Feedback Integration: One approach that helps is merging the two loops above: a drop in instruction adherence or user feedback (quantitative) can trigger a human review (qualitative); upvoted completions from users can be added to golden datasets or used to fine-tune LLM-as-judge prompts; and repeated “too verbose” tags can inform prompt compression or instruction tweaks.
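One way to ground both loops is to record the quantitative numbers on every call, with a slot for user feedback that can later trigger qualitative review. The sketch below assumes a hypothetical stream_fn that yields text chunks; the token estimate is a rough placeholder for a real tokenizer.

```python
import time
from dataclasses import dataclass

@dataclass
class CallMetrics:
    latency_ms: float              # overall time per inference
    ttft_ms: float                 # time to first token
    input_tokens: int
    output_tokens: int
    user_feedback: int | None = None  # e.g. +1 / -1 from thumbs up/down

def timed_completion(stream_fn, prompt: str, input_tokens: int) -> tuple[str, CallMetrics]:
    """Wrap a streaming call and record latency, TTFT, and token usage."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for chunk in stream_fn(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        pieces.append(chunk)
    end = time.perf_counter()
    text = "".join(pieces)
    return text, CallMetrics(
        latency_ms=(end - start) * 1000,
        ttft_ms=((first_token_at or end) - start) * 1000,
        input_tokens=input_tokens,
        output_tokens=len(text) // 4,  # rough estimate; use a real tokenizer in practice
    )
```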
Model Selection
Choosing the right model is like choosing the right tech stack. There is no one-size-fits-all.
Here’s how this looks in practice:
Matching model to task: Different models have different strengths. Some are better at structured tasks like classification or code generation, others handle open-ended reasoning more effectively, and some are optimized for multimodal inputs like images. For one of the features I was building, GPT-4 was hallucinating with JSON responses, and moving to GPT-4o with structured outputs improved the accuracy. Understanding these strengths and doing small-scale evaluations can help you match the model to the specific job at hand.
Token limits and cost trade-offs: Each model comes with constraints on input/output tokens, rate limits, and cost per token. These impact both latency and scalability. For example, a model with high accuracy but a 10x cost may not make sense for a high-throughput endpoint. Production choices often come down to balancing capability with cost per call, rate limits, and total monthly budget.
Fallback and multi-provider strategies: Relying on a single model or provider can be risky. Network issues, API outages, or pricing changes can all affect uptime. Many robust systems are built with feature flags and routing layers that allow calls to be dynamically shifted between providers depending on availability.
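A minimal sketch of such a fallback chain, assuming each provider is wrapped in a plain callable that takes a prompt and returns text; the names, error handling, and ordering are placeholders that would normally come from your feature flags or routing rules.

```python
import logging
from typing import Callable

log = logging.getLogger("llm_router")

def complete_with_fallback(prompt: str,
                           providers: list[tuple[str, Callable[[str], str]]]) -> str:
    """Try each (name, call_fn) provider in order and fall back on failure."""
    last_error = None
    for name, call_fn in providers:
        try:
            return call_fn(prompt)
        except Exception as exc:  # in practice, catch provider-specific errors
            log.warning("provider %s failed: %s; falling back", name, exc)
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```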
Debugging Hallucinations
Traditional debugging assumes you can trace through your code to understand what went wrong. LLM debugging often feels like detective work.
Even when everything looks fine, LLMs sometimes say things that are just wrong. Debugging these cases is about understanding where the pipeline broke down.
Here are some of the things I tried:
Tracing the full input stack: The first step is always to capture the full input: the system prompt, the retrieved context, and the user input. Without this, debugging is just guesswork. Going through the invocation history in sequence has helped me a lot in debugging issues.
Investigating retrieval mismatches: Sometimes the model hallucinates simply because it wasn’t shown the right context. This can happen because of a badly formed query or incorrect retrieval. Debugging retrieval involves logging and validating whether the correct document was even eligible to be retrieved.
Session Replay tooling: Unlike traditional bugs, LLM failures are often non-deterministic. The same input can produce different outputs, making reproduction and fixing extremely difficult. Tooling that can replay a request with a different prompt version, model choice, or context configuration, and show the user’s exact request, is invaluable.
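Here is a sketch of the kind of trace record that makes replay possible: one structured row per call, capturing everything needed to re-run it. The field names and JSONL sink are assumptions; in production this would feed your tracing backend.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class TraceRecord:
    """Everything needed to replay a single LLM call later."""
    request_id: str
    timestamp: str
    prompt_version: str
    model: str
    system_prompt: str
    retrieved_context: list[str]
    user_input: str
    output: str

def log_trace(record: TraceRecord, path: str = "llm_traces.jsonl") -> None:
    # Append-only JSONL log; swap for your observability stack in production.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def new_trace(prompt_version: str, model: str, system_prompt: str,
              retrieved_context: list[str], user_input: str, output: str) -> TraceRecord:
    return TraceRecord(
        request_id=str(uuid.uuid4()),
        timestamp=datetime.now(timezone.utc).isoformat(),
        prompt_version=prompt_version,
        model=model,
        system_prompt=system_prompt,
        retrieved_context=retrieved_context,
        user_input=user_input,
        output=output,
    )
```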
Bringing It All Together
Building AI systems that go beyond prototypes means thinking holistically, from how prompts are designed to how systems are deployed, evaluated, and maintained in production. Working with LLMs is inherently non-deterministic, and one of the biggest wins is getting comfortable with that.
System Design Considerations: Just like with microservices, designing LLM systems involves making trade-offs between latency, context size, caching, and API dependencies. For example, whether to precompute embeddings or compute on the fly affects the response time.
Being cost-aware: Cost is part of system design. Token-heavy prompts, unnecessarily large context windows, or routing all requests to the most expensive model can tank efficiency and need to be considered before the system goes live.
Observability and Debuggability: Every user interaction with an LLM should be traceable to a specific prompt version, context snapshot, model, and input. Replay tooling and structured logs make it possible to debug hallucinations or sudden behavior changes. That said, the observability stack for AI systems is still evolving.
Feedback Loops: Qualitative user feedback, whether explicit thumbs down or implicit drop-offs, should inform prompt tweaks, retrieval tuning, and even product decisions.
Canary Deployments: New prompts or model configurations are first rolled out to a small percentage of users. This helps detect regressions or unintended behaviors before full release.
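A canary rollout can be as simple as deterministic user bucketing, as in the sketch below; the version names and the 5% figure are placeholders.

```python
import hashlib

CANARY_PERCENT = 5  # assumed rollout percentage

def prompt_version_for(user_id: str) -> str:
    """Deterministically bucket users so the same user always sees the
    same prompt version during a canary rollout."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "prompt_v2" if bucket < CANARY_PERCENT else "prompt_v1"
```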
Building LLM systems that go beyond prototypes requires the same engineering discipline as any production system, just adapted for probabilistic outputs. The difference between a clever demo and a reliable product isn't the underlying model; it's the engineering practices around it. Your 1% failure rate might be your users' most important use cases. AI engineering is still a young discipline, but the principles are becoming clear: build AI systems that users trust and depend on, not just impressive demos that break in production. The tools will keep evolving, but the engineering mindset remains constant: build systems that work reliably, cost-effectively, and transparently, even when the underlying technology is inherently uncertain.