Don't Take Context for Granted
Two years ago, when we shipped the first version of HackerRank’s AI-powered Mock Interviewer, everyone was talking about prompt engineering. “Just write better prompts,” they said. “Give the model clear instructions.”
But it felt like solving a different problem, one that did not have a name yet.
The largest available context window at that time was 32K tokens. We started with an 8K model and, within weeks, hit our first context length errors. The obvious fix was to switch to a bigger model when conversations exceeded 7K tokens. But as I debugged why we were even hitting those limits, I realized something important: the problem wasn’t the size of the context window. It was what we were putting in it.
We had to engineer not just the prompts but the context itself: which information to include, when to include it, how to structure it, what to summarize, and what to discard. This was a different discipline entirely, one that would become the foundation of every scalable AI application I have built since.
Back then, nobody called it “context engineering.” Now, it is the difference between a demo and a production system.
Context Window Illusion
Context feels infinite when you are processing single documents or short conversations. Production changes that illusion very quickly.
Here is what large context windows do not protect us from.
Cost explosion
Every token has a cost, on both input and output. A 100K-token context on every request is not “using the available capacity.” It is spending the budget on information that often does not contribute to better outcomes.
Latency degradation
Latency does not scale linearly with context length. A tenfold increase in context does not mean a tenfold increase in latency; it is often worse. In production systems, users are unlikely to tolerate long response times simply because the context was not curated deliberately.
Quality deterioration
This is the most subtle failure mode. Models perform worse when the relevant information is buried in irrelevant or weakly related context. As the number of tokens in the context window increases, the model’s ability to accurately recall and reason over specific information decreases.
This matches how LLMs work. Transformer-based architectures allow every token to attend to every other token in the context, resulting in n² pairwise relationships for n tokens. As context grows, attention is spread thinner. You can think of this as a fixed attention budget that gets depleted with each additional token.
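As a rough, back-of-envelope illustration of that quadratic growth (ignoring optimizations such as attention sparsity or caching):

```python
# Pairwise attention interactions grow quadratically with context length,
# so a 10x longer context means roughly 100x more pairs per layer.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {n * n:>18,} pairwise interactions per layer")
```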
State management chaos
When context grows without bounds, it becomes difficult to reason about application state. What information does the model currently have access to? What shaped its previous response? Without clear boundaries, these questions become hard to answer.
Why Context Engineering Isn’t Prompt Engineering
In the early days, much of the industry’s attention was on prompts. There were countless blog posts on few-shot examples, chain-of-thought reasoning, and role-based prompting. These techniques are useful, but they address only part of the problem.
Prompt engineering asks: How do I instruct the model?
Context engineering asks: What information does the model need, and when?
Consider a code review assistant.
A prompt might say:
“You are an expert code reviewer. Identify security vulnerabilities, performance issues, and maintainability concerns.”
Context engineering determines everything around that prompt:
Do we send the entire pull request or only the changed files?
Do we include full files or only the modified sections with surrounding context?
When do we inject relevant documentation or coding practices?
How do we maintain state across multiple files?
What information from earlier reviews should influence later ones?
The prompt remains relatively stable. The context is dynamic, stateful, and intentionally scoped.
As AI applications evolve from single-turn interactions to long-running, multi-turn systems, context engineering becomes increasingly central.
The Three Pillars of Context Engineering
After spending a few years building production AI systems, I have found that reliable applications tend to rest on three context-engineering pillars.
Summarization: Compressing History Without Losing Meaning
A common approach to conversation history is simple accumulation: append every message to the context. This works initially, and then fails abruptly.
Effective summarization is not truncation. It is signal preservation.
For a code review assistant handling a pull request with dozens of files, the LLM does not need verbatim memory of every earlier file. What matters is:
Architectural patterns in use
Security issues identified across files
Repeated coding-standard violations
Shared dependencies that were modified
Bugs that may cause cross-file breakage
Maintaining a running summary that is updated as the review progresses keeps the active context focused while preserving coherence.
History describes what happened.
Context captures what matters now.
For long-running tasks that approach context limits, summarizing and restarting with a distilled representation is often more reliable than continuing to append raw history. Many modern tools already apply this pattern implicitly.
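Here is a minimal sketch of that running-summary pattern. The `call_llm` callable, the prompt wording, and the token budget are illustrative assumptions, and token counting is a crude word-count heuristic rather than a real tokenizer:

```python
# Running-summary sketch: fold older findings into a distilled summary
# once the raw history exceeds a budget, and keep only recent detail verbatim.
from dataclasses import dataclass, field
from typing import Callable

SUMMARY_PROMPT = (
    "Update the review summary with the new findings. Keep only architectural "
    "patterns, security issues, repeated standard violations, modified shared "
    "dependencies, and potential cross-file breakage.\n\n"
    "Current summary:\n{summary}\n\nNew findings:\n{findings}"
)

def rough_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)  # crude heuristic, not a real tokenizer

@dataclass
class ReviewMemory:
    call_llm: Callable[[str], str]                    # stand-in for your chat client
    budget: int = 4_000                               # verbatim-history budget (illustrative)
    summary: str = ""                                 # distilled state that persists
    recent: list[str] = field(default_factory=list)   # raw recent findings

    def add(self, finding: str) -> None:
        self.recent.append(finding)
        if sum(rough_tokens(f) for f in self.recent) > self.budget:
            # Fold older raw findings into the running summary; keep the newest raw.
            older, self.recent = self.recent[:-3], self.recent[-3:]
            self.summary = self.call_llm(
                SUMMARY_PROMPT.format(summary=self.summary, findings="\n".join(older))
            )

    def context(self) -> str:
        # What actually reaches the model: distilled summary plus recent detail.
        return f"Summary so far:\n{self.summary}\n\nRecent findings:\n" + "\n".join(self.recent)
```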
Chunking: Loading Only What Is Relevant
A code review assistant does not need the entire codebase in context for every decision.
Instead:
Parse the pull request into changed files and diff hunks
Index chunks with metadata such as file paths and change types
Retrieve only the chunks relevant to the current review focus
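A minimal sketch of that index-and-retrieve flow, assuming the diff has already been parsed into hunks; the keyword-overlap scoring here is a stand-in for embedding-based retrieval:

```python
# Chunk-and-retrieve sketch: index diff hunks with metadata, then pull only
# the chunks relevant to the current review focus into context.
from dataclasses import dataclass

@dataclass
class Chunk:
    file_path: str
    change_type: str   # "added" | "modified" | "deleted"
    text: str

def build_index(hunks: list[dict]) -> list[Chunk]:
    # Hunk dicts with "path", "change_type", and "diff" keys are assumed here.
    return [Chunk(h["path"], h["change_type"], h["diff"]) for h in hunks]

def retrieve(index: list[Chunk], focus: str, k: int = 5) -> list[Chunk]:
    focus_terms = set(focus.lower().split())
    def score(chunk: Chunk) -> int:
        return len(focus_terms & set(chunk.text.lower().split()))
    return sorted(index, key=score, reverse=True)[:k]

# Only the top-k relevant chunks enter the model's context for this step, e.g.:
# relevant = retrieve(build_index(pr_hunks), focus="authentication token validation")
```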
If the model is reviewing a Rails route handler that uses a utility function, pulling in that utility’s definition and usage examples is usually sufficient. Test suites, build scripts, and unrelated APIs add noise without improving the review.
This approach improves both token efficiency and output quality by narrowing the model’s focus to what is relevant for the current step.
The same principle applies beyond code: documentation, prior review comments, and best-practice references should be retrieved selectively rather than batch-loaded.
Increasingly, systems are shifting toward just-in-time context retrieval. Instead of preloading all information, agents maintain lightweight references and fetch details dynamically as needed. This mirrors how humans work: we rely on indexes and retrieval, not full memorization.
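A small sketch of that just-in-time pattern; the file paths are illustrative, and the point is that only lightweight references stay resident in context:

```python
# Just-in-time retrieval sketch: keep lightweight references in context and
# read full content only when a step actually needs it.
from pathlib import Path

REFERENCES = ["app/models/user.rb", "lib/auth/token_validator.rb"]  # always in context

def fetch(repo_root: str, reference: str) -> str:
    """Load a referenced file lazily instead of preloading the whole repo."""
    return Path(repo_root, reference).read_text()

# The agent sees only REFERENCES until it decides a file matters; the file
# body is then fetched and added to context for that step alone.
```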
Stateful Context: Right Information, Right Time
One recurring failure pattern in production systems is treating context as a dumping ground: collect everything and hope the model infers structure.
Reliable systems make state explicit.
For a code review assistant, this might include:
Current review phase (initial scan, deep analysis, final summary)
Files already examined and their complexity
Issues identified by category
Patterns detected to avoid repetition
Cross-file dependencies that require coordination
This state determines what enters the context at each phase. Early phases load metadata and structure. Deep analysis loads specific diffs and related files. Final recommendations load summaries grouped by severity.
This approach designs context through awareness rather than accumulation.
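A compact sketch of phase-scoped context assembly; the phase names and state keys are illustrative:

```python
# Phase-scoped context sketch: each review phase admits only the context
# that step actually needs, driven by explicit state.
from enum import Enum, auto

class ReviewPhase(Enum):
    INITIAL_SCAN = auto()
    DEEP_ANALYSIS = auto()
    FINAL_SUMMARY = auto()

def build_context(phase: ReviewPhase, state: dict) -> str:
    if phase is ReviewPhase.INITIAL_SCAN:
        return f"Changed files:\n{state['file_tree']}"          # metadata and structure only
    if phase is ReviewPhase.DEEP_ANALYSIS:
        return (
            f"Diff under review:\n{state['current_diff']}\n"
            f"Related definitions:\n{state['related_chunks']}"
        )
    return f"Issues by severity:\n{state['issue_summary']}"      # final recommendations
```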
A useful technique here is structured note-taking. The agent writes notes to external memory and selectively reintroduces them into context later. This enables persistence without continuous context growth.
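A minimal sketch of that note-taking loop, with an in-memory dict standing in for external memory:

```python
# Structured note-taking sketch: notes live outside the context window and
# only the requested topic is reintroduced, not the whole notebook.
notes: dict[str, list[str]] = {}

def take_note(topic: str, note: str) -> None:
    notes.setdefault(topic, []).append(note)

def recall(topic: str) -> str:
    # Selectively reintroduce notes for one topic into the active context.
    return "\n".join(notes.get(topic, []))
```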
Patterns That Scale
Several patterns consistently improve context quality in production systems:
Semantic compression: Do not simply truncate; identify what is semantically important and preserve it.
Hierarchical context: Maintain context at different levels of granularity. High-level summary always present, detailed information retrieved on-demand.
Temporal awareness: Weigh recent context more heavily than distant context.
Scope-based retrieval: Match context type to task phase. Different review phases need different contexts. Security analysis needs authentication flows and data validation code; performance review needs database queries and API calls; style review needs coding standards and formatting rules.
Explicit state tracking: Track progress outside the model instead of relying on inference.
Progressive disclosure: Let the LLM incrementally discover relevant context through exploration. When it identifies an imported function, it can fetch that function’s definition. When it finds a database query, it can retrieve the schema. Each interaction yields context that informs the next decision. A sketch of this loop appears after the list.
Multi-agent architectures: Rather than one agent reviewing the entire PR, delegate focused tasks to specialized agents with clean context windows and merge their distilled results. The main agent coordinates with a high-level plan while subagents perform deep technical work or use tools to find relevant information. Each subagent might explore extensively, using tens of thousands of tokens or more, but returns only a condensed summary of its work.
Context-efficient tools: Design tools that are token-efficient and unambiguous. Avoid a bloated tool set that leaves the model unsure which tool to call.
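As promised above, a sketch of progressive disclosure; `call_llm` and `lookup_symbol` are illustrative stand-ins, and the `FETCH` convention is just one way to let the model request more context:

```python
# Progressive-disclosure sketch: the model names what it needs next and the
# loop fetches it, so context grows only along the path the review takes.
def review_with_discovery(diff: str, call_llm, lookup_symbol, max_steps: int = 5) -> str:
    context = f"Diff under review:\n{diff}"
    for _ in range(max_steps):
        reply = call_llm(
            context + "\n\nIf you need a definition, answer exactly "
            "`FETCH <symbol>`; otherwise give the final review."
        )
        if not reply.startswith("FETCH "):
            return reply                                  # model has enough context
        symbol = reply.removeprefix("FETCH ").strip()
        context += f"\n\nDefinition of {symbol}:\n{lookup_symbol(symbol)}"
    return call_llm(context + "\n\nGive the final review now.")
```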
Building With Context in Mind
Start with context budgets. Decide upfront: this feature gets at most X tokens of context. Then engineer to that constraint (a small enforcement sketch appears at the end of this section).
Instrument everything. Track what context is getting sent, what the model uses, and what it ignores.
Test context degradation. What happens when conversations get long? When users upload large documents? When your knowledge base grows?
Design for stateful context. Your application has state: conversation phase, user intent, task progress. Use it to determine what context is sent to the LLM.
Treat context as a first-class concern. It is part of the core architecture, not a post-hoc optimization.
Think in attention budget. Every token you add depletes the model’s ability to focus on what matters. Be ruthless about what earns its place in context.
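A small budget-enforcement sketch, assuming the tiktoken package for counting; the 8,000-token figure and the example context parts are illustrative, not a recommendation:

```python
# Budget-enforcement sketch: count tokens across every context part and
# refuse to send an over-budget request without first trimming it.
import tiktoken

CONTEXT_BUDGET = 8_000                       # illustrative per-feature budget
enc = tiktoken.get_encoding("cl100k_base")

def within_budget(parts: list[str]) -> tuple[bool, int]:
    """Return whether the combined parts fit the budget, and the tokens used."""
    used = sum(len(enc.encode(p)) for p in parts)
    return used <= CONTEXT_BUDGET, used

system_prompt = "You are an expert code reviewer..."
running_summary = "Auth module refactor; two injection risks flagged so far."
retrieved_chunks = ["diff --git a/app/models/user.rb ..."]

ok, used = within_budget([system_prompt, running_summary, *retrieved_chunks])
if not ok:
    # Over budget: summarize, drop low-relevance chunks, or tighten retrieval
    # before the request ever reaches the model.
    ...
```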
Conclusion
We now have models with massive context windows, yet production experience repeatedly shows that smaller, better-curated contexts lead to more reliable systems.
Context engineering is not about working around limitations. It is about building systems that remain predictable, efficient, and understandable as they scale. It reflects a shift from asking “How do I prompt this?” to “What information does this system need at this moment?”
Context will not manage itself.
Do not take context for granted. Engineer it intentionally, measure it constantly, and optimize it relentlessly.
Your production systems and your users will thank you.

