In this article, you will learn how context engineering and memory engineering solve different problems in agentic AI systems, and how the two disciplines meet at the point where retrieved memory enters the context window.
Topics we will cover include:
- What context engineering involves, including selective inclusion, structural placement, and compression, and why it matters for reasoning quality within a single inference call.
- What memory engineering involves, including write policy design, storage layer selection, retrieval strategy, and maintenance, and how these shape long-term reliability.
- How memory and context engineering meet at the retrieval boundary, and the two most common failure modes that occur when this boundary is not managed well.
With that framing in place, here’s how each discipline works.

Introduction
As AI agents move into longer workflows and multi-session use cases, a familiar pattern emerges. Constraints get dropped mid-task, retrieved information resurfaces when it shouldn’t, and context from an earlier step bleeds into the current one. The failures are hard to pinpoint because no single component is obviously at fault.
Most of the time, the problem lies in two areas that get built together, conflated, or skipped: context engineering and memory engineering. They are related but distinct, fail in different ways, and require different systems to get right.
This article covers the core decisions behind each discipline and where they interact:
- What context engineering involves and the specific decisions that determine whether an agent reasons well within a single call
- What memory engineering involves and how write policy, storage, retrieval, and maintenance each affect long-term reliability
- How the two disciplines share a boundary at retrieval time and what it takes to manage that boundary well
Understanding both, separately and together, is what determines whether an agent holds up across real workloads.
An Overview of Context and Memory Engineering
Context engineering covers the design of a single inference call: what to include, what to compress, where to place things, and what to discard. Everything in scope is ephemeral; when the call ends, the window clears.
Memory engineering focuses on what survives beyond a single interaction with a model. It encompasses the systems and policies responsible for writing, storing, retrieving, updating, and governing information so that future interactions can make use of it. When an agent recalls information from a previous session, coordinates with another agent, or applies a user preference learned days or weeks earlier, it is relying on memory engineering rather than context engineering.
While context engineering determines what information is available to the model during a specific request, memory engineering determines what information persists across requests and how that information is maintained, retrieved, and trusted over time. Here’s an overview:
| Aspect | Context Engineering | Memory Engineering |
|---|---|---|
| Scope | One inference call | Across calls, sessions, agents |
| Where data lives | Inside the model’s active window | External stores: vector DB, K/V, relational |
| Primary problem | What to include and how to arrange it | What to persist, retrieve, and trust |
| Fails when | Window fills, placement is wrong, noise overwhelms signal | Retrieval misses, staleness, poisoning, no write policy |
| Engineering surface | Prompt structure, compression, token budgeting | Storage schema, retrieval strategy, write and update policies |
| Lifespan of data | Duration of one LLM call | Depends on the memory type |
Context Engineering: Assembling the Optimal Context Window
For an agent running a multi-step workflow, every inference call assembles a context window from multiple sources: system prompt, task description, conversation history, tool outputs, retrieved documents, subagent summaries. Context engineering is the set of decisions that determine what each component contributes, in what form, and in what position.
Selective Inclusion
Not everything available should enter the context. A database query returning hundreds of rows, a web search returning five complete articles, a code executor logging verbose output — all of these bloat the window and reduce reasoning quality before the token limit is reached. The decision about what gets included verbatim, what gets compressed to key facts, and what gets dropped is a design choice, not a default.
Structural Placement
Where information sits in the window affects how reliably the model uses it. Models attend more strongly to content at the beginning and end of long contexts, with material in the middle receiving significantly less weight. This is known as the “lost in the middle” effect.
Hard constraints and task-critical instructions belong at the top of the window. Retrieved information that is most relevant to the current task should be placed near the end of the context window.
The current user query or task should typically follow the retrieved information, positioning both the relevant context and the immediate objective as close as possible to the generation point. This arrangement increases the likelihood that the model will effectively use the retrieved information when producing its response.

Context Engineering Overview
Compression on Arrival
Tool outputs should be compressed after a call returns, not after the window fills. A raw API response carrying 3,000 tokens, of which the agent needs only 150, should be summarized before it enters context for the next step. Waiting until the window is full and then scrambling to truncate is reactive management of a problem that compression at the source prevents.
Conversation History Management
Conversation history grows faster than any other context component. For long-running agents, carrying the full history into every call makes every subsequent inference more expensive and less reliable. A compression strategy — rolling window, hierarchical summarization, or structured state extraction — should be applied at defined intervals, not when the window overflows.
Memory Engineering: Designing Persistent AI Memory Systems
Once an inference call completes, memory engineering determines what deserves to persist and under what conditions it gets used again. This covers four distinct concerns: what to write, where to store it, how to retrieve it, and how to keep it accurate over time.
Write Policy Design
Write policy design is one of the most overlooked aspects of memory engineering, yet it has a disproportionate impact on memory quality over time. While retrieval systems often receive the most attention, retrieval quality is ultimately constrained by what enters the memory store in the first place.
A well-defined write policy specifies:
- What events trigger a write to memory
- Which information is eligible for storage
- The format in which information is stored, such as raw text, structured records, extracted facts, or summaries
- The confidence or validation requirements for accepting new entries
- Which agents, tools, or system components are permitted to write to specific memory namespaces
- How updates, corrections, and conflicting information are handled
- Retention rules, expiration policies, and time-to-live (TTL) requirements for different memory types
Without explicit write policies, systems often default to storing too much information, assigning equal trust to all entries, and retaining data indefinitely. Over time, low-value and outdated memories accumulate, signal-to-noise ratios decline, and retrieval quality degrades. The result is a memory system that grows continuously while becoming progressively less useful.
Storage Layer Selection
Different memory types serve different purposes and require different storage backends. The choice of backend also constrains which retrieval strategies are available.
| Memory Type | What It Stores | Storage Backend | Retrieval Method |
|---|---|---|---|
| Working | Active task state, intermediate results | In-memory or short-lived K/V (Redis) | Direct key lookup |
| Episodic | Past interactions, task runs, decisions | Vector store (Pinecone, Weaviate, Chroma) | Semantic similarity search |
| Semantic | Persistent facts, user preferences, domain knowledge | Vector store + K/V hybrid | Semantic search or exact key |
| Procedural | Learned workflows, successful action patterns | Structured store or prompt injection | Pattern match, direct retrieval |
OpenAI’s context personalization cookbook makes a useful distinction between retrieval-based memory and state-based memory for use cases requiring continuity. Retrieval-based memory treats past interactions as loosely related documents and is brittle to phrasing variation and conflicting updates. Structured state extraction — writing typed, validated facts rather than embedding raw conversation chunks — produces more consistent results for facts that need to be applied reliably across sessions.

Memory Engineering Overview
Retrieval Strategy
Reading from memory is not a single operation. A well-designed retrieval layer checks working memory first (fast, cheap, exact key lookup), falls back to semantic search in episodic or semantic memory when nothing relevant surfaces, applies metadata filters for recency and trust level before returning results, and injects only what the current step needs.
Memory Maintenance
A store with no maintenance policy degrades over time. The entries accumulate, stale facts compete with current ones, and retrieval quality falls as signal-to-noise ratio drops. The following maintenance routines matter in practice: confidence decay on volatile facts, deduplication of semantically similar entries, TTL-based expiry on working memory and time-sensitive data, and periodic compression of old episodic records into session-level summaries.
A MemoryEntry schema that encodes these concerns directly makes write and maintenance logic easier to reason about:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
class MemoryEntry(BaseModel): content: str memory_type: str # working | episodic | semantic | procedural importance: float # 0.0–1.0, gates long-term storage confidence: float # decays over time for volatile facts trust_level: float # 1.0 internal system, 0.5 user input, 0.0 external created_at: datetime expires_at: datetime | None provenance: dict # agent_id, tool_name, session_id, input_hash def should_write_to_long_term(entry: MemoryEntry) -> bool: return ( entry.importance >= 0.6 and entry.confidence >= 0.7 and entry.trust_level >= 0.5 ) |
AI Agent Memory Design Guide – Working, Long-Term, and Procedural Memory with Forgetting and Staleness Management and 7 Steps to Mastering Memory in Agentic AI Systems are useful overviews of agent memory design.
The Retrieval Boundary: Connecting Memory and Context Engineering
Memory engineering and context engineering are often discussed as separate disciplines, but in practice they are deeply interconnected. Both exist to solve the same fundamental problem: ensuring that a model has access to the right information at the right time.
At a high level:
- Memory engineering focuses on persistence: what information should be stored, updated, retained, or forgotten over time.
- Context engineering focuses on utilization: what information should enter the active context window for a specific task and how it should be organized.
- Retrieval is the boundary where these two disciplines meet.
Memory systems produce candidate information. Context assembly then decides:
- Whether that information should enter the prompt
- How much of it should be included
- Where it should be placed within the context window
Managing this boundary well is what transforms a collection of memory components into a coherent agent system.
Failure Mode #1: Retrieval Without a Context Budget
One of the most common failures occurs when retrieval is treated independently from context assembly.
A memory search returns a set of relevant entries, and the context assembler injects all of them into the prompt. As more memories are added, the context window gradually fills with retrieved content, leaving less room for instructions, tool outputs, reasoning traces, and task-specific information.
The resulting symptoms are often misleading:
- Retrieval quality appears high
- Relevant memories are successfully found
- System performance still degrades
In many cases, the memory system has done its job correctly. The failure occurs because context assembly lacks a budgeting mechanism.
A better approach is retrieval-aware context assembly. Instead of retrieving first and budgeting later, the context layer allocates a token budget before retrieval begins. The retrieval layer then returns only the highest-value memories that fit within that budget.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
async def retrieve_for_step( self, step: AgentStep, max_tokens: int ) -> str: candidates = await self.memory.search( query=step.retrieval_query, max_results=10, filters={ "trust_level": {"gte": 0.5}, "expires_at": {"gt": datetime.now()} } ) selected = [] used = 0 for entry in sorted( candidates, key=lambda e: e.relevance_score, reverse=True ): cost = self.token_count(entry.content) if used + cost > max_tokens: break selected.append(entry.content) used += cost return "\n\n".join(selected) |
The key idea is simple: retrieval must operate within context constraints, not assume unlimited space downstream.
Failure Mode #2: Poor Placement of Retrieved Information
Retrieval quality alone is not sufficient. Even highly relevant memories can fail if they are placed incorrectly inside the context window.
A common issue is treating retrieval purely as a search problem while ignoring placement. Retrieved memories are appended wherever they arrive, without considering their role in the current reasoning step.
This becomes more impactful in long contexts. Attention is not uniformly distributed across the prompt. Information placed deep inside a long context can receive significantly less influence than information positioned near the beginning or end. This leads to a subtle failure mode:
- The correct information is retrieved
- The information is inserted into context
- The model behaves as if it is missing
The retrieval succeeded but the placement failed. Context assembly should therefore optimize both:
- Selection: what enters the context window
- Placement: where it appears within the context window
Retrieved information that must influence the current step should be positioned near the active reasoning region rather than appended arbitrarily.
Retrieval as a Step in Context Construction
Retrieval is the first step in turning stored memory into usable context. The goal is not only to retrieve relevant information, but to ensure it is the right information for the current step, in the right amount to fit within the context budget, and placed in the right location where the model can effectively use it.
When memory engineering and context engineering are treated as a single retrieval-to-context pipeline, rather than isolated components, agent systems become more reliable, efficient, and scalable.
Context Engineering – LLM Memory and Retrieval for AI Agents by Weaviate is a great reference.
Summary
Context and memory engineering are two layers of a single system that controls what the model knows, when it knows it, and how that knowledge is used.
Context engineering operates at inference time, shaping the active information window. Memory engineering operates across time, shaping what information persists and how it can be retrieved later.
| Dimension | Context Engineering | Memory Engineering |
|---|---|---|
| Core question | What should the model see right now, and how? | What should the system retain, and for how long? |
| Primary artifact | Assembled context window per inference call | Persisted memory entries across calls and sessions |
| Token management | Budget allocation per window component | Storage cost per entry type; retrieval cost per query |
| Compression | Tool outputs summarized before injection; history rolled or extracted | Old episodic records compressed; stale facts decayed or pruned |
| Freshness | Rolling history window; stale turns dropped | TTL on volatile facts; confidence decay over time |
| Trust | Source hierarchy governs assembly order | Provenance tracked per entry; low-trust content sanitized before write |
| Multi-agent | Each agent assembles its own window independently | Scoped namespaces per agent; shared namespace for cross-agent facts |
| Failure mode | Overflow, attention degradation, noisy assembly | Poisoning, staleness, retrieval miss, unbounded growth |
| Maintenance | Proactive compression at defined intervals | TTL expiry, deduplication, confidence decay, episodic archiving |
| Where they meet | Retrieved memory enters context: budget and placement govern how | Context assembly requests retrieval within a token budget constraint |
To sum up, an agentic system only works when both layers are aligned: memory determines what is available, and context determines what becomes actionable.






No comments yet.