1. Introduction
The token limit of large model context windows (typically 4K–128K tokens) determines the amount of information that can be carried in a single conversation. When an Agent performs multiple rounds of tool calls or complex reasoning, early conversation content may be truncated as the context grows, leading to a “memory loss” problem. This article specifically discusses implementation solutions for short-term session memory—how to efficiently manage conversation history within the current session so that the Agent can maintain contextual coherence.
After reading, readers will understand: the core challenges of short-term memory (token overflow, information loss, retrieval efficiency), master various engineering implementation paths from simple truncation to vector retrieval, and learn how to implement technical solutions such as MemorySaver, summary compression, and vector databases within frameworks like LangGraph. This article does not cover the complete design of long-term cross-session memory (e.g., user profiles, knowledge base persistence), but will discuss the connection points between the two in subsequent chapters.
2. Core Concepts: Definition and Boundaries of Short-Term Memory
2.1 The Role of Short-Term Memory in Agent Memory Systems
The Agent’s memory system is typically divided into three levels:
- Working Memory: Temporary state during the current task execution, e.g., “just called the weather API, now parsing the returned result.” This data exists only within the context window of a single inference and is automatically discarded when the model inference ends. It is usually carried by system messages or conversation records in the prompt.
- Short-term Memory: Context across multiple steps within the same session.
For example, a user first asks “What’s the weather like in Beijing?” and then asks “What about Shanghai?” The Agent needs to remember the topic “we were discussing city weather.” Short-term memory requires engineering management; otherwise, when tokens overflow and the context is truncated, the Agent loses the conversation topic.
- Long-term Memory: Cross-session user preferences, business rules, historical interaction records.
Usually implemented with external storage (vector databases, relational databases) for persistence.
This article focuses on short-term memory (session-level). Its distinction from working memory: working memory is the short window the model is currently processing (typically only a few rounds), while short-term memory requires active management—deciding which information to retain and which to discard when nearing the token limit.
2.2 Core Challenges of Short-Term Memory
Implementing short-term memory faces three interrelated issues:
- Context Window Truncation (Token Overflow): This is the most direct problem. As the number of conversation rounds increases and the accumulated token count exceeds the model’s context window, the earliest dialogue is truncated. At best, user intent is lost; at worst, the Agent lacks necessary prerequisite information for subsequent steps.
- Information Loss: Truncation or compression strategies inevitably cause information loss. For example, fixed-window truncation completely discards old dialogue; summary compression loses details, especially precise information like numbers, addresses, or specific instructions.
- Retrieval Efficiency: When it is necessary to quickly locate key information from a large amount of historical dialogue (e.g., the “customer ID” the user mentioned three rounds ago), linearly scanning all history is unacceptable in terms of latency and token cost. An efficient retrieval mechanism is needed.
There is a trade-off among these three challenges: retaining more information requires more complex compression or retrieval solutions, but introduces additional computational overhead and latency. In engineering practice, the appropriate solution should be selected based on the business scenario (high real-time vs. high accuracy).
3. Basic Solutions: Context Window Truncation Solutions
3.1 Fixed Window Sliding
The simplest strategy: fixedly retain the most recent N rounds of dialogue, discarding all earlier records. Implementation is as follows:
- Maintain a circular queue or deque with capacity set to N (e.g., 20 rounds).
- When a new round of dialogue is added, if the queue is full, pop the oldest round.
- Concatenate the messages in the queue into the LLM’s messages list.
Pros: Extremely simple to implement, no additional computational cost, latency nearly zero. Suitable for short session scenarios (e.g., single queries, single tool calls).
Cons: Once early dialogue is discarded, information is completely lost. If the user suddenly mentions “what was that parameter just now?” during the conversation, the Agent cannot answer. Additionally, selecting N requires human experience: for tool-call-intensive scenarios, 20 rounds may contain a large amount of tool-returned data causing token overflow; for pure chat scenarios, 20 rounds may only consume a small number of tokens.
3.2 Token-Threshold-Based Truncation
A more refined approach than fixed rounds: track the total token count of the current messages list. When it exceeds a threshold, discard messages one by one starting from the earliest until the total token count falls below the threshold.
- The threshold is usually set to 70%–80% of the model’s maximum context window, reserving some tokens for subsequent reply generation.
- The discard strategy can be “discard the single oldest message” or “discard the oldest few messages.”
Pros: More efficient use of the context window (fixed rounds may discard while the window is not full, or fail to discard when the window overflows). Suitable for scenarios with uneven conversation lengths.
Cons: Like fixed window, information is completely lost after discarding. For long conversation scenarios, neither solution is ideal.
These two solutions are suitable for prototype validation and short-term interactions with low requirements for historical information (e.g., customer service scenarios requiring only the current question). If the business frequently needs to reference historical information, more advanced solutions should be used.
4. Intermediate Solution: Dialogue History Summary Compression
4.1 Principle
The core idea of summary compression: when the dialogue history is about to exceed the token threshold, call the LLM to compress the earliest messages (or all messages) in the current messages list into a brief summary, replacing the original messages with this summary. The summary occupies far fewer tokens than the original messages, thus “freeing up space” in the context.
Implementation flow:
- Continuously collect dialogue messages.
- When the total token count of messages approaches the threshold (e.g., 80%), trigger the compression operation.
- Send the historical messages to be compressed (usually the earliest part) to the LLM, with a prompt example: “Please compress the following conversation into a summary, retaining key information (such as user-specified parameters, tool execution results, conclusions, etc.), within XXX tokens.”
- The LLM returns the summary text. Replace the original historical messages with the summary message (usually as a system or assistant message, indicating it is a history summary).
- Continue the subsequent dialogue, appending new messages to the list.
4.2 Key Engineering Practices
Compression Trigger Timing: It is recommended to trigger dynamically based on total token count rather than a fixed number of rounds. For example, set a token threshold to max_tokens * 0.8, and execute compression when the actual tokens exceed this threshold. If the tokens are still near the limit after compression, multiple rounds of compression can be performed (compress one part, then another).
Compression Storage Location: The summary can be stored in the messages list in memory (as above) or persisted to external cache (Redis, SQLite) for later queries. The advantage of external storage is that if the user asks to “go back to topic X, start from there,” part of the history can be restored by retrieving from the external cache.
Compression Update Strategy: Two approaches:
- Incremental Compression: Compress only the earliest segment each time (e.g., the earliest 5 rounds), retaining the rest of the history. The advantage is saving tokens by not re-summarizing all history. The disadvantage is potential information overlap or gaps between summaries.
- Full Compression: Compress all history on the first trigger, and subsequently only compress the incremental part. The LLM can provide a new incremental summary based on the previous summary, similar to an “incremental update.” In LangGraph implementations, the compression node can decide whether to trigger full or incremental compression after each state update.
4.3 Typical Implementation: Summary Node in LangGraph
In LangGraph, summary compression can be implemented using MemorySaver together with a custom node. MemorySaver is responsible for persisting the conversation state, while the compression node determines after each state update whether tokens exceed the limit, and if so, performs compression.
1 | |
1 | |
Note: Summary compression loses precise details of the original dialogue. In engineering, evaluate the business scenario: for needs requiring accurate recall of “the user said ‘use coupon code ABCD’ three turns ago,” the summary may miss this information. If high precision is required, consider the vector retrieval solution in the next section.
5. Practical Code: LangGraph-Based MemorySaver for Session-Level Memory
5.1 What is MemorySaver?
MemorySaver is a built-in checkpoint saver in LangGraph used to save state snapshots (including messages, intermediate variables currently running, etc.) after each step of the Agent. It uses SQLite as the backend storage by default and can also be configured for in-memory storage. Its role:
- When the Agent needs to recover due to an error or interruption, it can replay from the most recent checkpoint.
- In multi-turn conversations, it automatically maintains the messages list without manual history management.
5.2 Core Code Implementation
The following example shows how to create an Agent and inject MemorySaver to achieve automatic management of session-level short-term memory.
1 | |
5.3 How MemorySaver Works
After each
invoke()call,MemorySaverpersists the currentAgentStateto an SQLite file (default path ismemory.dbin the current directory).thread_idis used to isolate memory between different sessions; conversations with the samethread_idbelong to the same session and share memory; memories between differentthread_ids are isolated.In code, you don’t need to manually manage message history; just ensure that
AgentStatecontains amessageslist, andMemorySaverautomatically records the complete messages list after each state update.
5.4 Notes and Tuning
- Storage Bloat:
MemorySaversaves a complete state snapshot at each step. If the conversation is very long (hundreds of rounds), the database will quickly bloat. In production, it is recommended to regularly clean old checkpoints or limit the maximum number of checkpoints. LangGraph provides theMaxHistoryparameter or manual checkpoint deletion methods. - Combining with Summary Compression: Summary logic can be implemented inside the
chatbotnode. When themessageslength exceeds a threshold, compress the earliest dialogue into a summary.
MemorySaver will save the compressed messages, so the state in the next call will be the compressed version.
- Concurrency Control: SQLite does not natively support high-concurrency writes. In high-concurrency production environments (multiple threads/processes operating on the same file), consider using
PostgresSaver(PostgreSQL backend provided by LangGraph) or a concurrency-supporting store like Redis.
6. Advanced Solution: Vector Database for Short-Term Session Retrieval
6.1 Applicable Scenarios
Summary compression loses details, and fixed-window truncation directly discards information. If the business scenario requires precise recall of historical conversations (e.g., the user asks “what was the customer requirement mentioned in the third conversation last month?”), the above solutions are insufficient. In such cases, a retrieval mechanism is needed: split multi-turn conversations into several semantic chunks, embed them as vectors, and store them in a vector database. When the user asks a question, retrieve relevant chunks based on semantic similarity.
6.2 Implementation Idea
Chunking and Embedding: Split the current session’s conversation history into chunks by round or semantic paragraph, each chunk containing several rounds of dialogue. Generate vector representations for each chunk using an embedding model (e.g.,
text-embedding-3-small), and store them in an in-memory vector library (e.g.,chromadb‘sEphemeralClient) or Redis vector store.Retrieval: When the user asks a new question, embed the question as a vector as well, compute cosine similarity with the stored dialogue chunk vectors, and retrieve the top-K most relevant chunks.
Injecting Context: Inject the retrieved chunk text along with the current dialogue messages into the LLM as context. Typically, the retrieved content is concatenated into a system message or user message.
6.3 Comparison with Summary Solution
| Dimension | Summary Compression | Vector Retrieval |
|---|---|---|
| Information Retention | Loss of details (especially precise numbers, long text) | Retains original text, but recalls only relevant chunks |
| Latency | Low (only one LLM compression call) | Medium (embedding + retrieval + LLM parsing) |
| Key Information Recall | Depends on summary quality, easy to miss | Semantic matching accurate, but may recall irrelevant chunks |
| Implementation Complexity | Low | Medium (requires managing vector store, embedding calls) |
| Applicable Scenarios | Many conversational rounds but no need for precise review | Requires precise access to specific historical information |
Selection Suggestion: For most enterprise applications, a hybrid approach is more reasonable—use MemorySaver + summary compression for recent dialogue (ensuring low latency), while simultaneously vectorizing each conversation round in a short-term vector store. Enable retrieval when user questions involve historical details. That is, a combination of “short-window memory + retrievable history.”
7. Enterprise-Grade Agent Memory Architecture: Multi-Level Short-Term Memory Design
7.1 Lightweight vs. Enterprise-Grade Solution Comparison
| Dimension | Lightweight Solution | Enterprise-Grade Solution |
|---|---|---|
| Storage Medium | Memory, Redis Cache, SQLite | Distributed Cache (Redis Cluster) + Relational Database + Vector DB |
| Memory Hierarchy | Current session only | Working Memory (Memory) → Short-Term Memory (Redis/SQLite) → Long-Term Memory (RAG Vector Store) |
| Permissions & Isolation | Simple session ID isolation | User-level, role-level, tenant-level isolation; supports audit logs |
| Version Management | None | Supports state rollback, checkpoint recovery |
| Cleanup Strategy | TTL auto-expiration | Auto-archiving + policy-based cleanup (by time, by importance) |
7.2 Enterprise-Grade Multi-Level Design Example
- Working Memory Layer: Context for the current inference round, exists within the LLM’s input prompt, managed by LangGraph’s
StateGraph, not persisted. - Short-Term Memory Layer: The most recent N sessions (e.g., last 10 interactions), stored in Redis with key format
session:{user_id}:short_term, value as serialized compressed summary + vector index pointer.
Set TTL (e.g., 7 days), auto-delete after expiration.
- Long-Term Memory Layer: Cross-session important information (user preferences, key business conclusions), stored via RAG flow in a vector database (e.g., Pinecone, Milvus), managed by a separate long-term memory service.
Version Management and Isolation: Each user_id + session_id corresponds to an independent short-term memory key. In multi-user systems, ensure short-term memories do not contaminate each other. LangGraph’s thread_id mechanism naturally supports session isolation; at the enterprise level, a user-session mapping layer can be encapsulated on top (e.g., mapping user_id to thread_id).
8. Practical Comparison: Lightweight vs. Enterprise-Grade Short-Term Memory Integration
8.1 Lightweight Example: Simple Redis-Based Storage
1 | |
Applicable Scenarios: Prototypes or internal tools for single machines or small teams. Redis provides read/write latency of a few milliseconds.
8.2 Enterprise Example: LangGraph + Chroma Vector Retrieval
1 | |
8.3 Selection Suggestions
- From Implementation Complexity: Lightweight solutions can be integrated within two days; enterprise solutions require building a vector store, designing permission isolation, typically 1–2 weeks.
- Latency: Lightweight solutions (Redis get/set) < 5ms; vector retrieval solutions (embedding + retrieval) typically 50–200ms, but retrieval itself is asynchronous and can be pre-retrieved while the user is thinking.
- Cost: Vector retrieval increases embedding model API calls and vector store storage costs.
If the conversation volume is not large (daily active users < 1000), using memory or SQLite is sufficient.
- Maintainability: Lightweight solutions are difficult to scale; enterprise solutions require dedicated deployment and operations.
9. Pitfalls and Performance Optimization
9.1 Common Pitfalls
- Summary Loses Key Information: LLM compression summaries may ignore specific numbers, addresses, raw data returned by tools. Solution: Explicitly state in the compression prompt “retain all numbers and proper nouns,” or do not compress dialogue chunks containing numerical values—directly store them in the vector store.
- Inappropriate Vector Retrieval Dimension Selection: Using OpenAI’s 1536-dimensional embedding model has acceptable performance, but if using a larger-dimensional open-source model (e.g., 4096 dimensions), retrieval latency increases significantly.
It is recommended to select dimensions based on data size: for small datasets (< 100k entries), 768 dimensions are sufficient.
- MemorySaver Not Cleaning Expired Sessions: In high-frequency conversation scenarios, the SQLite file will continuously bloat. Configure the
MaxHistoryparameter or periodically callmemory.delete_checkpoints(session_id)to clean up.
It is recommended to add a scheduled task at application startup to scan sessions with expired TTL and clear them.
- Inconsistency Between Vector Store and MemorySaver Data: Dialogue chunks in the vector store may be compressed or truncated in MemorySaver. It is recommended to update the compressed summary text into the vector store, or synchronously delete retrieval entries that have been truncated.
9.2 Performance Optimization
Asynchronous Cache Writes: For vector retrieval scenarios, asynchronize embedding and writing to avoid blocking the conversation flow. Use
celeryor a localThreadPoolExecutor.Use Smaller Models for Summary Generation: Summary compression does not require high precision; using
gpt-4o-minior a local small model (e.g.,Qwen2-1.5B) can reduce costs.Set TTL for Automatic Cleanup: Whether using Redis or SQLite, set a session TTL (e.g., 1 day) to auto-delete upon expiration. Avoid unlimited storage growth.
9.3 Cross-Session Short-Term Memory Concatenation
In real business scenarios, users may work on the same task across multiple sessions (e.g., long form filling). In such cases, short-term memory needs to penetrate multiple sessions. Recommended approach:
- Include user ID and task ID in the
thread_id:user-{id}-task-{task_id}. - When the user restarts a session, query the previous short-term memory summary via user ID + task ID.
- Inject the summary as an initial system message into the new session.
This design essentially merges short-term and long-term memory—elevating a small amount of cross-session information into “short-term type long-term memory.”
10. Summary and Extensions
10.1 Applicable Scenario Summary for Three Short-Term Memory Solutions
| Solution | Applicable Scenarios | Core Cost |
|---|---|---|
| Context Window Truncation | Short sessions, prototype validation | Complete information loss |
| Dialogue History Summary Compression | Medium-length conversations (10–50 rounds), can tolerate detail loss | Loses details, one LLM call |
| Vector Database Retrieval | Long conversations, need precise recall, knowledge-intensive scenarios | Additional latency + API cost |
In production, a hybrid path is recommended: MemorySaver manages the complete messages, summary compression controls token length, and each conversation round is vectorized and stored for precise retrieval. This balances low latency and high recall.
10.2 Extension Directions
- Automatic Grading of Short-Term and Long-Term Memory: Develop an importance scoring model to automatically determine which information is worth elevating to long-term memory (e.g., explicit user preferences, key business decisions), while the rest is automatically cleaned after TTL. An asynchronous “memory evaluation” task can be executed in parallel after the
chat_with_memorynode ends. - Multimodal Short-Term Memory: Current solutions mainly handle text.
For Agents involving non-text modalities like images and tables (e.g., document Agents), image compression and retrieval need to be considered.
- LangGraph Advanced Checkpoint Configuration: The official documentation introduces
PostgresSaver(high concurrency),SqliteSavercustom table structures, and checkpoint saving strategies (e.g., keep only the last 10 checkpoints).
A recommended production configuration is: use PostgresSaver with max_checkpoints_per_thread=20, combined with a scheduled SQL cleanup task.
10.3 Recommended Further Reading
- LangGraph Official Documentation: Configuration examples for
MemorySaverandPostgresSaver - Redis Stack Documentation: Integration cases for the
RediSearchvector database module - Paper MemGPT: Towards LLMs as Operating Systems: Hierarchical management design of virtual context windows
Short-term session memory is a key module for Engineering-level Agent deployment. From simple truncation to vector retrieval, and then to enterprise-level hierarchical design, each solution corresponds to specific resource constraints and business requirements. In practice, it is recommended to prioritize LangGraph’s MemorySaver with custom summary nodes to first solve the context coherence problem, then gradually introduce vector retrieval based on actual performance bottlenecks.
Summary
Through this article, we believe you have gained a deeper understanding of “Agent Short-Term Session Memory Implementation.” We suggest practicing more in combination with actual projects. If you have any questions, feel free to discuss!