1. Introduction

The token limit of large model context windows (typically 4K–128K tokens) determines the amount of information that can be carried in a single conversation. When an Agent performs multiple rounds of tool calls or complex reasoning, early conversation content may be truncated as the context grows, leading to a “memory loss” problem. This article specifically discusses implementation solutions for short-term session memory—how to efficiently manage conversation history within the current session so that the Agent can maintain contextual coherence.

After reading, readers will understand: the core challenges of short-term memory (token overflow, information loss, retrieval efficiency), master various engineering implementation paths from simple truncation to vector retrieval, and learn how to implement technical solutions such as MemorySaver, summary compression, and vector databases within frameworks like LangGraph. This article does not cover the complete design of long-term cross-session memory (e.g., user profiles, knowledge base persistence), but will discuss the connection points between the two in subsequent chapters.

2. Core Concepts: Definition and Boundaries of Short-Term Memory

2.1 The Role of Short-Term Memory in Agent Memory Systems

The Agent’s memory system is typically divided into three levels:

  • Working Memory: Temporary state during the current task execution, e.g., “just called the weather API, now parsing the returned result.” This data exists only within the context window of a single inference and is automatically discarded when the model inference ends. It is usually carried by system messages or conversation records in the prompt.
  • Short-term Memory: Context across multiple steps within the same session. For example, a user first asks “What’s the weather like in Beijing?” and then asks “What about Shanghai?” The Agent needs to remember the topic “we were discussing city weather.” Short-term memory requires engineering management; otherwise, when tokens overflow and the context is truncated, the Agent loses the conversation topic.
  • Long-term Memory: Cross-session user preferences, business rules, historical interaction records. Usually implemented with external storage (vector databases, relational databases) for persistence.

This article focuses on short-term memory (session-level). Its distinction from working memory: working memory is the short window the model is currently processing (typically only a few rounds), while short-term memory requires active management—deciding which information to retain and which to discard when nearing the token limit.

2.2 Core Challenges of Short-Term Memory

Implementing short-term memory faces three interrelated issues:

  • Context Window Truncation (Token Overflow): This is the most direct problem. As the number of conversation rounds increases and the accumulated token count exceeds the model’s context window, the earliest dialogue is truncated. At best, user intent is lost; at worst, the Agent lacks necessary prerequisite information for subsequent steps.
  • Information Loss: Truncation or compression strategies inevitably cause information loss. For example, fixed-window truncation completely discards old dialogue; summary compression loses details, especially precise information like numbers, addresses, or specific instructions.
  • Retrieval Efficiency: When it is necessary to quickly locate key information from a large amount of historical dialogue (e.g., the “customer ID” the user mentioned three rounds ago), linearly scanning all history is unacceptable in terms of latency and token cost. An efficient retrieval mechanism is needed.

There is a trade-off among these three challenges: retaining more information requires more complex compression or retrieval solutions, but introduces additional computational overhead and latency. In engineering practice, the appropriate solution should be selected based on the business scenario (high real-time vs. high accuracy).

3. Basic Solutions: Context Window Truncation

3.1 Fixed Window Sliding

The simplest strategy: always keep the most recent N rounds of dialogue and discard all earlier records. The implementation is as follows:

  • Maintain a circular queue or deque with capacity set to N (e.g., 20 rounds).
  • When a new round of dialogue is added, if the queue is full, pop the oldest round.
  • Concatenate the messages in the queue into the LLM’s messages list.
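
The three steps above can be sketched with the standard library's deque, whose maxlen argument drops the oldest entry automatically. This is a minimal illustration, not a reference implementation: the class name SlidingWindowMemory, the definition of a "round" as one user message plus one assistant message, and the sample texts are assumptions made for the example.

```python
from collections import deque


class SlidingWindowMemory:
    """Keeps only the most recent `max_rounds` rounds of dialogue.

    Assumption for this sketch: one round = one user message + one assistant message.
    """

    def __init__(self, max_rounds: int = 20):
        # A deque with maxlen discards the oldest item when a new one is appended to a full queue
        self.rounds = deque(maxlen=max_rounds)

    def add_round(self, user_text: str, assistant_text: str) -> None:
        self.rounds.append((
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ))

    def to_messages(self) -> list:
        # Flatten the retained rounds into the messages list sent to the LLM
        return [message for pair in self.rounds for message in pair]


memory = SlidingWindowMemory(max_rounds=2)
memory.add_round("Weather in Beijing?", "Sunny, 25 C.")
memory.add_round("And Shanghai?", "Cloudy, 22 C.")
memory.add_round("And Shenzhen?", "Rain, 28 C.")  # pushes the Beijing round out
window = memory.to_messages()
```

With max_rounds=2, the Beijing round is gone after the third call, which is exactly the information loss described in the cons below.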

Pros: Extremely simple to implement, no additional computational cost, latency nearly zero. Suitable for short session scenarios (e.g., single queries, single tool calls).

Cons: Once early dialogue is discarded, information is completely lost. If the user suddenly mentions “what was that parameter just now?” during the conversation, the Agent cannot answer. Additionally, selecting N requires human experience: for tool-call-intensive scenarios, 20 rounds may contain a large amount of tool-returned data causing token overflow; for pure chat scenarios, 20 rounds may only consume a small number of tokens.

3.2 Token-Threshold-Based Truncation

A more refined approach than fixed rounds: track the total token count of the current messages list. When it exceeds a threshold, discard messages one by one starting from the earliest until the total token count falls below the threshold.

  • The threshold is usually set to 70%–80% of the model’s maximum context window, reserving some tokens for subsequent reply generation.
  • The discard strategy can be “discard the single oldest message” or “discard the oldest few messages.”
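
A hedged sketch of the token-threshold variant follows. The 4-characters-per-token estimate is an assumption standing in for a real tokenizer, and the function names estimate_tokens and trim_to_budget are hypothetical. The newest message is always kept so that the current question survives even if it alone exceeds the budget.

```python
def estimate_tokens(message: dict) -> int:
    # Rough assumption: about 4 characters per token; use the model's tokenizer in practice
    return max(1, len(message["content"]) // 4)


def trim_to_budget(messages: list, max_context: int, ratio: float = 0.8) -> list:
    """Drop the oldest messages one by one until the total fits ratio * max_context."""
    budget = int(max_context * ratio)
    kept = list(messages)  # copy, so the caller's list is not modified
    total = sum(estimate_tokens(m) for m in kept)
    while total > budget and len(kept) > 1:
        total -= estimate_tokens(kept.pop(0))
    return kept


# Five messages of 400 characters each, i.e. roughly 100 tokens apiece
history = [{"role": "user", "content": f"{i}" + "x" * 399} for i in range(5)]
trimmed = trim_to_budget(history, max_context=400)  # budget = 320 tokens
```

Here the two oldest messages are discarded (500 to 400 to 300 estimated tokens), leaving three messages under the 320-token budget.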

Pros: More efficient use of the context window (fixed rounds may discard while the window is not full, or fail to discard when the window overflows). Suitable for scenarios with uneven conversation lengths.

Cons: Like fixed window, information is completely lost after discarding. For long conversation scenarios, neither solution is ideal.

These two solutions are suitable for prototype validation and short-term interactions with low requirements for historical information (e.g., customer service scenarios requiring only the current question). If the business frequently needs to reference historical information, more advanced solutions should be used.

4. Intermediate Solution: Dialogue History Summary Compression

4.1 Principle

The core idea of summary compression: when the dialogue history is about to exceed the token threshold, call the LLM to compress the earliest messages (or all messages) in the current messages list into a brief summary, replacing the original messages with this summary. The summary occupies far fewer tokens than the original messages, thus “freeing up space” in the context.

Implementation flow:

  1. Continuously collect dialogue messages.
  2. When the total token count of messages approaches the threshold (e.g., 80%), trigger the compression operation.
  3. Send the historical messages to be compressed (usually the earliest part) to the LLM, with a prompt example: “Please compress the following conversation into a summary, retaining key information (such as user-specified parameters, tool execution results, conclusions, etc.), within XXX tokens.”
  4. The LLM returns the summary text. Replace the original historical messages with the summary message (usually as a system or assistant message, indicating it is a history summary).
  5. Continue the subsequent dialogue, appending new messages to the list.
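
The flow above can be sketched as a single function. The LLM call is passed in as a summarize callable so the example stays self-contained; fake_summarize, the chunk size of 4, and the 4-characters-per-token estimate are all assumptions for illustration.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def count_tokens(messages: List[Message]) -> int:
    # Rough assumption: about 4 characters per token
    return sum(len(m["content"]) // 4 + 1 for m in messages)


def build_summary_prompt(chunk: List[Message], limit: int) -> str:
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in chunk)
    return (
        "Please compress the following conversation into a summary, retaining key "
        "information (user-specified parameters, tool execution results, conclusions), "
        f"within {limit} tokens.\n\n{transcript}"
    )


def compress_history(
    messages: List[Message],
    summarize: Callable[[str], str],
    max_tokens: int,
    ratio: float = 0.8,
    chunk_size: int = 4,
    summary_limit: int = 100,
) -> List[Message]:
    """Steps 2-4: above the threshold, replace the oldest chunk with one summary message."""
    if count_tokens(messages) <= max_tokens * ratio:
        return messages
    chunk, rest = messages[:chunk_size], messages[chunk_size:]
    summary = summarize(build_summary_prompt(chunk, summary_limit))
    return [{"role": "system", "content": f"History summary: {summary}"}] + rest


def fake_summarize(prompt: str) -> str:
    # Stand-in for the real LLM call
    return "User asked about the weather in Beijing and Shanghai."


history = [{"role": "user", "content": "x" * 200} for _ in range(6)]
compressed = compress_history(history, fake_summarize, max_tokens=200)
```

Injecting the summarizer keeps the trigger logic testable without a model, and makes it easy to swap in a cheaper model for summarization later.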

4.2 Key Engineering Practices

Compression Trigger Timing: It is recommended to trigger dynamically based on total token count rather than a fixed number of rounds. For example, set a token threshold to max_tokens * 0.8, and execute compression when the actual tokens exceed this threshold. If the tokens are still near the limit after compression, multiple rounds of compression can be performed (compress one part, then another).

Compression Storage Location: The summary can be stored in the messages list in memory (as above) or persisted to external cache (Redis, SQLite) for later queries. The advantage of external storage is that if the user asks to “go back to topic X, start from there,” part of the history can be restored by retrieving from the external cache.

Compression Update Strategy: Two approaches:

  • Incremental Compression: Compress only the earliest segment each time (e.g., the earliest 5 rounds), retaining the rest of the history. The advantage is saving tokens by not re-summarizing all history. The disadvantage is potential information overlap or gaps between summaries.
  • Full Compression: Compress all history on the first trigger, and subsequently only compress the incremental part. The LLM can provide a new incremental summary based on the previous summary, similar to an “incremental update.” In LangGraph implementations, the compression node can decide whether to trigger full or incremental compression after each state update.
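
To make the second approach concrete, here is a small sketch of a rolling summary: each compression pass hands the LLM the previous summary plus the turns being folded in, so earlier summaries are updated rather than stacked. The prompt wording, the function names, and the fake_summarize stand-in are assumptions for this example.

```python
from typing import Callable, List, Tuple


def rolling_summary_prompt(previous_summary: str, new_messages: List[dict]) -> str:
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in new_messages)
    return (
        "Update the existing summary with the new conversation turns. "
        "Keep all parameters, numbers, and conclusions.\n\n"
        f"Existing summary:\n{previous_summary or '(none)'}\n\n"
        f"New turns:\n{transcript}"
    )


def fold_into_summary(
    previous_summary: str,
    messages: List[dict],
    summarize: Callable[[str], str],
    keep_recent: int = 4,
) -> Tuple[str, List[dict]]:
    """Fold everything except the last `keep_recent` messages into the running summary."""
    if len(messages) <= keep_recent:
        return previous_summary, list(messages)
    to_fold, recent = messages[:-keep_recent], messages[-keep_recent:]
    return summarize(rolling_summary_prompt(previous_summary, to_fold)), recent


prompts_seen = []


def fake_summarize(prompt: str) -> str:
    # Stand-in for the LLM; records the prompt so the flow can be inspected
    prompts_seen.append(prompt)
    return f"summary v{len(prompts_seen)}"


turns = [{"role": "user", "content": f"turn {i}"} for i in range(6)]
summary, recent = fold_into_summary("", turns, fake_summarize, keep_recent=4)
```

Because only one summary exists at any time, this avoids the overlap and gaps that independent incremental summaries can produce, at the cost of re-reading the previous summary on every pass.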

4.3 Typical Implementation: Summary Node in LangGraph

In LangGraph, summary compression can be implemented using MemorySaver together with a custom node. MemorySaver is responsible for persisting the conversation state, while the compression node determines after each state update whether tokens exceed the limit, and if so, performs compression.

from typing import Dict, List, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END

MAX_TOKENS = 2000    # compression threshold
OLDEST_CHUNK = 5     # number of earliest messages folded into one summary


class AgentState(TypedDict):
    # No reducer on purpose: a node's return value replaces the list,
    # which is what allows the compression node to shrink it.
    messages: List[Dict]


def estimate_tokens(messages: List[Dict]) -> int:
    # Simplified calculation: character count as a stand-in for a real tokenizer
    return sum(len(m["content"]) for m in messages)


def chat_node(state: AgentState) -> dict:
    # Normal conversation node; a real implementation calls the LLM here
    reply = {"role": "assistant", "content": "<LLM reply>"}
    return {"messages": state["messages"] + [reply]}


def compress_node(state: AgentState) -> dict:
    messages = state["messages"]
    oldest_chunk = messages[:OLDEST_CHUNK]
    summary_prompt = f"Please summarize the following conversation: {oldest_chunk}"
    # A real implementation sends summary_prompt to the LLM
    summary = "<LLM generated summary>"
    # Replace the earliest messages with a single summary message
    summary_message = {"role": "system", "content": f"History summary: {summary}"}
    return {"messages": [summary_message] + messages[OLDEST_CHUNK:]}


def route_after_chat(state: AgentState) -> str:
    # Compress only when the history exceeds the threshold; otherwise finish the turn
    return "compress" if estimate_tokens(state["messages"]) > MAX_TOKENS else END


graph = StateGraph(AgentState)
graph.add_node("chat", chat_node)
graph.add_node("compress", compress_node)
graph.set_entry_point("chat")
graph.add_conditional_edges("chat", route_after_chat, {"compress": "compress", END: END})
graph.add_edge("compress", END)  # ending here avoids a chat -> compress -> chat loop
app = graph.compile(checkpointer=MemorySaver())

Note: Summary compression loses precise details of the original dialogue. In engineering, evaluate the business scenario: for needs requiring accurate recall of “the user said ‘use coupon code ABCD’ three turns ago,” the summary may miss this information. If high precision is required, consider the vector retrieval solution in the next section.

5. Practical Code: LangGraph-Based MemorySaver for Session-Level Memory

5.1 What is MemorySaver?

MemorySaver is a built-in checkpoint saver in LangGraph that saves state snapshots (messages, intermediate variables, and so on) after each step of the Agent. It keeps checkpoints in process memory, so they are lost when the process exits; for durable storage, LangGraph offers SqliteSaver and PostgresSaver as separate checkpointer packages. Its role:

  • When the Agent needs to recover due to an error or interruption, it can replay from the most recent checkpoint.
  • In multi-turn conversations, it automatically maintains the messages list without manual history management.

5.2 Core Code Implementation

The following example shows how to create an Agent and inject MemorySaver to achieve automatic management of session-level short-term memory.

import operator
from typing import Annotated, Any, Dict, List, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    # operator.add is the reducer: lists returned by nodes (and passed to invoke)
    # are appended to the checkpointed history instead of replacing it
    messages: Annotated[List[Dict[str, Any]], operator.add]


def chatbot(state: AgentState) -> dict:
    # Simulate the LLM reply; a real node would send state["messages"] to the model
    last_user_message = state["messages"][-1]["content"]
    reply = f"You said: {last_user_message}"
    # Return only the new message; the reducer appends it to the history
    return {"messages": [{"role": "assistant", "content": reply}]}


# Initialize a simple graph: only a chat node
graph = StateGraph(AgentState)
graph.add_node("chat", chatbot)
graph.set_entry_point("chat")
graph.add_edge("chat", END)

# Inject MemorySaver
memory = MemorySaver()
app = graph.compile(checkpointer=memory)

# Simulate a multi-turn conversation; the same thread_id means the same session
config = {"configurable": {"thread_id": "session-123"}}
for user_msg in ["Hello", "What's the weather like today?", "Please help me check flights from Shanghai to Beijing"]:
    app.invoke({"messages": [{"role": "user", "content": user_msg}]}, config)
    # Print current state messages; you can see history accumulating
    current_state = app.get_state(config)
    print(current_state.values["messages"])

5.3 How MemorySaver Works

  • After each step of an invoke() call, MemorySaver stores a checkpoint of the current AgentState in process memory, keyed by thread_id. Nothing is written to disk; use SqliteSaver or PostgresSaver if the state must survive a restart.

  • thread_id is used to isolate memory between different sessions; conversations with the same thread_id belong to the same session and share memory; memories between different thread_ids are isolated.

  • In code, you don’t need to manually manage message history, provided the messages field in AgentState declares an append-style reducer (for example Annotated[list, operator.add]). Without a reducer, each new input overwrites the stored list instead of extending it.

5.4 Notes and Tuning

  • Storage Bloat: The checkpointer saves a complete state snapshot at each step, so long conversations (hundreds of rounds) make the store grow quickly. In production, regularly delete old checkpoints or cap how many are kept per thread. The available pruning API depends on the LangGraph version (to our knowledge, recent releases expose delete_thread(thread_id) on checkpointers), so check the documentation for the version you run.
  • Combining with Summary Compression: Summary logic can be implemented inside the chatbot node or in a dedicated node: when the messages length exceeds a threshold, compress the earliest dialogue into a summary. The checkpointer then saves the compressed messages, so the state in the next call is the compressed version. This only works if the messages channel lets a node replace or delete entries; a plain append reducer can only grow the list.
  • Concurrency Control: MemorySaver lives in a single process, and SqliteSaver is limited by SQLite, which does not handle high-concurrency writes well. In high-concurrency production environments (multiple threads/processes), consider PostgresSaver (PostgreSQL backend provided by LangGraph) or a concurrency-friendly store such as Redis.

6. Advanced Solution: Vector Database for Short-Term Session Retrieval

6.1 Applicable Scenarios

Summary compression loses details, and fixed-window truncation directly discards information. If the business scenario requires precise recall of historical conversations (e.g., the user asks “what was the customer requirement mentioned in the third conversation last month?”), the above solutions are insufficient. In such cases, a retrieval mechanism is needed: split multi-turn conversations into several semantic chunks, embed them as vectors, and store them in a vector database. When the user asks a question, retrieve relevant chunks based on semantic similarity.

6.2 Implementation Idea

  1. Chunking and Embedding: Split the current session’s conversation history into chunks by round or semantic paragraph, each chunk containing several rounds of dialogue. Generate vector representations for each chunk using an embedding model (e.g., text-embedding-3-small), and store them in an in-memory vector library (e.g., chromadb’s EphemeralClient) or Redis vector store.

  2. Retrieval: When the user asks a new question, embed the question as a vector as well, compute cosine similarity with the stored dialogue chunk vectors, and retrieve the top-K most relevant chunks.

  3. Injecting Context: Inject the retrieved chunk text along with the current dialogue messages into the LLM as context. Typically, the retrieved content is concatenated into a system message or user message.
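
The three steps can be illustrated without any external service. In the sketch below, a bag-of-words Counter is a toy stand-in for a real embedding model (an assumption made purely so the example runs offline), and SessionVectorStore is a hypothetical in-memory class; the cosine similarity and top-K logic are the same as with real vectors.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words term counts
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SessionVectorStore:
    """In-memory store of (chunk text, vector) pairs for one session."""

    def __init__(self):
        self.chunks = []

    def add(self, text: str) -> None:
        # Step 1: embed the chunk and keep it alongside its text
        self.chunks.append((text, embed(text)))

    def top_k(self, query: str, k: int = 2) -> list:
        # Step 2: embed the question and rank chunks by cosine similarity
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = SessionVectorStore()
store.add("user: my customer id is 88231 assistant: noted")
store.add("user: what is the weather in beijing assistant: sunny")
store.add("user: book a flight to shanghai assistant: done")

hits = store.top_k("what was my customer id", k=1)
# Step 3: inject the retrieved text as a system message ahead of the current question
context_message = {"role": "system", "content": "Relevant history: " + "\n".join(hits)}
```

Replacing embed with a call to a real embedding model and the list with a vector database changes the quality and scale of retrieval, but not the shape of the flow.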

6.3 Comparison with Summary Solution

| Dimension | Summary Compression | Vector Retrieval |
|---|---|---|
| Information Retention | Loss of details (especially precise numbers, long text) | Retains original text, but recalls only relevant chunks |
| Latency | Low (only one LLM compression call) | Medium (embedding + retrieval + LLM parsing) |
| Key Information Recall | Depends on summary quality, easy to miss | Semantic matching accurate, but may recall irrelevant chunks |
| Implementation Complexity | Low | Medium (requires managing vector store, embedding calls) |
| Applicable Scenarios | Many conversational rounds but no need for precise review | Requires precise access to specific historical information |

Selection Suggestion: For most enterprise applications, a hybrid approach is more reasonable—use MemorySaver + summary compression for recent dialogue (ensuring low latency), while simultaneously vectorizing each conversation round in a short-term vector store. Enable retrieval when user questions involve historical details. That is, a combination of “short-window memory + retrievable history.”

7. Enterprise-Grade Agent Memory Architecture: Multi-Level Short-Term Memory Design

7.1 Lightweight vs. Enterprise-Grade Solution Comparison

| Dimension | Lightweight Solution | Enterprise-Grade Solution |
|---|---|---|
| Storage Medium | Memory, Redis Cache, SQLite | Distributed Cache (Redis Cluster) + Relational Database + Vector DB |
| Memory Hierarchy | Current session only | Working Memory (Memory) → Short-Term Memory (Redis/SQLite) → Long-Term Memory (RAG Vector Store) |
| Permissions & Isolation | Simple session ID isolation | User-level, role-level, tenant-level isolation; supports audit logs |
| Version Management | None | Supports state rollback, checkpoint recovery |
| Cleanup Strategy | TTL auto-expiration | Auto-archiving + policy-based cleanup (by time, by importance) |

7.2 Enterprise-Grade Multi-Level Design Example

  • Working Memory Layer: Context for the current inference round, exists within the LLM’s input prompt, managed by LangGraph’s StateGraph, not persisted.
  • Short-Term Memory Layer: The most recent N sessions (e.g., last 10 interactions), stored in Redis with key format session:{user_id}:short_term, value as serialized compressed summary + vector index pointer. Set TTL (e.g., 7 days), auto-delete after expiration.

  • Long-Term Memory Layer: Cross-session important information (user preferences, key business conclusions), stored via RAG flow in a vector database (e.g., Pinecone, Milvus), managed by a separate long-term memory service.

Version Management and Isolation: Each user_id + session_id corresponds to an independent short-term memory key. In multi-user systems, ensure short-term memories do not contaminate each other. LangGraph’s thread_id mechanism naturally supports session isolation; at the enterprise level, a user-session mapping layer can be encapsulated on top (e.g., mapping user_id to thread_id).

8. Practical Comparison: Lightweight vs. Enterprise-Grade Short-Term Memory Integration

8.1 Lightweight Example: Simple Redis-Based Storage

import json
from typing import Dict, List

import redis


class RedisShortTermMemory:
    def __init__(self, redis_url: str = "redis://localhost:6379/0", ttl: int = 3600):
        # decode_responses=True makes GET return str instead of bytes
        self.client = redis.from_url(redis_url, decode_responses=True)
        self.ttl = ttl  # seconds; 1 hour by default

    def _key(self, session_id: str) -> str:
        return f"session_memory:{session_id}"

    def save(self, session_id: str, messages: List[Dict]) -> None:
        # SETEX writes the value and resets the TTL in one command
        self.client.setex(self._key(session_id), self.ttl, json.dumps(messages))

    def load(self, session_id: str) -> List[Dict]:
        data = self.client.get(self._key(session_id))
        return json.loads(data) if data else []

    def append(self, session_id: str, role: str, content: str) -> None:
        # Read-modify-write is not atomic; acceptable for a single-writer prototype
        messages = self.load(session_id)
        messages.append({"role": role, "content": content})
        self.save(session_id, messages)


# Usage example (requires a running Redis instance)
mem = RedisShortTermMemory()
session_id = "user-123-session-1"
mem.append(session_id, "user", "Hello")
mem.append(session_id, "assistant", "Hello, how can I help you?")
print(mem.load(session_id))

Applicable Scenarios: Prototypes or internal tools for single machines or small teams. Redis provides read/write latency of a few milliseconds.

8.2 Enterprise Example: LangGraph + Chroma Vector Retrieval

import operator
import os
from typing import Annotated, Any, Dict, List, TypedDict

import chromadb
from chromadb.utils import embedding_functions
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    # operator.add appends new messages to the checkpointed history
    messages: Annotated[List[Dict[str, Any]], operator.add]


class VectorMemoryAgent:
    def __init__(self, session_id: str, top_k: int = 3):
        self.session_id = session_id
        self.top_k = top_k
        self.memory_saver = MemorySaver()
        self.chroma_client = chromadb.EphemeralClient()
        self.collection = self.chroma_client.get_or_create_collection(
            name="short_term",
            embedding_function=embedding_functions.OpenAIEmbeddingFunction(
                api_key=os.environ["OPENAI_API_KEY"],
                model_name="text-embedding-3-small",
            ),
        )

    def get_agent(self):
        # Build the LangGraph Agent with a single chat node
        graph = StateGraph(AgentState)
        graph.add_node("chat", self.chat_with_memory)
        graph.set_entry_point("chat")
        graph.add_edge("chat", END)
        return graph.compile(checkpointer=self.memory_saver)

    def chat_with_memory(self, state: AgentState) -> dict:
        last_question = state["messages"][-1]["content"]

        # 1. Retrieve relevant chunks of this session from Chroma (skipped while empty)
        prompt_messages = list(state["messages"])
        n_results = min(self.top_k, self.collection.count())
        if n_results > 0:
            hits = self.collection.query(
                query_texts=[last_question],
                n_results=n_results,
                where={"session_id": self.session_id},
            )
            documents = hits["documents"][0]
            if documents:
                context = "\n".join(documents)
                # Insert before the current question; this copy is not checkpointed
                prompt_messages.insert(
                    -1, {"role": "system", "content": f"Relevant history: {context}"}
                )

        # 2. Call the LLM with prompt_messages to generate the reply (omitted here)
        reply = "<LLM reply>"

        # 3. Embed and store the current conversation round in Chroma
        round_index = len(state["messages"])
        self.collection.add(
            ids=[f"{self.session_id}-{round_index}"],
            documents=[f"user: {last_question}\nassistant: {reply}"],
            metadatas=[{"session_id": self.session_id}],
        )
        return {"messages": [{"role": "assistant", "content": reply}]}

8.3 Selection Suggestions

  • Implementation Complexity: Lightweight solutions can be integrated within two days; enterprise solutions require building a vector store and designing permission isolation, which typically takes 1–2 weeks.
  • Latency: Lightweight solutions (Redis get/set) < 5ms; vector retrieval solutions (embedding + retrieval) typically 50–200ms, although retrieval can run asynchronously and be prefetched while the user is still typing.
  • Cost: Vector retrieval increases embedding model API calls and vector store storage costs. If the conversation volume is not large (daily active users < 1000), using memory or SQLite is sufficient.
  • Maintainability: Lightweight solutions are difficult to scale; enterprise solutions require dedicated deployment and operations.

9. Pitfalls and Performance Optimization

9.1 Common Pitfalls

  • Summary Loses Key Information: LLM compression summaries may ignore specific numbers, addresses, raw data returned by tools. Solution: Explicitly state in the compression prompt “retain all numbers and proper nouns,” or do not compress dialogue chunks containing numerical values—directly store them in the vector store.
  • Inappropriate Vector Dimension Selection: OpenAI’s 1536-dimensional embedding model performs acceptably, but a larger open-source model (e.g., 4096 dimensions) noticeably increases retrieval latency. It is recommended to select dimensions based on data size: for small datasets (< 100k entries), 768 dimensions are sufficient.
  • Checkpointer Not Cleaning Expired Sessions: In high-frequency conversation scenarios, process memory (MemorySaver) or the SQLite file (SqliteSaver) grows continuously. Periodically delete the checkpoints of expired threads; the deletion API has changed across LangGraph versions (to our knowledge, recent releases provide delete_thread(thread_id) on checkpointers), so confirm it against the version you run. It is recommended to add a scheduled task at application startup to scan sessions with expired TTL and clear them.

  • Inconsistency Between Vector Store and MemorySaver Data: Dialogue chunks in the vector store may be compressed or truncated in MemorySaver. It is recommended to update the compressed summary text into the vector store, or synchronously delete retrieval entries that have been truncated.
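
For the first pitfall in the list above, one cheap safeguard is to check a summary against its source before accepting it. The sketch below flags every digit-bearing token (IDs, amounts, codes with numbers) that appears in the original text but not in the summary. The regular expression and the function name missing_facts are assumptions for this example, and the check cannot catch purely alphabetic values such as a coupon code like ABCD.

```python
import re

# Any word-like token that contains at least one digit, e.g. "88231" or "AB12CD"
DIGIT_TOKEN = re.compile(r"\b(?=\w*\d)\w+\b")


def missing_facts(original: str, summary: str) -> set:
    """Return digit-bearing tokens present in the original but absent from the summary."""
    return set(DIGIT_TOKEN.findall(original)) - set(DIGIT_TOKEN.findall(summary))


original = "Use coupon code AB12CD for order 88231, ship to 5 Main St."
summary = "User wants to apply a coupon to order 88231."
dropped = missing_facts(original, summary)
```

If dropped is non-empty, the caller can retry the summary with a stricter prompt, append the missing values verbatim, or keep the original chunk uncompressed and index it in the vector store.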

9.2 Performance Optimization

  • Asynchronous Cache Writes: For vector retrieval scenarios, asynchronize embedding and writing to avoid blocking the conversation flow. Use celery or a local ThreadPoolExecutor.

  • Use Smaller Models for Summary Generation: Summary compression does not require high precision; using gpt-4o-mini or a local small model (e.g., Qwen2-1.5B) can reduce costs.

  • Set TTL for Automatic Cleanup: Whether using Redis or SQLite, set a session TTL (e.g., 1 day) to auto-delete upon expiration. Avoid unlimited storage growth.
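
The asynchronous-write idea from the first item can be sketched with a local ThreadPoolExecutor. Here embed_and_store simulates the embedding call with a short sleep and writes to a plain list standing in for the vector store; both are assumptions made so the example runs without external services.

```python
import time
from concurrent.futures import ThreadPoolExecutor

index = []  # stand-in for the vector store


def embed_and_store(text: str) -> int:
    time.sleep(0.05)  # simulated embedding API latency
    index.append(text)
    return len(text)


executor = ThreadPoolExecutor(max_workers=2)


def handle_turn(user_text: str, reply: str):
    # Submit the indexing work and return immediately; the reply is not delayed by it
    future = executor.submit(embed_and_store, f"user: {user_text}\nassistant: {reply}")
    return reply, future


reply, pending = handle_turn("Hello", "Hi, how can I help?")
pending.result(timeout=5)  # waited on here only to make the demo deterministic
executor.shutdown(wait=True)
```

In a real service the future would not be awaited on the request path; attaching a done-callback that logs failures is usually enough, since a missed index write degrades recall but does not break the conversation.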

9.3 Cross-Session Short-Term Memory Concatenation

In real business scenarios, users may work on the same task across multiple sessions (e.g., long form filling). In such cases, short-term memory needs to penetrate multiple sessions. Recommended approach:

  1. Include user ID and task ID in the thread_id: user-{id}-task-{task_id}.
  2. When the user restarts a session, query the previous short-term memory summary via user ID + task ID.
  3. Inject the summary as an initial system message into the new session.
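
A minimal sketch of these three steps follows. SummaryStore is a hypothetical in-memory stand-in for whatever cache holds the previous summary (Redis, SQLite, and so on), and the helper names are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


def make_thread_id(user_id: str, task_id: str) -> str:
    # Step 1: encode user ID and task ID in the thread_id
    return f"user-{user_id}-task-{task_id}"


class SummaryStore:
    """In-memory stand-in for the external cache that keeps per-task summaries."""

    def __init__(self):
        self._data: Dict[str, str] = {}

    def save(self, thread_id: str, summary: str) -> None:
        self._data[thread_id] = summary

    def load(self, thread_id: str) -> Optional[str]:
        return self._data.get(thread_id)


def start_session(
    store: SummaryStore, user_id: str, task_id: str, first_message: str
) -> Tuple[str, List[dict]]:
    thread_id = make_thread_id(user_id, task_id)
    messages: List[dict] = []
    # Step 2: look up the previous short-term memory summary for this user and task
    summary = store.load(thread_id)
    if summary:
        # Step 3: inject it as the initial system message of the new session
        messages.append({"role": "system", "content": f"Summary of previous session: {summary}"})
    messages.append({"role": "user", "content": first_message})
    return thread_id, messages


store = SummaryStore()
store.save(make_thread_id("42", "form-7"), "Name and address filled in; phone number still missing.")
thread_id, messages = start_session(store, "42", "form-7", "Let's continue the form.")
```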

This design essentially merges short-term and long-term memory—elevating a small amount of cross-session information into “short-term type long-term memory.”

10. Summary and Extensions

10.1 Applicable Scenario Summary for Three Short-Term Memory Solutions

| Solution | Applicable Scenarios | Core Cost |
|---|---|---|
| Context Window Truncation | Short sessions, prototype validation | Complete information loss |
| Dialogue History Summary Compression | Medium-length conversations (10–50 rounds), can tolerate detail loss | Loses details, one LLM call |
| Vector Database Retrieval | Long conversations, need precise recall, knowledge-intensive scenarios | Additional latency + API cost |

In production, a hybrid path is recommended: MemorySaver manages the complete messages, summary compression controls token length, and each conversation round is vectorized and stored for precise retrieval. This balances low latency and high recall.

10.2 Extension Directions

  • Automatic Grading of Short-Term and Long-Term Memory: Develop an importance scoring model to automatically determine which information is worth elevating to long-term memory (e.g., explicit user preferences, key business decisions), while the rest is automatically cleaned after TTL. An asynchronous “memory evaluation” task can be executed in parallel after the chat_with_memory node ends.
  • Multimodal Short-Term Memory: Current solutions mainly handle text. For Agents involving non-text modalities like images and tables (e.g., document Agents), image compression and retrieval need to be considered.
  • LangGraph Advanced Checkpoint Configuration: The official documentation covers PostgresSaver (suited to high concurrency) and SqliteSaver. A reasonable production setup is PostgresSaver combined with a scheduled SQL cleanup task that caps the checkpoints kept per thread (for example, the latest 20). We are not aware of a built-in option such as max_checkpoints_per_thread, so treat any such setting as something to verify against the documentation for your version.

  • LangGraph Official Documentation: Configuration examples for MemorySaver and PostgresSaver
  • Redis Stack Documentation: Integration cases for the RediSearch vector database module
  • Paper MemGPT: Towards LLMs as Operating Systems: Hierarchical management design of virtual context windows

Short-term session memory is a key module for production Agent deployment. From simple truncation to vector retrieval, and then to enterprise-level hierarchical design, each solution corresponds to specific resource constraints and business requirements. In practice, it is recommended to start with LangGraph’s MemorySaver plus a custom summary node to solve the context coherence problem first, then gradually introduce vector retrieval based on actual performance bottlenecks.
