Practical Guide to Building Long-Term Persistent Memory for Agents - Internal Knowledge Base Document
1. Introduction
In real-world agent deployments, stateless design is the primary cause of degraded multi-turn interaction quality. Repetitive user questions, agents forgetting previous instructions, and the inability to leverage historical experience across sessions—these problems fundamentally stem from a missing memory system. This article revolves around the theory of memory layering, uses LangGraph as the main code thread, explains the complete setup process for short-term and long-term memory, and introduces PostgreSQL as a structured persistence solution.
Upon completion, readers will master: the architectural layering design of memory systems, context window management and state maintenance for short-term memory, vectorized storage and semantic retrieval for long-term memory, and lifecycle management strategies for memory data.
2. Memory Layering Model: Architectural Design from Short-Term to Long-Term
2.1 Three-Level Memory Model
The design of an agent’s memory system draws from the theory of memory layering in cognitive science. In practice, we generally divide it into three levels:
Short-term Memory is responsible for storing the raw input and output within the current session, typically the historical messages within the LLM’s context window. Its characteristics are frequent reading and writing, limited capacity (constrained by the model’s token limit), and it is mainly used to maintain the coherence of a single conversation.
Working Memory tracks the status of tasks the agent is currently executing. For example, if a user asks “first check the weather, then book a hotel,” working memory maintains a task stack, recording which step is currently being executed and which subtasks have been completed. When the agent needs to temporarily switch to a subtask (e.g., checking COVID testing policies while checking the weather), working memory ensures the main task can be resumed.
Long-term Memory stores important knowledge across sessions, including user preferences, business rules, and key summaries of historical conversations. The carriers for long-term memory are typically external storage systems (vector databases, relational databases) that need to support persistence, efficient retrieval, and expiration cleanup.
The collaborative relationship among the three levels during agent runtime is as follows: After receiving user input, the agent first retrieves relevant context from long-term memory, concatenates it with short-term memory, and sends it to the LLM. Meanwhile, working memory manages the execution flow of the current task stack. Any valuable interaction result, after importance evaluation, is written back to long-term memory.
2.2 Enterprise Necessity of Long-Term Memory
In enterprise agent scenarios, long-term memory is not just a feature enhancement but a prerequisite for system availability. Below are typical cross-session requirements:
- User Preference Learning: User A prefers “detailed technical documentation,” User B prefers “concise summary descriptions.” Long-term memory needs to record these preferences and automatically apply them in subsequent interactions.
- Knowledge Reuse: A department confirmed a specific technical solution during client communication last month. During the next interaction, the agent should be able to reference the historical minutes to avoid redundant discussion.
- Contextual Reasoning: A user asks about “login failure issues” on Monday and then asks “Do I still need to update the certificate?” on Tuesday. The agent needs to associate this with the login failure scenario from the previous day’s long-term memory to provide the correct advice.
Without long-term memory, an agent is essentially an empty shell model that reinitializes with every conversation, unable to accumulate experience or build user trust.
3. Short-Term Memory Implementation: Context Window Management and State Maintenance
3.1 Sliding Window Truncation Strategy
The most basic management strategy for short-term memory is the sliding window—keeping only the most recent N rounds of conversation. Suppose the business requires the agent to remember the last 5 rounds of user input and assistant responses, then the window size is 10 messages (5 inputs, 5 outputs). Messages outside this range are simply discarded.
Below is an example implementation of BufferWindowMemory based on LangGraph:
1 | |
Note: This method is simple and efficient but has two limitations: (1) if a turn is very long (e.g., the user pastes an entire log), the window can quickly be filled by a single message; (2) the model cannot access earlier context that has been discarded, potentially leading to “forgetting” issues.
3.2 Summary Compression Strategy
To address the “one-size-fits-all” problem of the sliding window, we can introduce ConversationSummaryMemory—regularly generating summaries of earlier conversations and replacing the raw messages in the window.
1 | |
This approach is more stable in long conversations but increases LLM call overhead (needs a summary for each save). In practice, it’s recommended to only generate a summary when window truncation is triggered, rather than after every turn.
3.3 Boundaries of State Management
Although short-term memory can maintain conversational coherence, it cannot solve the “forgetting” problem across sessions. For example, if a user returns the next day and asks, “What do you think of the plan I mentioned yesterday?” short-term memory will have been cleared. Long-term memory must be relied upon to provide the previous day’s discussion. This is the boundary of short-term memory’s responsibility—it is limited to the current session and should not manage cross-session knowledge.
4. Long-Term Memory Storage Solution: Combined Vector Database + PostgreSQL Design
4.1 Dual-Carrier Storage Architecture
Long-term memory needs to support two retrieval modes simultaneously: semantic retrieval (fuzzy search for “experiences about database optimization”) and structured retrieval (exact query for “User A’s configuration preferences”). A single storage solution often cannot achieve both. In practice, we use a combined approach of vector databases (FAISS/Chroma) and relational databases (PostgreSQL):
| Storage Carrier | Applicable Scenario | Advantages | Disadvantages |
|---|---|---|---|
| Vector Database | Semantic similarity search | Fast retrieval, supports embedding | Weak structured querying |
| PostgreSQL | Exact queries, transaction management, reporting | Strong transactions, ACID | Full-text search not as good as vector retrieval |
4.2 Dual-Write Model Design
When the agent extracts a piece of memory to be stored, it simultaneously writes to both systems:
- Convert the memory content into a vector (via an embedding model like text-embedding-ada-002) and store it in the vector database index.
- Write the memory’s raw text, entity tags, timestamp, importance score, and other structured fields into the PostgreSQL table.
When retrieving, first get the Top K similar memory list from the vector database, then use PostgreSQL to retrieve the complete fields of those memories. This balances retrieval speed and data integrity.
Example of core PostgreSQL table structure:
1 | |
Engineering tip: When writing to PostgreSQL, use a connection pool (e.g., psycopg2‘s ThreadedConnectionPool) to avoid the overhead of creating a new connection for each storage operation. For vector database writes, it is recommended to use asynchronous APIs to avoid blocking the main flow.
5. Memory Lifecycle Management: Importance Evaluation and Automatic Cleanup
5.1 Why Lifecycle Management is Needed
Uncontrolled growth of long-term memory leads to two problems: storage bloat (increasing costs) and retrieval noise (a large number of irrelevant memories interfering with LLM reasoning). Memories must be graded and have expiration cleanup.
5.2 LLM-Based Importance Evaluation
Use an LLM to evaluate whether the information in the user input is worth storing in long-term memory. Example evaluation prompt:
1 | |
Set TTL (Time To Live) based on the score:
1 | |
Note: Do not evaluate every message; only those with clearly declarative information (e.g., “I live in Beijing”) to avoid unnecessary LLM calls.
5.3 Scheduled Cleanup Script
Execute the cleanup task early every morning:
1 | |
Strategy philosophy: high-value memories are only extended, never deleted; low-value memories are purged when they expire. Also keep an audit log of deletion operations for traceability.
6. Adding Long-Term Memory to Agents: Practical Memory Module Implementation in LangGraph
6.1 Embedding Memory Nodes in the Graph
LangGraph allows encapsulating memory read/write operations as independent graph nodes that are automatically triggered during the agent’s execution flow. Below is a typical design:
1 | |
6.2 Key Engineering Details
Retrieval Timing: The
retrieve_memorynode is placed at the very beginning of the graph execution to ensure the LLM has historical context before making decisions. Theextract_memorywrite node is placed after the graph execution ends to avoid interfering with the current round’s logic.Deduplication Strategy: When the same user repeatedly generates memories for the same entity, it should be updated rather than appended. Use
ON CONFLICTstatements in PostgreSQL for upsert.Error Handling: If PostgreSQL or the vector database write fails, the entire agent execution should not fail. Catch exceptions inside the
extract_memorynode and define a degradation strategy usinggraph.set_error_handler(e.g., log the error and continue execution).
7. Retrieval Enhancement Techniques: Multi-Modal Retrieval and Memory Fusion Strategy
7.1 Hybrid Memory Input
The LLM’s input includes two parts of memory: current session short-term memory (raw messages) and long-term memory retrieval results. How they are fused directly impacts reasoning quality. Recommended strategy:
1 | |
In LangChain, this can be implemented via a custom Memory class:
1 | |
7.2 Fusion Sorting and Re-Ranking
When using both semantic retrieval and keyword retrieval, the results need to be fused and sorted. RRF (Reciprocal Rank Fusion) is a lightweight fusion method:
1 | |
Re-ranking: After fusion, use a lightweight model (e.g., cross-encoder) to re-score and sort the Top 20 results, further compressing noise. This step can be omitted in latency-sensitive scenarios but is recommended for offline batch scenarios.
7.3 Time Decay Factor
The timeliness of memory is also important. A “user preference” from a year ago has different reliability compared to one from a day ago. Add a time decay factor during retrieval:
1 | |
For business rule memories, the decay factor is usually set to 1 (no decay); for temporary preference memories, decay can be faster.
8. Pitfall Logging and Performance Optimization Practices
8.1 Common Issue Checklist
1. Memory Write/Read Bottleneck
Phenomenon: Every time the agent executes, it must first retrieve long-term memory. Under high concurrency, PostgreSQL query latency increases from 1ms to 200ms, doubling the overall pipeline latency.
Cause: Connection pool not used, no partition indexing on the memory table.
Optimization Plan:
- Force use of connection pool (e.g.,
psycopg2.pool.ThreadedConnectionPool), min connections = concurrency × 2. - Hash-partition by
user_id, each partition with its own index.
2. High Vector Dimension Leading to Retrieval Latency
Phenomenon: Using a 1536-dimensional embedding model, retrieval on 1 million records takes over 300ms, failing real-time conversation requirements.
Optimization Plan:
- Dimensionality reduction: compress embedding from 1536 to 256 dimensions (using PCA or autoencoder), sacrificing a small amount of precision for more than 5x speed improvement.
- Index sharding: use IVF (Inverted File) index, set
nprobe=20(only search 20 most relevant cluster centers).
3. PostgreSQL Connection Pool Too Small
Phenomenon: After scaling agent instances from 5 to 20, PostgreSQL connections are still only 10, causing many requests to queue.
Optimization Plan: Dynamically expand the connection pool proportional to the number of agent instances. Upper limit should not exceed the database’s acceptable number of connections (usually 500 or less).
8.2 Performance Optimization Test Data
Below are internal stress test results (8-core 32G machine, 100k memories, concurrency 30):
| Optimization Measure | Average Response Time | Memory Usage | Notes |
|---|---|---|---|
| No optimization (full table scan) | 1200ms | 600MB | Unacceptable |
| Add indexes + connection pool | 230ms | 650MB | Acceptable |
| Async writes + cache layer | 85ms | 700MB | Recommended |
| Cache layer + vector index sharding | 45ms | 800MB | High throughput scenario |
Cache layer uses Redis, storing the most recent 1000 high-frequency memories to avoid repeated PostgreSQL queries. On writes, first write to Redis (1ms), then asynchronously batch write to PostgreSQL (every 5 seconds or when 20 items have accumulated).
9. Summary and Expansion Directions
9.1 Key Technical Points Review
This article has completely covered the setup process for agent long-term persistent memory, with key points as follows:
Layered Memory Architecture: Short-term memory (conversation context), working memory (task state stack), long-term memory (knowledge base), each layer with its own responsibilities and lifecycle.
Short-term Memory Window Management: Two strategies—sliding window truncation and summary compression—choose or combine based on business scenario.
Long-term Memory Multi-Modal Storage: Vector database for semantic retrieval, PostgreSQL for structured transactional storage, dual-write ensures data integrity and retrieval efficiency.
Lifecycle Management: LLM evaluates importance → sets TTL → scheduled cleanup strategy to avoid storage bloat and retrieval noise.
Retrieval Enhancement Fusion: Short-term + long-term memory hybrid input, RRF fusion sorting, time decay factor improves relevance.
Performance Optimization: Async writes, connection pools, cache layer, index sharding to ensure response stability under high concurrency.
9.2 Implementation Suggestions
Implement Short-term Memory First: Most agent applications perform better in the short term (within a single session). Short-term memory is the foundation and can be integrated within two weeks.
Layer in Long-term Memory Gradually: Start with structured data like user preferences and business rules, then expand to free-text memory extraction. Do not launch all features at once.
Monitor Memory Hit Rate: Log whether each retrieval returns valid memory. Hit rates below 30% indicate that the retrieval strategy or memory population strategy needs optimization.
Reserve Extension Interfaces: Future integration of multi-modal memory (images/audio) may be needed. In storage design, use extensible field types like JSONB (PostgreSQL).
9.3 Expansion Directions
- Multi-modal Memory: Agents can remember not only text but also images from conversations (e.g., product diagrams) and audio (e.g., customer tone). This has practical value for industrial after-sales, design review, etc.
- Memory-Based Proactive Reasoning: The agent actively retrieves long-term memory during idle time, discovers unresolved issues previously mentioned by the user, and proactively initiates follow-up.
- Federated Memory: Multiple agent instances share a long-term memory repository, but data security and conflict resolution need to be addressed.
For example, Customer Service Agent A and Agent B encounter the same customer; how to avoid memory contradictions.
9.4 Reference Materials
- LangGraph Official Memory Documentation: [Link to be added in internal wiki]
- PostgreSQL AI Extensions: pgvector, pg_later
- Vector Database Selection Comparison: FAISS vs Chroma vs Milvus Performance Benchmark Table
- LangChain Memory Module Source Code Analysis
Document completed. If you have any questions or need additional details, please contact the author or the respective team.
Summary
Through this article, you should now have a deeper understanding of “Building Long-Term Memory for Agents.” It is recommended to practice more with actual projects. If you have any questions, feel free to discuss!