Practical Guide to Building Long-Term Persistent Memory for Agents - Internal Knowledge Base Document

1. Introduction

In real-world agent deployments, stateless design is the primary cause of degraded multi-turn interaction quality. Repetitive user questions, agents forgetting previous instructions, and the inability to leverage historical experience across sessions—these problems fundamentally stem from a missing memory system. This article revolves around the theory of memory layering, uses LangGraph as the main code thread, explains the complete setup process for short-term and long-term memory, and introduces PostgreSQL as a structured persistence solution.

Upon completion, readers will master: the architectural layering design of memory systems, context window management and state maintenance for short-term memory, vectorized storage and semantic retrieval for long-term memory, and lifecycle management strategies for memory data.

2. Memory Layering Model: Architectural Design from Short-Term to Long-Term

2.1 Three-Level Memory Model

The design of an agent’s memory system draws from the theory of memory layering in cognitive science. In practice, we generally divide it into three levels:

Short-term Memory is responsible for storing the raw input and output within the current session, typically the historical messages within the LLM’s context window. Its characteristics are frequent reading and writing, limited capacity (constrained by the model’s token limit), and it is mainly used to maintain the coherence of a single conversation.

Working Memory tracks the status of tasks the agent is currently executing. For example, if a user asks “first check the weather, then book a hotel,” working memory maintains a task stack, recording which step is currently being executed and which subtasks have been completed. When the agent needs to temporarily switch to a subtask (e.g., checking COVID testing policies while checking the weather), working memory ensures the main task can be resumed.

Long-term Memory stores important knowledge across sessions, including user preferences, business rules, and key summaries of historical conversations. The carriers for long-term memory are typically external storage systems (vector databases, relational databases) that need to support persistence, efficient retrieval, and expiration cleanup.

The collaborative relationship among the three levels during agent runtime is as follows: After receiving user input, the agent first retrieves relevant context from long-term memory, concatenates it with short-term memory, and sends it to the LLM. Meanwhile, working memory manages the execution flow of the current task stack. Any valuable interaction result, after importance evaluation, is written back to long-term memory.

2.2 Enterprise Necessity of Long-Term Memory

In enterprise agent scenarios, long-term memory is not just a feature enhancement but a prerequisite for system availability. Below are typical cross-session requirements:

User Preference Learning: User A prefers “detailed technical documentation,” User B prefers “concise summary descriptions.” Long-term memory needs to record these preferences and automatically apply them in subsequent interactions.
Knowledge Reuse: A department confirmed a specific technical solution during client communication last month. During the next interaction, the agent should be able to reference the historical minutes to avoid redundant discussion.
Contextual Reasoning: A user asks about “login failure issues” on Monday and then asks “Do I still need to update the certificate?” on Tuesday. The agent needs to associate this with the login failure scenario from the previous day’s long-term memory to provide the correct advice.

Without long-term memory, an agent is essentially an empty shell model that reinitializes with every conversation, unable to accumulate experience or build user trust.

3. Short-Term Memory Implementation: Context Window Management and State Maintenance

3.1 Sliding Window Truncation Strategy

The most basic management strategy for short-term memory is the sliding window—keeping only the most recent N rounds of conversation. Suppose the business requires the agent to remember the last 5 rounds of user input and assistant responses, then the window size is 10 messages (5 inputs, 5 outputs). Messages outside this range are simply discarded.

Below is an example implementation of BufferWindowMemory based on LangGraph:

from typing import List, Dict

class BufferWindowMemory:
    """Sliding window short-term memory, keeping the last N rounds of conversation"""
    def __init__(self, max_turns: int = 5):
        self.messages: List[Dict] = []
        self.max_turns = max_turns
    
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Trim window: exceed max_turns * 2 (user+assistant) then discard earliest
        if len(self.messages) > self.max_turns * 2:
            self.messages = self.messages[-(self.max_turns * 2):]
    
    def get_context(self) -> List[Dict]:
        return self.messages

Note: This method is simple and efficient but has two limitations: (1) if a turn is very long (e.g., the user pastes an entire log), the window can quickly be filled by a single message; (2) the model cannot access earlier context that has been discarded, potentially leading to “forgetting” issues.

3.2 Summary Compression Strategy

To address the “one-size-fits-all” problem of the sliding window, we can introduce ConversationSummaryMemory—regularly generating summaries of earlier conversations and replacing the raw messages in the window.

from langchain.memory import ConversationSummaryMemory
from langchain.llms import OpenAI

summary_memory = ConversationSummaryMemory(
    llm=OpenAI(model="gpt-4"),
    max_token_limit=500  # Summary not exceeding 500 tokens
)

# After each turn, automatically generate/update summary
summary_memory.save_context(
    {"input": "What's the weather like in Beijing?"}, 
    {"output": "It's sunny today, temperature 20°C."}
)
# When retrieving, returns: summary (earlier) + raw messages (recent)
context = summary_memory.buffer

This approach is more stable in long conversations but increases LLM call overhead (needs a summary for each save). In practice, it’s recommended to only generate a summary when window truncation is triggered, rather than after every turn.

3.3 Boundaries of State Management

Although short-term memory can maintain conversational coherence, it cannot solve the “forgetting” problem across sessions. For example, if a user returns the next day and asks, “What do you think of the plan I mentioned yesterday?” short-term memory will have been cleared. Long-term memory must be relied upon to provide the previous day’s discussion. This is the boundary of short-term memory’s responsibility—it is limited to the current session and should not manage cross-session knowledge.

4. Long-Term Memory Storage Solution: Combined Vector Database + PostgreSQL Design

4.1 Dual-Carrier Storage Architecture

Long-term memory needs to support two retrieval modes simultaneously: semantic retrieval (fuzzy search for “experiences about database optimization”) and structured retrieval (exact query for “User A’s configuration preferences”). A single storage solution often cannot achieve both. In practice, we use a combined approach of vector databases (FAISS/Chroma) and relational databases (PostgreSQL):

Storage Carrier	Applicable Scenario	Advantages	Disadvantages
Vector Database	Semantic similarity search	Fast retrieval, supports embedding	Weak structured querying
PostgreSQL	Exact queries, transaction management, reporting	Strong transactions, ACID	Full-text search not as good as vector retrieval

4.2 Dual-Write Model Design

When the agent extracts a piece of memory to be stored, it simultaneously writes to both systems:

Convert the memory content into a vector (via an embedding model like text-embedding-ada-002) and store it in the vector database index.
Write the memory’s raw text, entity tags, timestamp, importance score, and other structured fields into the PostgreSQL table.

When retrieving, first get the Top K similar memory list from the vector database, then use PostgreSQL to retrieve the complete fields of those memories. This balances retrieval speed and data integrity.

Example of core PostgreSQL table structure:

CREATE TABLE agent_memories (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(64) NOT NULL,         -- User/session identifier
    memory_key VARCHAR(256) NOT NULL,      -- Memory entity (e.g., "user_preference_office_hours")
    memory_value TEXT NOT NULL,            -- Memory content
    importance FLOAT DEFAULT 0.5,          -- Importance score [0,1]
    vector_id VARCHAR(64),                 -- Corresponding index ID in vector database
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,                  -- TTL expiration time
    is_active BOOLEAN DEFAULT TRUE
);

CREATE INDEX idx_memories_user ON agent_memories(user_id);
CREATE INDEX idx_memories_key ON agent_memories(memory_key);
CREATE INDEX idx_memories_expires ON agent_memories(expires_at) WHERE is_active = TRUE;

Engineering tip: When writing to PostgreSQL, use a connection pool (e.g., psycopg2‘s ThreadedConnectionPool) to avoid the overhead of creating a new connection for each storage operation. For vector database writes, it is recommended to use asynchronous APIs to avoid blocking the main flow.

5. Memory Lifecycle Management: Importance Evaluation and Automatic Cleanup

5.1 Why Lifecycle Management is Needed

Uncontrolled growth of long-term memory leads to two problems: storage bloat (increasing costs) and retrieval noise (a large number of irrelevant memories interfering with LLM reasoning). Memories must be graded and have expiration cleanup.

5.2 LLM-Based Importance Evaluation

Use an LLM to evaluate whether the information in the user input is worth storing in long-term memory. Example evaluation prompt:

You are a memory scoring system. Determine whether the following user message contains information "worth remembering long-term".
Evaluation criteria (0-1):
- 0.0~0.3: Casual chat, greetings, temporary instructions, no need to record
- 0.4~0.6: Temporary preferences or one-off phrases, no need for long-term storage
- 0.7~0.9: Clear personal preferences, business rules, project information, worth storing
- 1.0: Key decisions, configuration changes, security policies, must be permanently retained

User message: {user_input}

Return only a floating-point number between 0 and 1.

Set TTL (Time To Live) based on the score:

def calculate_ttl(importance: float) -> int:
    if importance >= 0.9:
        return -1  # Permanent retention
    elif importance >= 0.7:
        return 90 * 24 * 3600  # 90 days
    elif importance >= 0.4:
        return 30 * 24 * 3600  # 30 days
    else:
        return 7 * 24 * 3600   # 1 week

Note: Do not evaluate every message; only those with clearly declarative information (e.g., “I live in Beijing”) to avoid unnecessary LLM calls.

5.3 Scheduled Cleanup Script

Execute the cleanup task early every morning:

import psycopg2
from datetime import datetime, timedelta

def clean_expired_memories():
    conn = psycopg2.connect("dbname=agent_memory")
    cur = conn.cursor()
    # Hard delete: clean expired and low-importance memories
    cur.execute("""
        DELETE FROM agent_memories
        WHERE expires_at IS NOT NULL
          AND expires_at < NOW()
          AND importance < 0.7
    """)
    # Soft mark: for high-importance memories, extend TTL instead of direct deletion
    cur.execute("""
        UPDATE agent_memories
        SET expires_at = NOW() + INTERVAL '90 days'
        WHERE expires_at < NOW()
          AND importance >= 0.7
          AND is_active = TRUE
    """)
    conn.commit()
    cur.close()
    conn.close()

Strategy philosophy: high-value memories are only extended, never deleted; low-value memories are purged when they expire. Also keep an audit log of deletion operations for traceability.

6. Adding Long-Term Memory to Agents: Practical Memory Module Implementation in LangGraph

6.1 Embedding Memory Nodes in the Graph

LangGraph allows encapsulating memory read/write operations as independent graph nodes that are automatically triggered during the agent’s execution flow. Below is a typical design:

import langgraph as lg
from langgraph.executor import AgentExecutor
from langgraph.prompt import SystemMessage
from typing import Dict, Any

# 1. Define memory extraction node
def extract_memory(state: Dict[str, Any]) -> Dict[str, Any]:
    """Identify storable information from the current user input and write to PostgreSQL"""
    user_input = state["current_input"]
    importance = llm_evaluate_importance(user_input)
    if importance >= 0.4:
        # Extract entity and preference (using NER or simple rules)
        memory_key = extract_entity(user_input)  # e.g., "user_location"
        memory_value = user_input
        write_to_postgres(state["user_id"], memory_key, memory_value, importance)
        state["memory_extracted"] = True
    return state

# 2. Define memory retrieval node
def retrieve_memory(state: Dict[str, Any]) -> Dict[str, Any]:
    """Retrieve related memories before graph execution, inject into System Prompt"""
    user_input = state["current_input"]
    # Vector retrieval
    similar_memories = vector_store.search(user_input, top_k=3)
    # Structured retrieval
    exact_memories = query_postgres(state["user_id"], keys=["user_preference"])
    # Merge memories
    merged = format_memories(similar_memories + exact_memories)
    # Inject into System Prompt
    state["system_prompt_suffix"] = f"Below are memories based on user history: {merged}"
    return state

# 3. Build LangGraph
graph = lg.Graph()
graph.add_node("retrieve_memory", retrieve_memory)
graph.add_node("agent_decision", agent_llm_call)
graph.add_node("execution_loop", tool_executor)
graph.add_node("extract_memory", extract_memory)
graph.add_edge("retrieve_memory", "agent_decision")
graph.add_edge("agent_decision", "execution_loop")
graph.add_edge("execution_loop", "extract_memory")
graph.set_entry_node("retrieve_memory")
graph.set_finish_node("extract_memory")
executor = AgentExecutor(graph)

6.2 Key Engineering Details

Retrieval Timing: The retrieve_memory node is placed at the very beginning of the graph execution to ensure the LLM has historical context before making decisions. The extract_memory write node is placed after the graph execution ends to avoid interfering with the current round’s logic.
Deduplication Strategy: When the same user repeatedly generates memories for the same entity, it should be updated rather than appended. Use ON CONFLICT statements in PostgreSQL for upsert.
Error Handling: If PostgreSQL or the vector database write fails, the entire agent execution should not fail. Catch exceptions inside the extract_memory node and define a degradation strategy using graph.set_error_handler (e.g., log the error and continue execution).

7.1 Hybrid Memory Input

The LLM’s input includes two parts of memory: current session short-term memory (raw messages) and long-term memory retrieval results. How they are fused directly impacts reasoning quality. Recommended strategy:

1
2
3

Final input = Short-term memory (raw messages, keeping last N rounds)
            + Long-term memory (Top K relevant, sorted by importance)
            + Working memory (current task stack status)

In LangChain, this can be implemented via a custom Memory class:

from langchain.memory import CombinedMemory
from langchain.schema import BaseMemory

class HybridMemory(BaseMemory):
    short_memory: BufferWindowMemory
    long_retriever: Any  # Vector retriever
    
    @property
    def memory_variables(self) -> List[str]:
        return ["short_context", "long_context"]
    
    def load_memory_variables(self, inputs: Dict) -> Dict:
        short_context = self.short_memory.get_context()
        query = inputs.get("input", "")
        long_context = self.long_retriever.get_relevant_documents(query)
        return {
            "short_context": short_context,
            "long_context": [d.page_content for d in long_context]
        }

7.2 Fusion Sorting and Re-Ranking

When using both semantic retrieval and keyword retrieval, the results need to be fused and sorted. RRF (Reciprocal Rank Fusion) is a lightweight fusion method:

def rrf_fusion(results: List[List[Any]], k: int = 60) -> List[Any]:
    scores = {}
    for rank, doc in enumerate(results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (rank + k)
    return sorted(scores.keys(), key=lambda x: scores[x], reverse=True)

Re-ranking: After fusion, use a lightweight model (e.g., cross-encoder) to re-score and sort the Top 20 results, further compressing noise. This step can be omitted in latency-sensitive scenarios but is recommended for offline batch scenarios.

7.3 Time Decay Factor

The timeliness of memory is also important. A “user preference” from a year ago has different reliability compared to one from a day ago. Add a time decay factor during retrieval:

1
2
3

def time_decay_score(original_score: float, age_hours: float) -> float:
    decay_factor = 0.98  # Decay 2% per hour
    return original_score * (decay_factor ** age_hours)

For business rule memories, the decay factor is usually set to 1 (no decay); for temporary preference memories, decay can be faster.

8. Pitfall Logging and Performance Optimization Practices

8.1 Common Issue Checklist

1. Memory Write/Read Bottleneck

Phenomenon: Every time the agent executes, it must first retrieve long-term memory. Under high concurrency, PostgreSQL query latency increases from 1ms to 200ms, doubling the overall pipeline latency.

Cause: Connection pool not used, no partition indexing on the memory table.

Optimization Plan:

Force use of connection pool (e.g., psycopg2.pool.ThreadedConnectionPool), min connections = concurrency × 2.
Hash-partition by user_id, each partition with its own index.

2. High Vector Dimension Leading to Retrieval Latency

Phenomenon: Using a 1536-dimensional embedding model, retrieval on 1 million records takes over 300ms, failing real-time conversation requirements.

Optimization Plan:

Dimensionality reduction: compress embedding from 1536 to 256 dimensions (using PCA or autoencoder), sacrificing a small amount of precision for more than 5x speed improvement.
Index sharding: use IVF (Inverted File) index, set nprobe=20 (only search 20 most relevant cluster centers).

3. PostgreSQL Connection Pool Too Small

Phenomenon: After scaling agent instances from 5 to 20, PostgreSQL connections are still only 10, causing many requests to queue.

Optimization Plan: Dynamically expand the connection pool proportional to the number of agent instances. Upper limit should not exceed the database’s acceptable number of connections (usually 500 or less).

8.2 Performance Optimization Test Data

Below are internal stress test results (8-core 32G machine, 100k memories, concurrency 30):

Optimization Measure	Average Response Time	Memory Usage	Notes
No optimization (full table scan)	1200ms	600MB	Unacceptable
Add indexes + connection pool	230ms	650MB	Acceptable
Async writes + cache layer	85ms	700MB	Recommended
Cache layer + vector index sharding	45ms	800MB	High throughput scenario

Cache layer uses Redis, storing the most recent 1000 high-frequency memories to avoid repeated PostgreSQL queries. On writes, first write to Redis (1ms), then asynchronously batch write to PostgreSQL (every 5 seconds or when 20 items have accumulated).

9. Summary and Expansion Directions

9.1 Key Technical Points Review

This article has completely covered the setup process for agent long-term persistent memory, with key points as follows:

Layered Memory Architecture: Short-term memory (conversation context), working memory (task state stack), long-term memory (knowledge base), each layer with its own responsibilities and lifecycle.
Short-term Memory Window Management: Two strategies—sliding window truncation and summary compression—choose or combine based on business scenario.
Long-term Memory Multi-Modal Storage: Vector database for semantic retrieval, PostgreSQL for structured transactional storage, dual-write ensures data integrity and retrieval efficiency.
Lifecycle Management: LLM evaluates importance → sets TTL → scheduled cleanup strategy to avoid storage bloat and retrieval noise.
Retrieval Enhancement Fusion: Short-term + long-term memory hybrid input, RRF fusion sorting, time decay factor improves relevance.
Performance Optimization: Async writes, connection pools, cache layer, index sharding to ensure response stability under high concurrency.

9.2 Implementation Suggestions

Implement Short-term Memory First: Most agent applications perform better in the short term (within a single session). Short-term memory is the foundation and can be integrated within two weeks.
Layer in Long-term Memory Gradually: Start with structured data like user preferences and business rules, then expand to free-text memory extraction. Do not launch all features at once.
Monitor Memory Hit Rate: Log whether each retrieval returns valid memory. Hit rates below 30% indicate that the retrieval strategy or memory population strategy needs optimization.
Reserve Extension Interfaces: Future integration of multi-modal memory (images/audio) may be needed. In storage design, use extensible field types like JSONB (PostgreSQL).

9.3 Expansion Directions

Multi-modal Memory: Agents can remember not only text but also images from conversations (e.g., product diagrams) and audio (e.g., customer tone). This has practical value for industrial after-sales, design review, etc.
Memory-Based Proactive Reasoning: The agent actively retrieves long-term memory during idle time, discovers unresolved issues previously mentioned by the user, and proactively initiates follow-up.
Federated Memory: Multiple agent instances share a long-term memory repository, but data security and conflict resolution need to be addressed.

For example, Customer Service Agent A and Agent B encounter the same customer; how to avoid memory contradictions.

9.4 Reference Materials

LangGraph Official Memory Documentation: [Link to be added in internal wiki]
PostgreSQL AI Extensions: pgvector, pg_later
Vector Database Selection Comparison: FAISS vs Chroma vs Milvus Performance Benchmark Table
LangChain Memory Module Source Code Analysis

Document completed. If you have any questions or need additional details, please contact the author or the respective team.

Summary

Through this article, you should now have a deeper understanding of “Building Long-Term Memory for Agents.” It is recommended to practice more with actual projects. If you have any questions, feel free to discuss!