Integrating an Agent with a Local Knowledge Base for Agentic RAG

1. Introduction

Traditional RAG lacks proactive reasoning and multi-step retrieval in complex Q&A scenarios. This article shows how to integrate an Agent with a local knowledge base RAG system to make question answering more capable. It covers four topics: the core concepts of Agentic RAG; building a local RAG knowledge base with LangChain4j + PGVector; the steps for integrating an Agent, including multi-turn conversations; and an advanced approach that uses DSPy to optimize retrieval and generation quality.

2. Core Concepts: RAG and Agentic RAG

2.1 RAG Fundamentals

RAG (Retrieval-Augmented Generation) is a technical paradigm that combines information retrieval with text generation. Its core process consists of three steps:

Retrieve: When a user asks a question, the system first encodes the question into a vector, then performs an approximate nearest neighbor (ANN) search in the knowledge base’s vector index to retrieve document fragments most semantically relevant to the question.
Augment: The retrieved document fragments are concatenated with the user’s original question as contextual information to form an augmented prompt.
Generate: The augmented prompt is fed into a large language model, which generates the final answer based on the question and contextual information.
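
The three steps can be made concrete with a small, dependency-free Java sketch. It is only an illustration under simplifying assumptions: the MiniRag class and its method names are hypothetical, the "embedding" is a bag-of-words term count rather than a real embedding model, the search is exact rather than ANN, and the Generate step is represented only by the prompt that would be sent to an LLM.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MiniRag {

    // Toy "embedding": lower-cased term counts. A real system would call an embedding model.
    static Map<String, Integer> embed(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    static double norm(Map<String, Integer> v) {
        return Math.sqrt(v.values().stream().mapToDouble(x -> x * x).sum());
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double denominator = norm(a) * norm(b);
        return denominator == 0 ? 0 : dot / denominator;
    }

    // Retrieve: exact top-k search by cosine similarity (a vector index would use ANN instead).
    static List<String> retrieve(String question, List<String> chunks, int k) {
        Map<String, Integer> q = embed(question);
        List<String> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((String c) -> cosine(q, embed(c))).reversed());
        return sorted.subList(0, Math.min(k, sorted.size()));
    }

    // Augment: concatenate the retrieved chunks with the original question.
    static String augment(String question, List<String> context) {
        return "Answer using only the context below.\n\nContext:\n"
                + String.join("\n---\n", context)
                + "\n\nQuestion: " + question;
    }
}
```

For example, MiniRag.retrieve(question, chunks, 3) returns the three chunks most similar to the question, and MiniRag.augment(question, thoseChunks) builds the prompt that the Generate step would pass to the LLM.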

The main advantages of RAG are reflected in three aspects:

Reduced Hallucination: By providing factual evidence through an external knowledge base, the LLM does not need to rely on its own parametric memory, significantly reducing the probability of fabricating information.
Knowledge Updates: Only the knowledge base documents or vector index need to be updated, without retraining or fine-tuning the model, making the cost much lower than full fine-tuning.
Domain-Specific Knowledge: Enterprises can build domain-specific knowledge bases (such as technical documentation, internal policies, product manuals) to enable general LLMs to answer questions in vertical domains.

In practice, RAG systems also need to handle document chunking strategies, embedding model selection, retrieval result re-ranking, and other issues. For example, document chunking should consider the integrity of paragraph semantics, embedding model selection should be based on language, text length, and budget, and retrieval results can use RRF (Reciprocal Rank Fusion) to fuse vector retrieval and keyword BM25 results to improve recall quality.

2.2 Evolution to Agentic RAG

Although traditional RAG solves the problem of external knowledge, it still has significant limitations in complex Q&A scenarios. Take a query like “Find the part about R&D investment in the 2024 financial report, compare it with the 2023 data, and give the growth rate.” Traditional RAG may not be able to complete multi-condition filtering and cross-document comparison in a single step. This is exactly the problem Agentic RAG aims to solve.

Agentic RAG combines large language models with intelligent Agents. Instead of passively performing a single retrieval and then generating, the Agent actively thinks, plans, executes, and interacts with the user like an intelligent assistant to ultimately complete the task. Specifically, Agentic RAG introduces the following capabilities:

Proactive Retrieval Strategy Planning: The Agent can decompose a complex problem into multiple sub-queries. For example, for the financial question above, the Agent might first retrieve “2024 financial report R&D investment,” then retrieve “2023 financial report R&D investment,” and then ask another tool to calculate the growth rate.
Using Tools: The Agent can invoke multiple tools, such as local RAG knowledge base queries, web searches, API calls, code executors, database queries, etc. Each tool has a clear description and interface definition.
Managing Multi-Turn Conversation Context: The Agent can remember the conversation history and automatically associate entities or constraints mentioned earlier with subsequent questions. For example, if a user asks, “What is the specific cost of the plan mentioned earlier?” the Agent needs to identify from the history which plan “the plan mentioned earlier” refers to.
Error Recovery and Exception Handling: When a tool call fails or returns an abnormal result, the Agent can attempt alternative strategies or ask the user for clarification.

Several mature frameworks already support building Agentic RAG, such as Microsoft’s AutoGen (for building multi-agent workflows based on LLMs), LangChain4j (an Agent framework in the Java ecosystem), and LangChain (an Agent framework in the Python ecosystem).

2.3 Scenario Comparison

Scenario Characteristics	Recommended Solution	Description
Simple single-fact query	Traditional RAG	e.g., “What is the company’s registered address?”
Multi-condition filtering	Agentic RAG	e.g., “Find customers with sales exceeding 1 million in the East China region in Q1 2024”
Requires multi-step reasoning	Agentic RAG	e.g., “Analyze the trend of gross margin changes over the last three quarters”
Involves external system calls	Agentic RAG	e.g., “After checking order status, call the CRM system to update the customer tier”
Sensitive to latency	Traditional RAG	The multi-step reasoning of Agent increases response time

3. Building a Local Knowledge Base: Based on LangChain4j + PGVector

3.1 Rationale for Technology Selection

When building a local knowledge base for RAG, technology selection needs to consider factors such as ecosystem compatibility, data consistency, and deployment complexity. LangChain4j is an LLM integration framework for the Java/Spring ecosystem that provides convenient APIs for RAG workflows. PGVector is a vector retrieval extension for PostgreSQL that allows storing and querying vectors directly within a relational database.

Key reasons for choosing PGVector:

Data Consistency: Vector data and business relational data are stored in the same database, making it easier to maintain metadata relationships and transactional operations.
Enterprise-Friendly Deployment: Most enterprises already have PostgreSQL deployed internally, eliminating the need to introduce additional specialized vector databases (such as Milvus, Qdrant), reducing operational complexity.
Functional Completeness: Supports multiple index types (IVFFlat, HNSW), multiple distance metrics (cosine, Euclidean, dot product), with acceptable performance at the million-vector scale.

3.2 Outline of Build Steps

Step 1: Install PGVector and Create the Vector Table

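Ensure PostgreSQL version ≥ 12 (newer pgvector releases may require a more recent PostgreSQL version, so check the pgvector README for the release you install). With the pgvector package installed on the server, enable the extension in the target database:

CREATE EXTENSION IF NOT EXISTS vector;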

When creating the vector table, define the column type based on the output dimension of the embedding model. For example, using the bge-m3 model (output 1024 dimensions):

CREATE TABLE knowledge_chunks (
    id SERIAL PRIMARY KEY,
    chunk_content TEXT NOT NULL,
    metadata JSONB,
    embedding vector(1024)
);

Note: The dimension in vector(1024) must exactly match the output dimension of the embedding model.

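Step 2: Configure LangChain4j’s Vector Store

Use LangChain4j’s PgVectorEmbeddingStore to connect to PostgreSQL. By default the store creates and manages its own table layout, so a hand-written table such as the one in Step 1 may not have the column names it expects; verify the expected schema against the LangChain4j pgvector documentation for your version, or let the store create the table itself: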

EmbeddingStore<TextSegment> embeddingStore = PgVectorEmbeddingStore.builder()
    .host("localhost")
    .port(5432)
    .database("knowledge_base")
    .user("postgres")
    .password("password")
    .table("knowledge_chunks")
    .dimension(1024)
    .build();

For quick verification during development, you can also use InMemoryEmbeddingStore:

EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

Step 3: Document Chunking and Embedding Model Selection

The strategy for document chunking directly affects retrieval quality. Practical recommendations:

Text documents: Use natural paragraph breaks, preserving context such as headings and subheadings in each chunk.
Tabular data: Process separately, preserving headers and row/column structure.
Code snippets: Chunk by function or class, preserving comments.

Example chunking parameters:

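import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;

// Recursive splitter: splits by paragraph first and falls back to smaller units when a piece is too long.
DocumentSplitter splitter = DocumentSplitters.recursive(
    1000, // maximum segment size in characters
    200   // maximum overlap between adjacent segments in characters (illustrative value)
);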

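Recommended embedding models: choose open-source models that support the languages in your documents, such as bge-m3 (by BAAI, multilingual, 1024-dimensional output) or BAAI/bge-large-zh-v1.5 (a good first choice for Chinese-only content). An embedding model is used in LangChain4j as shown below. Note that the in-process BgeSmallZhEmbeddingModel in this example is a small model with a lower output dimension than bge-m3, so the vector column and the .dimension(...) setting must be changed to match whichever model you actually use:

// In-process model; remote models served through OpenAI or Ollama can be used instead
EmbeddingModel embeddingModel = new BgeSmallZhEmbeddingModel();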

Step 4: Write to Vector Store and Verify Retrieval Performance

List<TextSegment> segments = splitter.split(document);
for (TextSegment segment : segments) {
    Embedding embedding = embeddingModel.embed(segment.text()).content();
    embeddingStore.add(embedding, segment);
}

For retrieval validation, focus on the Recall@k metric: for each test question, the fraction of its known relevant chunks that appear among the top-k retrieved results, averaged over all test questions. As a rule of thumb for common knowledge Q&A scenarios, aim for Recall@5 above 0.85.
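
As a minimal sketch of how this metric could be computed for a single question, assuming each chunk has a string ID and the relevant IDs have been labeled by hand (the RecallAtK class and its parameter names are illustrative, not part of any library):

```java
import java.util.List;
import java.util.Set;

final class RecallAtK {

    // rankedIds: chunk IDs in the order the retriever returned them.
    // relevantIds: hand-labeled IDs of the chunks that answer the question.
    static double recallAtK(List<String> rankedIds, Set<String> relevantIds, int k) {
        if (relevantIds.isEmpty()) {
            return 0.0; // nothing to find; callers may prefer to skip such questions
        }
        long hits = rankedIds.stream()
                .limit(k)
                .distinct()
                .filter(relevantIds::contains)
                .count();
        return (double) hits / relevantIds.size();
    }
}
```

Averaging this value over the whole labeled question set gives the Recall@k figure to track between index or model changes.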

3.3 Key Technical Details

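Connection Pool Configuration: In production, use a connection pool such as HikariCP; a minimum of 5 idle connections and a maximum pool size of 30 is a reasonable starting point to tune against your load.
Similarity Search: PGVector provides the operators <-> (L2 distance), <=> (cosine distance), and <#> (negative inner product, so that smaller values still mean more similar). For semantic retrieval, cosine distance is the most commonly used metric because it is insensitive to vector magnitude.
Retrieval Result Re-Ranking (RRF Fusion): When using both vector retrieval and BM25 keyword retrieval, fuse the two rankings with Reciprocal Rank Fusion, which scores each document as the sum over rankings of 1 / (k + rank), where rank is the document’s 1-based position in that ranking and k is a smoothing constant (60 is a common choice):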

// Pseudo code: get vector retrieval results and BM25 results separately
List<ScoredDocument> vectorResults = vectorStore.search(...);
List<ScoredDocument> keywordResults = bm25Search(...);
// RRF fusion
List<ScoredDocument> fusedResults = ReciprocalRankFusion.merge(vectorResults, keywordResults);
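
The pseudo code above leaves the fusion step abstract. A minimal, self-contained sketch of Reciprocal Rank Fusion over lists of document IDs might look like the following; the RrfFusion class is a hypothetical helper written for this article, not a LangChain4j API, and the alphabetical tie-break is an arbitrary choice made only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RrfFusion {

    // Each inner list is one ranking (best first). A document's fused score is
    // the sum over rankings of 1 / (k + rank), with rank starting at 1.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a)); // higher score first
            return byScore != 0 ? byScore : a.compareTo(b);             // deterministic tie-break
        });
        return fused;
    }
}
```

Because RRF uses only rank positions, the vector similarity scores and the BM25 scores never need to be put on a common scale.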

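Tip: During initial setup, an IVFFlat index is usually sufficient for millisecond-level retrieval up to around the million-vector scale. Size the lists parameter to the table: the pgvector README suggests starting with rows / 1000 for up to 1 million rows and sqrt(rows) beyond that, so lists = 100 suits a table of roughly 100,000 rows. If the data volume grows to many millions of vectors, consider an HNSW index (for example, with ef_construction = 200).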

4. Agent Integration Practice: Using Local RAG as an Agent Tool

4.1 Agent Architecture Design

A typical internal Agent structure includes:

Main Agent: Responsible for parsing user input, deciding which tools to call, and managing multi-turn conversation context.
Tool Registry: A list of all available tools, each with a unique name, description, and calling interface.
Memory Module: Maintains conversation history to provide contextual information to the Agent.

The local RAG knowledge base is registered as a retrieval tool within the Agent. When the Agent determines that the user’s question requires a knowledge base query, it calls this tool.
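
To illustrate the registry idea independently of any framework, here is a minimal sketch in plain Java (16 or later, because it uses a record). ToolRegistry, ToolSpec, and the method names are assumptions made for illustration; in LangChain4j the same role is played by @Tool-annotated methods registered through AiServices.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class ToolRegistry {

    // One registered tool: unique name, description for the LLM, and the calling interface.
    record ToolSpec(String name, String description, Function<String, String> handler) {}

    private final Map<String, ToolSpec> tools = new LinkedHashMap<>();

    void register(String name, String description, Function<String, String> handler) {
        tools.put(name, new ToolSpec(name, description, handler));
    }

    // Text the main Agent could place in its prompt so the LLM can choose a tool.
    String describeTools() {
        StringBuilder sb = new StringBuilder();
        for (ToolSpec tool : tools.values()) {
            sb.append("- ").append(tool.name()).append(": ").append(tool.description()).append('\n');
        }
        return sb.toString();
    }

    // Dispatches a tool call chosen by the LLM; unknown names yield a message instead of an exception.
    String call(String name, String argument) {
        ToolSpec tool = tools.get(name);
        if (tool == null) {
            return "Unknown tool: " + name;
        }
        return tool.handler().apply(argument);
    }
}
```

The local RAG knowledge base would be registered like any other entry, for example under the name queryLocalKnowledgeBase with a handler that runs the vector search.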

4.2 Main Tool Types

In Agent practice, besides the RAG retrieval tool, the following types of tools are often registered:

RAG Query Tool: Retrieves relevant document fragments from the local knowledge base.
Web Search Tool: Obtains external information via search engines.
Code Execution Tool: Runs Python/SQL code and returns the results.
API Call Tool: Calls internal REST APIs to retrieve business data.

4.3 Implementation Steps

4.3.1 Define the RAG Tool Interface

In LangChain4j, use the @Tool annotation to define a tool:

@Tool("Retrieves relevant information from the local knowledge base based on the query content, returning the most relevant document fragments")
public String queryLocalKnowledgeBase(@P("Query content") String query) {
    // 1. Encode the query text into a vector
    Embedding queryEmbedding = embeddingModel.embed(query).content();

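    // 2. Retrieve the top-k most relevant fragments from the vector store.
    //    Recent LangChain4j releases use search(EmbeddingSearchRequest); older ones offered findRelevant(embedding, k).
    List<EmbeddingMatch<TextSegment>> matches = embeddingStore.search(
        EmbeddingSearchRequest.builder().queryEmbedding(queryEmbedding).maxResults(5).build() // k = 5
    ).matches();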

    // 3. Concatenate retrieval results
    StringBuilder context = new StringBuilder();
    for (EmbeddingMatch<TextSegment> match : matches) {
        context.append(match.embedded().text()).append("\n---\n");
    }

    return context.toString();
}

Note: The tool description should be clear and explicit so that the Agent can correctly determine when to use it. In practice, it helps to add a usage hint to the description (e.g., “Use this tool when the user asks about company policies or product manuals”).

4.3.2 Create the Agent

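Declare an Assistant interface and let AiServices generate the Agent, binding it to the LLM, the tools, and the chat memory:

// Assistant is an application-defined interface; AiServices generates its implementation
interface Assistant {
    String chat(String userMessage);
}

Assistant agent = AiServices.builder(Assistant.class)
    .chatLanguageModel(chatLanguageModel) // Can be OpenAI, Ollama, etc. (renamed chatModel(...) in newer LangChain4j releases)
    .tools(new KnowledgeBaseTool(embeddingModel, embeddingStore)) // Class that holds the @Tool method above
    .chatMemory(MessageWindowChatMemory.withMaxMessages(20))
    .build();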

4.3.3 Execute Multi-Turn Conversation

// First turn
String response1 = agent.chat("Please explain how RAG works");
System.out.println(response1);

// Second turn, leveraging historical context
String response2 = agent.chat("What are its advantages in reducing hallucination?");
System.out.println(response2);
// Agent determines from history that "its" refers to RAG and returns relevant content

4.4 Common Pitfalls and Tuning Tips

Tool Return Value Format: The string returned by a tool should be as structured as possible, including document source or ID for the Agent to reference later. For example, prefix the result with [Source: Document A].
Timeout Settings: Tool calls should have a timeout set (recommended 15 seconds) to prevent the Agent from hanging due to a non-responsive tool.
Exception Handling: Catch exceptions inside the tool method and return a friendly error message, rather than throwing an exception that interrupts the Agent’s execution.
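
The timeout and exception-handling tips can be combined in one small wrapper. The sketch below uses only the JDK; SafeToolCall and the wording of the error messages are assumptions for illustration, and a real tool would pass the recommended 15-second limit as 15000 milliseconds.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class SafeToolCall {

    // Runs the tool body on a background thread and always returns a string,
    // so a slow or failing tool cannot hang or crash the Agent loop.
    static String run(Supplier<String> tool, long timeoutMillis) {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(tool);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; the underlying task itself is not interrupted.
            future.cancel(true);
            return "[Tool error] The knowledge base did not respond in time.";
        } catch (ExecutionException e) {
            return "[Tool error] Retrieval failed: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
            return "[Tool error] Retrieval was interrupted.";
        }
    }
}
```

Inside a @Tool method, the retrieval logic would be passed as the supplier, so that the Agent receives a readable message and can decide whether to retry, use another tool, or ask the user.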

5. Multi-Turn Dialogue Enhancement: Context Fusion with Agent + RAG

5.1 Problem Analysis

In traditional RAG, each query is independent, losing historical conversation information. For example, a user asks, “What was the revenue in Q1 2024?” and then follows up with, “How much did it grow compared to the previous quarter?” Without historical context, RAG cannot know that “previous quarter” refers to Q4 2023.

5.2 Context Fusion Approaches

When constructing a retrieval query, the Agent automatically appends key entities and constraints from the multi-turn conversation. Two common implementation methods exist:

Approach 1: Leverage the LLM’s dialogue summarization capability

Ask the LLM to summarize the previous dialogue and then generate a complete, self-contained description of the current query. LangChain4j’s MessageWindowChatMemory automatically manages the message history, and the stored messages are sent to the LLM along with each new user message.

Before calling the RAG tool, a “query construction” step can be embedded:

// Pseudo code: Before calling the RAG tool, rewrite the query using the conversation history.
// generate(String) is the call in older LangChain4j releases; newer ones name it chat(String).
String optimizedQuery = chatLanguageModel.generate(
    String.format("Based on the following conversation history, generate a complete retrieval query for the current user question.\n" +
                   "Conversation history: %s\nCurrent question: %s",
                   conversationHistory, currentUserInput)
);
// Then use optimizedQuery to call the knowledge base retrieval

Approach 2: Manually maintain structured history

Maintain a structured summary (e.g., HashMap<String, Object>) in the Agent’s context, recording key entities mentioned by the user (such as contract numbers, time ranges, product names). Before calling a tool, extract relevant constraints from the summary and inject them into the query.
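
A minimal sketch of Approach 2 follows, assuming the entities have already been extracted from earlier turns (by the LLM or by rules). ConversationEntities and its methods are hypothetical names, and the injection rule, which appends every remembered value the question does not already contain, is deliberately simple.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class ConversationEntities {

    // Insertion-ordered so constraints appear in the order they were mentioned.
    private final Map<String, String> entities = new LinkedHashMap<>();

    void remember(String key, String value) {
        entities.put(key, value);
    }

    // Appends every remembered constraint whose value the question does not already mention.
    String enrich(String question) {
        String lowerQuestion = question.toLowerCase();
        StringJoiner constraints = new StringJoiner("; ");
        for (Map.Entry<String, String> e : entities.entrySet()) {
            if (!lowerQuestion.contains(e.getValue().toLowerCase())) {
                constraints.add(e.getKey() + ": " + e.getValue());
            }
        }
        return constraints.length() == 0 ? question : question + " (" + constraints + ")";
    }
}
```

With "metric" set to "revenue" and "period" set to "Q1 2024", the follow-up question from section 5.1 is sent to retrieval with both constraints attached, while a question that already names them is passed through unchanged.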

5.3 Implementation Example

With LangChain4j’s ChatMemory (here MessageWindowChatMemory), simply configure it when creating the Agent:

ChatMemory chatMemory = MessageWindowChatMemory.builder()
    .maxMessages(20) // Keep the last 20 messages
    .build();

Assistant agent = AiServices.builder(Assistant.class)
    .chatLanguageModel(chatLanguageModel)
    .tools(new KnowledgeBaseTool(embeddingModel, embeddingStore))
    .chatMemory(chatMemory) // Earlier turns are replayed to the LLM on every call
    .build();

5.4 Precautions

Avoid Overly Long Context: Conversation history consumes tokens quickly as the number of turns increases. You can use a sliding window (e.g., the last 10 turns) or summarize history (keeping only key entities and question summaries).
Periodically Clean Confusing Context: If the user discusses multiple unrelated topics in one session, the Agent might retrieve the wrong context. At each user question, have the Agent determine whether the current query has switched topics.

6. Advanced Tips: Using DSPy to Optimize RAG Retrieval and Generation Performance

6.1 Introduction to DSPy

DSPy (Declarative Self-improving Python) is a declarative framework for building and optimizing language model pipelines, including RAG systems, developed by the Stanford NLP group. Its core idea is that developers declare what each step of the program should do (its inputs and outputs) and supply a metric, and DSPy automatically optimizes the prompts (instructions and few-shot examples) and, optionally, model weights to improve that metric.

Compared to manually tuning prompts, DSPy’s optimization process is more systematic:

Define the input/output of the pipeline.
Declare modules (e.g., retriever, generator).
Provide a small number of labeled examples.
Automatically search for the best optimization strategy (e.g., query rewriting, few-shot example selection, prompt template optimization).

6.2 Typical Optimization Scenarios

Query Rewriting Strategy Optimization: The user’s original query may be too vague (e.g., “Introduce the product” is too broad). DSPy can automatically train a “query rewriter” to rewrite the user’s question into a form more suitable for retrieval (e.g., “Describe the features of the ERP product newly launched by the company in 2024”).

Generation Stage Prompt Optimization: DSPy can optimize the prompt at the generation stage, such as automatically selecting the optimal number of examples for different tasks or adjusting instruction phrasing.

6.3 Integration Path Suggestions

DSPy currently mainly supports the Python ecosystem. For teams using the Java technology stack, consider the following integration paths:

Approach 1: Standalone Python Service: Encapsulate the DSPy-optimized RAG pipeline as a microservice and communicate with the Java Agent via gRPC/HTTP.
Approach 2: Adopt DSPy’s Optimization Results: Train the optimal query rewriting rules or prompt templates locally, then hardcode the rules into the Java Agent.

Example: Using DSPy to optimize query rewriting (Python pseudo code)

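import dspy
from dspy.teleprompt import BootstrapFewShot

# A language model and a retrieval model must be configured first;
# the exact setup depends on the DSPy version and the backends in use.

class RewriteQuery(dspy.Signature):
    """Rewrite the user's original query into a more detailed query suitable for knowledge base retrieval"""
    original_query = dspy.InputField()
    rewritten_query = dspy.OutputField()

class RAGPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.rewrite = dspy.Predict(RewriteQuery)
        self.retrieve = dspy.Retrieve(k=5)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        rewritten = self.rewrite(original_query=question).rewritten_query
        context = self.retrieve(rewritten).passages
        return self.generate(context=context, question=question)

def answer_match(example, prediction, trace=None):
    # Illustrative metric: case-insensitive exact match on the answer text
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# Placeholder training data; replace with a small set of real labeled examples
trainset = [
    dspy.Example(question="...", answer="...").with_inputs("question"),
]

# Automatically optimize with a small set of labeled examples
optimizer = BootstrapFewShot(metric=answer_match, max_bootstrapped_demos=5)
compiled_rag = optimizer.compile(RAGPipeline(), trainset=trainset)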

6.4 Performance Evaluation

The effect of DSPy optimization can be quantified by comparing retrieval and generation metrics before and after optimization:

Metric	Description	Before Optimization	After Optimization
Recall@5	Proportion of correct documents in the top 5 results	0.78	0.91
NDCG@5	Ranking quality, considering document relevance levels	0.65	0.82
Generation Accuracy	Match rate between final answers and human-labeled answers	0.72	0.88

The evaluation dataset should sample 200-500 real user questions with manually labeled correct answers to ensure reliable evaluation.
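
For teams evaluating on the Java side, NDCG@k for one question can be computed with a few lines of plain Java. This is a sketch under stated assumptions: the Ndcg class is hypothetical, relevance grades are integers assigned by annotators (0 = irrelevant), the gain is linear (rel / log2(position + 1)) rather than the exponential variant, and the ideal ordering is derived from the retrieved list only, which is a simplification when relevant documents are missing from the results.

```java
import java.util.Arrays;

final class Ndcg {

    // Discounted cumulative gain of the first k grades, with linear gain.
    static double dcg(int[] relevance, int k) {
        double sum = 0;
        for (int i = 0; i < Math.min(k, relevance.length); i++) {
            double log2OfPositionPlusOne = Math.log(i + 2) / Math.log(2); // position = i + 1
            sum += relevance[i] / log2OfPositionPlusOne;
        }
        return sum;
    }

    // rankedRelevance: graded relevance of each retrieved document, in retrieval order.
    static double ndcgAtK(int[] rankedRelevance, int k) {
        int[] ideal = rankedRelevance.clone();
        Arrays.sort(ideal); // ascending
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) { // reverse to descending
            int tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idealDcg = dcg(ideal, k);
        return idealDcg == 0 ? 0 : dcg(rankedRelevance, k) / idealDcg;
    }
}
```

As with Recall@5, the per-question values are averaged over the evaluation set before and after optimization.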

7. Pitfalls and Performance Tuning

7.1 Common Issues and Solutions

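Slow retrieval due to improper index selection: PGVector supports IVFFlat and HNSW indexes. The IVFFlat index requires the lists parameter; the pgvector README suggests starting with rows / 1000 for tables of up to 1 million rows and sqrt(rows) for larger ones. When the data volume exceeds roughly 5 million rows, switch to HNSW. Build the IVFFlat index only after the table has been loaded with representative data, because its cluster centroids are computed from the rows present at build time: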

-- IVFFlat index
CREATE INDEX ON knowledge_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
-- HNSW index (requires PGVector ≥ 0.5.0)
CREATE INDEX ON knowledge_chunks USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 200);

Agent tool calling loop errors: The Agent may repeatedly call the same tool (e.g., calling RAG query multiple times in a row, each time returning similar results). You can limit the maximum number of tool calls to break the loop, or have the tool return a “no new information” marker in its return value.

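// Pseudo code: limit the number of tool calls in a single Agent turn to 10.
// AgentExecutor and reactAgent are illustrative names rather than LangChain4j classes;
// check your framework version for its own setting that caps sequential tool invocations.
AgentExecutor executor = AgentExecutor.builder()
    .agent(reactAgent)
    .maxIterations(10)
    .build();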
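
Both safeguards, the call limit and the “no new information” marker, can also be enforced at the tool level without relying on framework support. The following is a minimal sketch; LoopGuard, its marker strings, and the use of exact string equality to detect repeated results are all assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

final class LoopGuard {

    static final String NO_NEW_INFO = "[No new information]";
    static final String LIMIT_REACHED =
            "[Tool call limit reached] Answer with the information gathered so far.";

    private final int maxCalls;
    private final Set<String> seenResults = new HashSet<>();
    private int calls = 0;

    LoopGuard(int maxCalls) {
        this.maxCalls = maxCalls;
    }

    // Wraps one tool invocation. Create a new guard per user turn so that
    // counters and seen results do not leak across turns.
    String call(Function<String, String> tool, String query) {
        if (calls >= maxCalls) {
            return LIMIT_REACHED;
        }
        calls++;
        String result = tool.apply(query);
        // Set.add returns false when an identical result was already returned this turn.
        return seenResults.add(result) ? result : NO_NEW_INFO;
    }
}
```

The marker strings give the LLM an explicit signal to stop retrieving and move on to answering, which tends to be more reliable than hoping it notices the repetition itself.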

Historical information pollution in multi-turn dialogues: When the user discusses topic A and then suddenly switches to topic B, the Agent might incorrectly fuse context from topic A. The solution is to add a “topic switch detection” logic in the Agent’s decision-making process: when a query is detected to be unrelated to the current topic, clear the previous context.

7.2 Performance Optimization Directions

Domain-specific fine-tuning of the embedding model: Use high-quality domain data to fine-tune the embedding model, especially by constructing hard negatives. For example, for technical documentation, pairs like “Linux system installation” and “MySQL database installation” that are easily confused can serve as hard negatives. After fine-tuning, Recall@5 can improve by 5-15%.

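// Pseudo code: Construct training examples with hard negatives.
// Document, TrainingExample, pickRandom, and findSimilarButOffTopic are hypothetical types and helpers.
List<TrainingExample> examples = new ArrayList<>();
for (Document doc : domainDocuments) {
    String query = doc.title();          // Query = doc title
    String positiveDoc = doc.content();  // Positive sample = doc content
    String randomNegativeDoc = pickRandom(domainDocuments, doc).content();
    // Hard negative = another document with similar semantics but a different topic
    String hardNegativeDoc = findSimilarButOffTopic(domainDocuments, doc).content();
    examples.add(new TrainingExample(query, positiveDoc, randomNegativeDoc, hardNegativeDoc));
}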

Metadata filtering + hybrid retrieval: In enterprise knowledge bases, metadata (such as document type, creation time, department) can effectively narrow the retrieval scope. For example, if the user asks “2024 R&D policy,” first filter with metadata year=2024 && type='policy', then perform vector retrieval.

Hybrid retrieval can use a weighted fusion such as 0.7 × vector similarity + 0.3 × BM25 keyword score, with both scores first normalized to a common range, to balance semantic matching and exact matching.
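
A minimal sketch of such a weighted fusion is shown below. It assumes min-max normalization of each score set, which is one common choice rather than the only one, and it treats a document missing from one result list as scoring 0 there; the WeightedHybrid class and its method names are illustrative.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class WeightedHybrid {

    // Rescales scores to [0, 1]; if all scores are equal, every document gets 1.0.
    static Map<String, Double> minMax(Map<String, Double> scores) {
        Map<String, Double> normalized = new HashMap<>();
        if (scores.isEmpty()) {
            return normalized;
        }
        double min = Collections.min(scores.values());
        double max = Collections.max(scores.values());
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            normalized.put(e.getKey(), max == min ? 1.0 : (e.getValue() - min) / (max - min));
        }
        return normalized;
    }

    // Returns documentId -> fused score; higher is better for both inputs and the output.
    static Map<String, Double> fuse(Map<String, Double> vectorScores, Map<String, Double> bm25Scores,
                                    double vectorWeight, double keywordWeight) {
        Map<String, Double> v = minMax(vectorScores);
        Map<String, Double> b = minMax(bm25Scores);
        Set<String> ids = new HashSet<>(v.keySet());
        ids.addAll(b.keySet());
        Map<String, Double> fused = new HashMap<>();
        for (String id : ids) {
            fused.put(id, vectorWeight * v.getOrDefault(id, 0.0) + keywordWeight * b.getOrDefault(id, 0.0));
        }
        return fused;
    }
}
```

Unlike RRF, this approach depends on the score distributions, so the weights are worth re-tuning on the evaluation set whenever the embedding model or the keyword index changes.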

Replace the Agent’s LLM with a local small model: If the Agent’s response latency is too high, consider switching the LLM from GPT-4 to a local model (such as Qwen2.5-7B or Llama-3-8B). Complex reasoning ability may drop, but latency can fall substantially (for example, from about 3 seconds to about 0.5 seconds per call, depending on hardware and prompt length). For simple scenarios, such as deciding which tool to call, a local small model is usually sufficient.

8. Summary and Future Directions

8.1 Core Recap

This article has walked through the engineering path for integrating Agents with a local knowledge base RAG system. The key points are as follows:

Building a Local RAG Knowledge Base: Implementing end-to-end vector storage and retrieval based on LangChain4j + PGVector, including extension installation, table structure design, document chunking strategy, embedding model selection, and index configuration.
Agent Tool Encapsulation: Encapsulating the RAG retrieval function as a callable Agent tool, defining its name, description, and parameters via the @Tool annotation, so that the Agent can call the knowledge base proactively and over multiple steps.
Multi-Turn Conversation Context Fusion: Using LangChain4j’s MessageWindowChatMemory to automatically manage conversation history, and leveraging the Agent’s planning capability to construct optimized queries before each retrieval.
DSPy Optimization: Improving retrieval accuracy and generation quality of the RAG pipeline through declarative programming and automated optimization.

8.2 Future Directions

Introducing Graph Embeddings and Knowledge Graphs: Map entities and relationships in the knowledge graph into the vector space, coexisting with document vectors in the same retrieval system. This allows the Agent not only to retrieve document fragments but also to directly retrieve knowledge about entity relationships (e.g., “Department A is responsible for after-sales maintenance of Product Line B”), enabling relational reasoning.
Multi-Agent Collaboration: One Agent is responsible for retrieving information, another for verifying answer accuracy, and a third for interacting with the user. Through frameworks like AutoGen, multiple Agents can collaborate via message passing to complete tasks, enhancing system reliability and scalability.
Offline Evaluation Pipeline: Establish a continuous offline evaluation pipeline using MTEB (Massive Text Embedding Benchmark) or other domain-specific benchmarks to track quality changes in the embedding model and RAG pipeline, preventing performance degradation due to model updates or data changes.

8.3 Reference Resources

GitHub Community Discussions: Technical principles and integration suggestions for introducing Agentic RAG or DSPy into local knowledge bases (search related Issue #1889).
LangChain4j Official Documentation: Provides complete API descriptions for Agents and RAG in the Java environment.
DSPy Official Tutorial: Contains rich RAG optimization cases covering Q&A, summarization, generation, and other scenarios.
“RAG Offline Part: Embedding Model Selection and Domain Adaptation Fine-Tuning” series of articles: Delves into embedding model selection methods, fine-tuning strategies, evaluation metrics, and advanced directions such as metadata enhancement and knowledge graph integration.
