RAG Online Part: Retrieval Optimization — HyDE and Query Expansion Techniques

Introduction: Why is your RAG retrieval always inaccurate? Starting with HyDE

Imagine this scenario: You’ve built an enterprise knowledge base RAG (Retrieval-Augmented Generation) system, carefully arranging thousands of medical guidelines. A user asks a seemingly simple question: “How should hypertensive patients adjust their diet?” After retrieval, the system returns the “Nutritional Guide for Marathon Runners.” Even more frustrating, when the user slightly rephrases the question, like “What are the dietary recommendations for hypertension?”, the system confidently returns an article about “Exercise Precautions for Hyperlipidemia Patients.”

This isn’t a fictional joke; it’s a common problem many RAG systems encounter in practice. The root cause often lies not in the large model itself, nor in the data quality of the knowledge base—though both are important—but in a semantic gap in the retrieval stage. Users are accustomed to expressing vague intentions in natural language (“How to eat better?”), while documents in the knowledge base exist in structured, professional language (“A low-sodium diet can lower systolic blood pressure by 4–5 mmHg”).

Traditional vector retrieval directly maps the user’s query into a high-dimensional vector space, then computes similarity with vectors of all documents. This “straightforward” matching approach struggles to capture the deep intent and descriptive details implicit in the query.

Thus, HyDE (Hypothetical Document Embeddings) emerged. It proposes a counterintuitive yet highly effective idea: Don’t use the user’s query for retrieval; instead, have the LLM first generate a hypothetical, optimal document that best matches the user’s intent, then use this “imagined” document as the retrieval probe. This may sound like “making things more complicated than necessary,” but precisely this clever “middleman” strategy significantly bridges the semantic gap between user queries and knowledge base documents.

This article will thoroughly deconstruct the principle of HyDE hypothetical document embeddings and guide you through a hands-on RAG query expansion technique. You will learn:

Deep Understanding: Why traditional vector retrieval frequently fails when faced with complex, vague queries.
Core Principle: How HyDE fundamentally improves retrieval relevance through a three-step process: Generate → Embed → Retrieve.
Hands-on Code: Implement a complete HyDE retrieval module using open-source models (Qwen2.5-7B + bge-large).
Advanced Techniques: Instruction engineering, hybrid retrieval (HyDE + BM25), and other tuning methods to take your HyDE to the next level.
Pitfall Guide: Common traps when implementing HyDE and how to evaluate its true benefits.

Whether you are a beginner in RAG systems or an experienced developer struggling with retrieval accuracy, this article will provide a set of ready-to-deploy solutions. Get ready, and let’s enter the core battlefield of RAG online retrieval optimization.

Core Principle: When the LLM “Imagines” the Best Document for Retrieval

Before diving into HyDE, we need to clearly understand: HyDE is an advanced variant of query expansion techniques. Traditional query expansion strategies, such as synonym substitution (expanding “computer” to “PC”, “desktop”), multi-query (generating multiple queries from different perspectives), etc., all focus on “rewriting” the user’s question itself. They attempt to cover more possible semantic expressions by enriching and varying the user input.

HyDE is completely different. Instead of rewriting the question, it directly generates a “blueprint” of the answer. In the HyDE framework, the Large Language Model (LLM) is no longer a simple query rewriter but a “document generator.” It is instructed: “Given the user’s question, assume you have already found the perfect answer; write a detailed text that answers the question in a style consistent with the knowledge base.” This generated text is the Hypothetical Document.

Subsequently, the RAG system discards the original query and uses the vector of this hypothetical document to search the knowledge base. Why do this? Because the semantics of the hypothetical document are closer to the style and content density of the real documents in the knowledge base. It is no longer the fragmented, colloquial “how to eat for hypertension” from the user, but a well-structured, information-dense piece like “Hypertensive patients should reduce sodium intake, eat more vegetables and fruits, control weight, and monitor blood pressure regularly.” When this quasi-document is used to match against the knowledge base, the hit rate naturally improves significantly.

The core insight of HyDE is: In the vector space, the semantic gap between a query and a document is much larger than the semantic gap between two documents. Instead of laboriously mapping the query into the document space, it’s better to let the LLM directly create a virtual “anchor” located in the document space, then search around that anchor.

To understand why HyDE works, we must first dissect the weaknesses of traditional direct vector retrieval.

First, Asymmetry in Length and Granularity. User queries are usually very short (a few words to tens of words), while documents or chunks in the knowledge base can be hundreds or even thousands of words long. When encoding, vector models need to pool over long texts (e.g., averaging or concatenation) to produce a fixed-length vector. A short query vector has extremely low information density, capturing only the most superficial keywords. When used in a dot product with a long document vector rich in information and complex structure, it is almost impossible to precisely locate the most relevant content.

It’s like aiming at a large mountain with the crosshairs of a toy gun; you end up hitting everywhere but missing the target.

Second, the Lexical Gap. Users and document authors use different vocabulary to express the same meaning. A user might say “How to save money on spending?”, while the document might say “Personal financial optimization strategies analysis.” Although general-purpose embedding models have learned some synonym relationships, they still struggle with domain-specific terms and expressions.

“Hypertrophic cardiomyopathy” and “HCM” are the same in the medical field, but in the vector space, their distance might be greater than between “hypertrophic cardiomyopathy” and “hypertension.” This lexical gap leads to retrieval results that are often “keyword-matching but semantically irrelevant.”

Note: The lexical gap is the primary reason RAG systems perform poorly in vertical domains (e.g., medical, legal, finance). General embedding models cannot fully understand the jargon of these domains.

Third, Intent Ambiguity. A query often “means something but is vague.” For example, a user asks “Apple’s profit.” The intent could be “Apple Inc.’s net profit,” “profit from growing apples,” or even “the channel profit of Apple (iPhones).” Traditional vector retrieval requires the model to “guess” the vague intent of the query, but the probability of guessing correctly is very low. It tends to match documents containing both keywords “Apple” and “profit,” regardless of how they relate contextually.

It is these blind spots that cause direct vector retrieval to perform poorly when faced with complex, vague, and implicit queries. The emergence of HyDE aims precisely to illuminate these blind spots.

2. HyDE’s Solution: Three Steps — Generate, Embed, Retrieve

The cleverness of HyDE lies in transforming the above problems into a “document generation” task. Let’s fully break down the step-by-step process of using HyDE:

Step 1: Generate — Make the LLM “Imagine” a Perfect Document

The key is to carefully design a prompt that guides the LLM to generate a “hypothetical document.” This document is not extracted from the knowledge base but is an “exemplary essay” generated by the LLM based on its vast pre-training knowledge. Key points of the prompt include:

Define Role: You need the LLM to act as a knowledgeable expert or a document writer.
Specify Style: Require the generated text to be consistent with the style of documents in the knowledge base (e.g., formal, objective language for medical guidelines; easy-to-understand marketing language for product descriptions).
Emphasize Details: Require the generated content to be detailed, specific, and logical, preferably including some details and examples.
Limit Length: Control the length of the generated document to avoid being too long (introducing noise) or too short (failing to bridge the gap). Typically, generate 1–3 paragraphs, about 100–500 tokens.

Example Prompt:

1
2
3

System Role: You are a professional medical health consultant.
User Question: How should hypertensive patients adjust their diet?
Task: Based on the user's question, write a detailed, objective document suitable for retrieval. This document should include specific dietary advice, reasons, and precautions. Use a professional, formal style.

Example Hypothetical Document (Hypo Doc) generated by LLM:

"Dietary adjustment for hypertensive patients is a key part of cardiovascular disease management. The primary principle is to limit sodium intake, with daily salt consumption controlled below 5 grams. It is recommended to increase intake of potassium, calcium, and magnesium, and eat more fresh vegetables, fruits, whole grains, and low-fat dairy products. At the same time, reduce intake of saturated and trans fats, and limit consumption of high-cholesterol foods. Monitor blood pressure regularly, and combine with regular physical exercise and weight control. For patients on long-term diuretics, attention should be paid to potassium supplementation."

Step 2: Embed — Reuse the Embedding Model, Treat Equally

Feed the generated hypothetical document into the same embedding model used for the knowledge base documents to compute its vector. This step is crucial because only by using the same encoder can we ensure that the hypothetical document vector and all document vectors reside in the same “language” vector space. This vector is hypo_vec.

Step 3: Retrieve — Search Using the Hypothetical Document Vector

Now, discard the original query vector (query_vec) and use hypo_vec to search the knowledge base for the most similar documents. Since hypo_vec‘s semantics are closer to real documents, its similarity score with the target document is usually much higher than with irrelevant documents.

Best Practice: HyDE is not just a “trick”; it is more like a paradigm shift in retrieval strategy. It does not change the underlying retrieval algorithm but changes the retrieval probe.

Comparison with Multi-Query:

Multi-Query: Generates multiple differently phrased queries, retrieves separately, then merges and deduplicates. It aims to solve the lexical gap by expanding query terms to cover more semantics.
HyDE: Generates one ideal document and performs a single retrieval using it. It aims to solve the granularity and intent gap by creating an “anchor” in the document space for precise targeting.

In some scenarios, the two can complement each other: first use Multi-Query to generate multiple queries, then apply HyDE to each query separately, and finally fuse the results. But HyDE alone can bring significant improvements and is simpler to implement.

Hands-on: Implementing a HyDE Retrieval Module Step by Step

All theory aside, there’s nothing like running an actual example. Below, I’ll guide you through building a complete HyDE retrieval module using Python, combining popular open-source LLMs and embedding models.

1. Environment Setup and Model Selection

Recommended Combination (Tools for LLM Retrieval Optimization):

LLM Generation Model: Qwen2.5-7B-Instruct. It performs excellently in both Chinese and English, with fast inference speed and moderate resource consumption. If resources are limited, you can use smaller models like Qwen2.5-1.5B-Instruct or ChatGLM3-6B.
Embedding Model: BAAI/bge-large-zh-v1.5.

This is one of the best-performing open-source embedding models for Chinese, with a vector dimension of 1024 and outstanding encoding of knowledge bases. You can also use models like moka-ai/m3e-base.

Inference Framework: vLLM. For LLM generation, using vLLM is strongly recommended for its fast inference speed and excellent concurrency. For simple demos, HuggingFace Transformers is also acceptable.
Vector Library: Faiss. It is one of the most popular vector retrieval libraries, supporting various index structures with excellent performance.

Lightweight Solution: If you have limited hardware (e.g., only 8GB VRAM), you can choose smaller model combinations or use online LLM APIs (e.g., OpenAI, ERNIE Bot, Tongyi Qianwen) to generate hypothetical documents.

# Example: Import basic libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Choose a simple general-purpose embedding model for demonstration
# For production, bge-large is recommended
embed_model = SentenceTransformer('all-MiniLM-L6-v2') 

# Choose a lightweight LLM for local testing
# For production, vLLM deployment of Qwen2.5-7B or API calls to larger models is recommended
llm_model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # or other models
tokenizer = AutoTokenizer.from_pretrained(llm_model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
model.eval()

Note: The above code is only for verifying the pipeline. In a production environment with high concurrency, be sure to use vLLM or similar inference frameworks for LLM deployment, and dedicated vector libraries like Faiss.

2. Core Code: Generate Hypothetical Document → Retrieve → Output Results

Now, let’s focus on the core logic. The following code demonstrates the complete HyDE pipeline.

import torch
from sentence_transformers import SentenceTransformer
import numpy as np

# Assume you have already initialized embed_model and llm_model (as shown above)
# For demonstration, we use a simplified LLM call function

def generate_hypothetical_document(query):
    """
    Use LLM to generate a hypothetical document
    """
    prompt = f"""You are a professional medical health consultant.
User Question: {query}
Task: Based on the user's question, write a detailed, objective document suitable for retrieval.

This document should include specific advice, reasons, and precautions. Use a professional, formal style.
Output a paragraph of about 100-200 Chinese characters:"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,   # temperature control, 0.7 balances creativity and relevance
            do_sample=True,
            top_p=0.9
        )
    # Decode output and strip whitespace
    hypo_doc = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Remove prompt part, keep only generated content
    # A more rigorous way is to directly get the generated part during generation, but we simplify here
    if prompt in hypo_doc:
        hypo_doc = hypo_doc.split(prompt)[1].strip()
    # If generation fails or is empty, fall back to the original query
    if not hypo_doc:
        hypo_doc = query
    return hypo_doc

# Example knowledge base
knowledge_base = {
    1: "Marathon runners need a high-carbohydrate diet and pay attention to electrolyte replenishment.",
    2: "Dietary advice for hypertensive patients: low salt, low fat, increase potassium and calcium intake, control weight.",
    3: "Diabetic patients should monitor carbohydrate intake and use insulin or oral medications on schedule.",
    4: "Consistent regular exercise is an effective way to control blood pressure and blood sugar.",
    5: "Elderly hypertensive patients should be aware of orthostatic hypotension and avoid standing up quickly."
}
doc_ids = list(knowledge_base.keys())
doc_texts = list(knowledge_base.values())

# 1. Use the embedding model to vectorize all documents in the knowledge base
doc_embeddings = embed_model.encode(doc_texts) # shape: [num_docs, embed_dim]

# 2. Build Faiss index (using the simplest L2 distance index)
index = faiss.IndexFlatL2(doc_embeddings.shape[1]) 
index.add(doc_embeddings.astype(np.float32))

# 3. User query
query = "How should hypertensive patients adjust their diet?"

# 4. Execute HyDE pipeline
print(f"Original query: {query}")

# Step 1: Generate hypothetical document
hypothetical_doc = generate_hypothetical_document(query)
print(f"Hypothetical document: {hypothetical_doc}")

# Step 2: Encode the hypothetical document and use it as the retrieval query
# Note: We are NOT using query_vector, but hypothetical_vector
hypothetical_vector = embed_model.encode([hypothetical_doc]).astype(np.float32)

# Step 3: Search using the hypothetical document vector
k = 1  # Return the most similar document
distances, indices = index.search(hypothetical_vector, k)

# Output results
best_idx = indices[0][0]
best_doc_id = doc_ids[best_idx]
best_doc_text = doc_texts[best_idx]
print(f"Retrieved document ID: {best_doc_id}")
print(f"Retrieved document content: {best_doc_text}")

# Comparison: retrieval using the original query only
query_vector = embed_model.encode([query]).astype(np.float32)
distances_raw, indices_raw = index.search(query_vector, k)
best_idx_raw = indices_raw[0][0]
print(f"\nDirect retrieval result: {doc_texts[best_idx_raw]}")

Step-by-step Explanation of the Code:

generate_hypothetical_document(query) function:
- It defines a prompt template that instructs the LLM to act as a “medical health consultant” and generate a detailed, professional document. Remember, the prompt determines the quality of the generated hypothetical document.
- Text is generated via model.generate(). max_new_tokens controls the length. Too long a hypothetical document introduces noise.
- Common Pitfall: After generation, clean the output to remove the prompt itself; otherwise, the vector will contain noise.
Building the knowledge base and Faiss index:
- We encode all documents in the knowledge base list doc_texts into a vector matrix doc_embeddings using the embedding model in one go.
- We use Faiss’s IndexFlatL2 index, a brute-force search index with maximum accuracy, but slower with tens of millions of vectors. For up to millions of vectors, it is sufficient.
Core HyDE Retrieval:
- Call generate_hypothetical_document to generate the hypothetical document.
- Call embed_model.encode on the hypothetical document to get hypothetical_vector.
- Use index.search(hypothetical_vector, k) to search the knowledge base with the hypothetical document vector.
- Output the result.
Result Comparison:
- Finally, we compare the result from retrieving using the original query vector query_vector. In this example, direct retrieval likely returns the “Marathon runners…” document because it contains the word “diet.” HyDE retrieval returns the correct “Dietary advice for hypertensive patients…”

Note: This example is simple but reveals the core idea of HyDE. In real projects, you need to fine-tune the prompt for generating the hypothetical document based on your knowledge base content.

Advanced Techniques: How to Make HyDE Even Better?

By now, you should be able to run a basic HyDE pipeline. However, to make it shine in real projects, you need some “internal skills” for RAG online retrieval optimization strategies.

1. Generation Quality Tuning: Instruction Engineering and Few-shot Examples

The quality of the hypothetical document is the cornerstone of HyDE. A poor-quality hypothetical document (e.g., containing incorrect information, inconsistent style, too general) will not only fail to improve retrieval but may introduce significant noise and degrade results.

Instruction engineering is a powerful tool for adjusting quality.

Control Style: Specify the document style explicitly in the prompt. For example: “Write a paragraph in the style of the Chinese Hypertension Prevention and Treatment Guidelines…”. If the knowledge base contains many tables and lists, try guiding the prompt to generate structured text.
Add Constraints: You can require the generated content to include certain key elements. For example: “Make sure your answer includes the three keywords ‘diet’, ‘exercise’, and ‘medication’.”
Use System Prompt: Define the expert role in the LLM’s system prompt, telling the model it must possess deep medical knowledge and “only output text relevant to the question and suitable for retrieval; do not add any explanations or discussions.”

Few-shot examples are a powerful method to enhance generation consistency.

Providing one or a few high-quality hypothetical documents as examples in the prompt can significantly improve the LLM’s ability to mimic. Example:

Based on the user's question, write a professional document suitable for retrieval.
Example 1:
User Question: How to prevent diabetic foot?
Hypothetical Document: The core of diabetic foot prevention lies in controlling blood sugar, regular foot examination, choosing appropriate shoes and socks, and avoiding foot injury. It is recommended to wash feet with warm water daily and dry thoroughly, use moisturizing cream to keep skin supple but avoid applying between toes. Once redness, swelling, or ulceration is found, seek medical attention immediately.
Example 2:
User Question: What are the harms of childhood obesity?
Hypothetical Document: Childhood obesity not only affects body shape but also increases the risk of hypertension, diabetes, sleep apnea syndrome, osteoarthritis, and other diseases. Meanwhile, obesity also has negative effects on children's psychological health. Therefore, parents should monitor their children's weight and cultivate healthy eating and exercise habits.
---
User Question: [Your query]
Hypothetical Document:

Best Practice: Prepare different few-shot examples for different knowledge base domains (e.g., medical, legal, financial). This is equivalent to training a dedicated “document generator” for each domain.

2. Hybrid Retrieval: HyDE + Sparse Retrieval for Complementarity

The hypothetical document vector produced by HyDE is a dense vector. It excels at capturing semantic similarity but is relatively weak in handling high-precision keyword matching (e.g., product codes, book titles). In contrast, sparse retrieval (e.g., BM25) is based on term frequency and inverse document frequency, excelling at precise keyword matching but insensitive to semantic similarity.

Combining the two can achieve complementary strengths. A common approach is: perform retrieval separately using the HyDE vector (dense) and the original query (sparse), then fuse the two result lists using RRF (Reciprocal Rank Fusion).

RRF Fusion Formula:
The final score of each document = 1/(60 + rank_hyde) + 1/(60 + rank_bm25), where rank_hyde and rank_bm25 are the ranks of that document in the HyDE and BM25 retrieval results respectively.

Implementation Steps:

Perform sparse retrieval using BM25 on the original query query, obtaining a top-k result list list_bm25 (containing document IDs and scores).
Perform dense retrieval using HyDE (i.e., hypothetical document vector), obtaining a top-k result list list_hyde.
Apply the RRF fusion formula to all documents in both lists to calculate the final score.
Sort by final score in descending order and take the top-n as the final result.

This hybrid scheme in query expansion techniques often yields more robust and accurate retrieval results than either method alone.

Common Pitfalls: Easily Overlooked Traps When Implementing HyDE

When applying HyDE to production environments, there are several common traps to watch out for.

1. The Generated Document “Doesn’t Make Sense” — What to Do?

This is the most frustrating problem. It mainly occurs due to two reasons:

Insufficient LLM capability: For very specialized or rare questions, even medium-sized models may not generate a meaningful hypothetical document.
Poorly designed prompt: The prompt does not clearly guide the LLM to focus on the question, or does not limit its output scope, causing the LLM to diverge.

Solutions:

Validate the plausibility of the generated document: Before formal retrieval, compute the cosine similarity between the hypothetical document vector hypo_vec and the original query vector query_vec. If the similarity is below a certain threshold (e.g., 0.5), it indicates the hypothetical document may have “gone off track.” In that case, fall back to using the original query for retrieval.
Use a larger LLM: If budget allows, using a more powerful model (e.g., Qwen2.5-72B, GPT-4) can significantly improve generation stability.
During testing, you can call these models via API to evaluate their effectiveness.
Add constraints in the prompt: Explicitly tell the LLM: “The document you generate must be directly related to the user’s question and must not contain any irrelevant information. If the question is unclear, assume the most likely scenario.”

2. Latency and Cost: Is HyDE Worth Introducing?

Although HyDE brings significant improvements in retrieval quality, it also introduces additional latency and cost.

Latency: The HyDE pipeline is: generate hypothetical document (LLM inference) → encode hypothetical document (embedding model) → retrieve. Compared to direct retrieval, it adds an LLM inference step and an extra encoding step. LLM inference is the main bottleneck.
Cost: If using an API to call the LLM, each HyDE step incurs one API call. For high-frequency scenarios, costs can quickly increase.

Mitigation Strategies:

Cache hypothetical documents: For frequently occurring queries or query templates, you can cache the generated hypothetical document and its vector. When encountering the same or highly similar query again, directly use the cached vector, skipping the LLM generation step. This is an efficient fallback strategy.
Async generation: Decouple the LLM generation step from the retrieval-reading step. First, perform a quick retrieval using the original query so users can see a preliminary answer quickly. Meanwhile, asynchronously execute HyDE to generate more accurate retrieval results and, once ready, overlay or supplement the preliminary answer. This can significantly improve user experience.
Selective application: Not all queries need HyDE. You can build a lightweight classifier (or simple rules) to determine the complexity and ambiguity of the query. For simple, clear queries, use the original query directly; for ambiguous, complex queries, trigger HyDE.
This is the prototype of “adaptive query expansion.”

Summary and Outlook: From HyDE to Smarter Online Retrieval

Let’s recap the core breakthrough of the HyDE hypothetical document embedding principle: It uses an elegant “generate-embed-retrieve” pipeline to transform the problem of mismatched granularity between user queries and knowledge base documents into a “document-document” matching problem. This “retreat to advance” strategy cleverly improves retrieval relevance, especially for complex, vague queries.

In this hands-on query expansion technique practice, we learned three things:

Understanding the principle: HyDE is a “star” in query expansion; it doesn’t rewrite the question but generates a “blueprint” of the answer.
Hands-on implementation: Using Qwen2.5-7B + bge-large, we can easily replicate the core HyDE pipeline in a local environment.
Fine-tuning: Instruction engineering, few-shot examples, and hybrid retrieval (HyDE + BM25) are important tools to maximize the effectiveness of HyDE.

Looking ahead, online retrieval technology for RAG is evolving towards smarter, more adaptive, and lower-cost directions.

Adaptive Query Expansion: Future systems will dynamically decide which retrieval strategy to use (direct retrieval, Multi-Query, HyDE, or their combination) based on query complexity, knowledge base characteristics, and user context. This requires smarter “strategy selectors.”
End-to-End Retrieval as a Service: We can view HyDE as a “retrieval enhancement function” that can be integrated into more complex RAG pipelines. Frameworks like vLLM and LangChain already provide implementations of HyDE, and it may become a standard component of every RAG system.
Integration with Knowledge Graphs: The hypothetical documents generated by HyDE can be used not only for dense retrieval but also for entity recognition and relation extraction in knowledge graphs, enabling deeper knowledge understanding.

HyDE is not a silver bullet, but it is undoubtedly one of the most effective and practical techniques currently available for improving RAG online retrieval performance. It tells us that sometimes “taking a detour” can get you to the destination faster than “going straight.” Now is the time to try HyDE in your projects. You’ll find that those frustrating “irrelevant answers” may no longer exist.

Summary

Through this article, you should have a deeper understanding of the “HyDE hypothetical document embedding principle.” I suggest practicing with real projects. If you have any questions, feel free to discuss!