1. Introduction: The “Blindness” Dilemma of Traditional RAG and the Multimodal Breakthrough

Have you ever encountered this scenario: you feed a PDF report full of charts, product images, and handwritten signatures into a traditional RAG system, only to get the reply “No relevant information found”? This isn’t the system being “lazy”; it’s because traditional RAG suffers from a fatal “blindness” defect—it can only index and understand plain text documents. When faced with visual elements like images, tables, and formulas, it’s like a person in the dark, completely helpless.

In enterprise applications, this defect is especially critical. A financial annual report might contain a dozen bar charts showing revenue changes, a medical diagnostic report might have key information hidden in CT images, and the core selling points of a product manual might all be in schematic diagrams. If the RAG system can only read the surrounding text, it can miss a large share of what the document actually says. The RAG system you painstakingly built is then essentially “fumbling in the dark” whenever the answer lives in a visual element.

Multimodal RAG is a technical solution designed precisely to break this text boundary. Its core logic is simple: make the RAG system not only “read” text, but also “see” images, “recognize” charts, and “understand” formulas. By bringing unstructured visual data (like images and tables) and text data into a single retrieval-generation framework, multimodal RAG can understand and answer questions over documents that mix text and images.

Specifically, multimodal RAG breaks through via two core paths: Text-Image Vector Retrieval and Visual Understanding + Hybrid Ranking. The former uses dual-tower models like CLIP to map text and images into the same high-dimensional semantic space, enabling “text-to-image search,” “image-to-image search,” and even “text-image hybrid search”; the latter introduces OCR (Optical Character Recognition) and VLM (Vision-Language Model) to allow the system to “read” text and semantics within images, and even understand the logical relationships behind complex charts.

Through this article, you will systematically master the complete knowledge system of multimodal RAG:

  • Principles: Thoroughly understand CLIP dual-tower models, RRF fusion retrieval, OCR+VLM collaboration mechanisms;
  • Practice: Gain reproducible Python code covering text-to-image search, image-to-image search, and enterprise-grade multimodal PDF RAG Pipeline;
  • Advanced Tips: Master RRF parameter tuning, domain-specific fine-tuning of multimodal embeddings, and a guide to avoiding pitfalls.

Whether you are a developer just starting with RAG or a senior engineer seeking to implement multimodal retrieval in your projects, this article will provide a clear roadmap and directly runnable code examples. Ready? Let’s break the boundaries of text together.

2. Core Principles of Multimodal RAG: From Single Modality to Text-Image Dual Channel

2.1 Two-Stage Architecture: Retrieval and Generation, Clear Division of Labor

The overall architecture of multimodal RAG can be divided into two major stages: Text-Image Hybrid Retrieval and Visually Augmented Generation. These two stages are not simply concatenated; they work together, forming a complete closed loop.

The retrieval stage aims to quickly find the most relevant candidate segments from massive text-image data based on the user query. Unlike pure text RAG, the retrieval objects here include not only text paragraphs but also images, charts, and even tables. The challenge is: how to compare data from different modalities (text, images) within the same retrieval space?

The task of the generation stage is more complex: take the retrieved multimodal segments (text + images) as context, combine them with the user’s query, and generate a natural language answer. This answer may not only contain text but also cite or describe the retrieved images. For example: “According to Figure 3, the company’s Q3 revenue increased by 15%, primarily due to the success of the new product line.”

2.2 Dual-Tower Models: How CLIP Achieves Text-Image Semantic Alignment

The core technology for implementing text-image hybrid retrieval is the dual-tower model, with OpenAI’s CLIP being a prime example. You can understand CLIP’s working principle through the concept of “bridging”: it uses two independent encoders, a Transformer text encoder and an image encoder (a ViT or ResNet), to map text and images into the same high-dimensional vector space.

Text Encoder:
"A red car" -> [0.12, 0.87, -0.45, ...]
Image Encoder:
Image -> [0.13, 0.86, -0.44, ...]

Through contrastive learning training on massive text-image pairs (e.g., 400 million web text-image pairs), CLIP learns a crucial ability: semantically similar text and images have vector representations close to each other in the high-dimensional space; semantically unrelated text and images have vectors far apart. This achieves text-image semantic “alignment”.

Note: CLIP’s alignment is based on the semantic level. It excels at understanding the “content” and “theme” of images but performs poorly on fine-grained object detection, spatial relationships, and text recognition within images (like phone numbers on a business card). This is why in practice, CLIP is often used in conjunction with OCR and VLM.

Suppose we have a knowledge base:

  • Image A: A red sports car driving on a road
  • Image B: A snowy mountain

User query: “A red car.” After CLIP encoding, the text vector for “A red car” will have a much higher cosine similarity with Image A’s vector than with Image B’s. Therefore, the system can accurately return Image A. This is the underlying principle of text-to-image retrieval in multimodal RAG.
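
The comparison above can be reproduced with a few lines of plain Python. This is only a sketch: the three-dimensional vectors are made-up stand-ins for real CLIP embeddings (which have hundreds of dimensions), and the image names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for CLIP embeddings (made-up values, not model output).
query_vec = [0.12, 0.87, -0.45]  # "A red car"
image_vecs = {
    "image_a_red_sports_car": [0.13, 0.86, -0.44],
    "image_b_snowy_mountain": [-0.70, 0.10, 0.65],
}

# Rank images by similarity to the query, best first.
ranked = sorted(
    image_vecs,
    key=lambda name: cosine_similarity(query_vec, image_vecs[name]),
    reverse=True,
)
print(ranked[0])
```

The query vector points in almost the same direction as Image A's vector and roughly away from Image B's, so Image A ranks first.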

2.3 Sparse Retrieval and RRF Fusion: Why Pure Vector Retrieval Isn’t Enough

Although vector retrieval (dense retrieval) performs well in multimodal retrieval, it’s not a silver bullet. Especially when the query contains precise entity names, numbers, or dates, pure vector retrieval often performs poorly. For example, a query like “Table 3 on page 12 of the 2023 financial report” might cause the vector model to retrieve completely irrelevant content because it cannot understand positional information like “page 12.”

This is where sparse retrieval (e.g., BM25) comes into play. BM25 matches based on term frequency-inverse document frequency, making it very sensitive to exact matches. However, traditional BM25 can only handle text, not images directly. So what’s the solution?

The answer is: use both vector retrieval and sparse retrieval, then fuse them using RRF (Reciprocal Rank Fusion). The specific steps are:

  1. Vector Retrieval: Use CLIP to encode the user query and images, compute similarity, and get the top-N candidate results.
  2. Sparse Retrieval: Perform OCR on images to extract text (or use text descriptions of images), then run BM25 retrieval on these texts to get top-N candidate results.
  3. RRF Fusion: For each candidate document, sum its reciprocal-rank contributions over both result lists: score = ∑ 1/(rank + k). Here, rank is the document’s 1-based position in one retrieval result (a document missing from a list contributes nothing for that list), and k is an empirical constant (usually set to 60). Finally, sort by total RRF score.

This RRF fusion scheme compensates for the shortcomings of each single retrieval method. Vector retrieval contributes “semantic relevance,” while BM25 contributes “exact matching.” Because a document that ranks well in either list can surface in the fused result, retrieval becomes more robust than with either method alone.
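
The fusion step can be sketched as follows. The function name `rrf_fuse` and the document IDs are hypothetical; the two input lists stand in for the top-N results of CLIP retrieval and BM25 retrieval.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    Returns (doc_id, score) pairs sorted by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

clip_ranking = ["img_7", "img_2", "img_9"]  # hypothetical vector-retrieval result
bm25_ranking = ["img_2", "img_4", "img_7"]  # hypothetical BM25 result

fused = rrf_fuse([clip_ranking, bm25_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here img_2 comes out on top: it is second in one list and first in the other, which beats img_7 (first and third). Documents that appear in only one list still receive a score, just a smaller one.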

Tip: In multimodal RAG, we usually perform BM25 only on the “text description” of images, not directly on the images themselves, because BM25 works on terms and cannot score raw pixels. This text description can come from OCR-recognized text or VLM-generated semantic labels for images.
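
To make the tip concrete, here is a minimal BM25 scorer run over image text descriptions. It is a sketch, not a production implementation: the whitespace tokenizer, the parameter values k1=1.5 and b=0.75, the image IDs, and the description strings are all assumptions for illustration. In practice you would more likely use an existing BM25 library or a search engine.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer; real systems need language-aware tokenization."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` (dict id -> text) against `query` with Okapi BM25."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

# Hypothetical "text descriptions" of images (OCR text or VLM labels).
image_texts = {
    "page_12_img_0": "table 3 revenue by quarter 2023 bar chart",
    "page_4_img_1": "company logo on a blue background",
    "page_9_img_2": "line chart of gross margin 2022",
}
scores = bm25_scores("table 3 2023", image_texts)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Only the first description contains the exact terms "table", "3", and "2023", so it is the only one with a non-zero score. This is the kind of exact match that embedding similarity alone tends to miss.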

3. Comparison of Four Typical Text-Image Retrieval Schemes: Text-to-Image, Image-to-Image, Hybrid Retrieval, End-to-End RAG

Now that we understand the principles, let’s look at the specific implementation paths for multimodal RAG image-text retrieval in real-world development. Depending on business requirements, you can choose different schemes.

3.1 Text-to-Image Retrieval: Text Query Returns Images

This is the most common and basic scenario. The user enters a text description, and the system returns the best matching images from the image library.

Core Workflow:

  1. Offline Phase: Use an image encoder (e.g., CLIP’s Image Encoder) to encode all images, storing the image vectors in a vector database (e.g., Faiss, Milvus).
  2. Online Phase: The user inputs a text query, which is encoded into a vector using the text encoder (e.g., CLIP’s Text Encoder). Then perform ANN (Approximate Nearest Neighbor) search in the vector database.
  3. Output: Return the top-K images with the highest similarity.

Use Cases: Product image search, material library search by topic, creative design reference.

3.2 Image-to-Image Retrieval: Finding Similar Visual Elements

Core Workflow:

  1. Offline Phase: Same as text-to-image; encode all images and store them in the database.
  2. Online Phase: The user provides a reference image, which is encoded into a vector using the image encoder. Then search the vector library for the most similar images.
  3. Output: Return visually closest images.

Use Cases: Image-based product search (e-commerce), finding similar materials, image deduplication, intellectual property protection.

3.3 Hybrid Retrieval: Unified Retrieval with Combined Text and Image Features

This is a more advanced retrieval method. The user’s query may contain both text and images. For example, a user uploads an image of a “beach” and types “add people by the sea.” The system needs to understand the combined semantics of both “beach” and “people by the sea.”

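Core Workflow: Encode the input text and image, combine them into a single fused query vector, then search the database. With CLIP, which encodes each modality separately, a common approach is to take a weighted average of the two normalized embeddings; a more powerful multimodal encoder can take both inputs jointly.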

Use Cases: Complex image descriptions, fine-grained search with example images and text descriptions.
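
One simple way to build a fused query vector from separately encoded text and image embeddings is a weighted average followed by renormalization. The sketch below assumes this averaging strategy (it is one option among several, not a method prescribed by CLIP); the function names, the weight, and the toy vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def fuse_query(text_vec, image_vec, text_weight=0.5):
    """Weighted average of two unit vectors, renormalized to unit length.

    Assumes the two vectors are not exact opposites (the mix would be zero).
    """
    t = l2_normalize(text_vec)
    i = l2_normalize(image_vec)
    mixed = [text_weight * a + (1 - text_weight) * b for a, b in zip(t, i)]
    return l2_normalize(mixed)

text_vec = [0.2, 0.9, 0.1]   # stand-in for the encoded text "add people by the sea"
image_vec = [0.8, 0.1, 0.3]  # stand-in for the encoded beach image
query_vec = fuse_query(text_vec, image_vec, text_weight=0.6)
print([round(x, 3) for x in query_vec])
```

Raising `text_weight` pulls results toward the typed description, and lowering it pulls them toward the uploaded image. The fused vector is unit length, so it can be searched in the same inner-product index as ordinary queries.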

3.4 End-to-End Multimodal RAG: Retrieve Then Generate, Output Text Answers with Image References

This is the solution closest to intelligent applications. After receiving a hybrid text-image query, the system performs the hybrid retrieval described above. Instead of directly returning a list of images, it passes the retrieved text and images as context to a VLM (Vision-Language Model), either self-hosted or behind an API (e.g., GPT-4o), which generates a text answer that can cite the retrieved images.

Example: a user asks, “Which quarter last year had the highest gross margin? Please show the relevant chart.” The system retrieves the corresponding quarterly financial report image (chart) and then calls the VLM to generate the answer: “According to Figure X, Q3 had the highest gross margin at 45.3%.” The output includes Figure X.

Selection Criteria:

  • If your task is to retrieve images directly from an image library, text-to-image or image-to-image retrieval is sufficient.
  • If your task is to answer a complex question that requires understanding image content and making inferences, End-to-End Multimodal RAG is more appropriate, and you will need to combine OCR with VLM-based image understanding.

4. Practice 1: Building a Text-to-Image and Image-to-Image Retrieval System Based on CLIP (with Code)

Next, we dive into practice. We’ll use the CLIP model from the HuggingFace Transformers library and the open-source library Faiss to build a complete text-to-image/image-to-image retrieval system. The code will be explained line by line to ensure you can reproduce it directly.

4.1 Environment Setup

# Install necessary libraries
pip install torch torchvision
pip install transformers
pip install faiss-cpu # Use faiss-gpu for GPU environments
pip install Pillow

4.2 Code Implementation

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import numpy as np
import faiss
import os

# ==========================================
# Part 1: Initialize Model and Processor
# ==========================================
# Load CLIP model (base version, roughly 600 MB of weights)
model_name = "openai/clip-vit-base-patch32" # Can also use "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Helper function: Encode a single image
def encode_image(image_path):
    """
    Convert a single image to a vector
    :param image_path: Path to the image file
    """
    try:
        image = Image.open(image_path).convert("RGB") # Ensure RGB mode
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            image_features = model.get_image_features(**inputs)
        # Normalize: crucial for cosine similarity calculation
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        return image_features.cpu().numpy().flatten()
    except Exception as e:
        print(f"Error processing image {image_path}: {e}")
        return None

# Helper function: Encode text query
def encode_query(query_text):
    """
    Convert a text query to a vector
    """
    inputs = processor(text=[query_text], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_features = model.get_text_features(**inputs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return text_features.cpu().numpy().flatten()

# ==========================================
# Part 2: Build Image Vector Library (using Faiss)
# ==========================================
# Assume we have an image directory
image_dir = "./my_images" # Replace with your image directory
image_paths = [os.path.join(image_dir, f) for f in os.listdir(image_dir) if f.endswith(('.jpg', '.png', '.jpeg'))]

print(f"Found {len(image_paths)} images")

# Precompute vectors for all images and store in Faiss index
dimension = 512 # CLIP ViT-B/32 vector dimension; 768 for large version
index = faiss.IndexFlatIP(dimension) # Use inner product index (equivalent to cosine similarity since normalized)

image_vectors = []
valid_image_paths = []
for img_path in image_paths:
    vec = encode_image(img_path)
    if vec is not None:
        image_vectors.append(vec)
        valid_image_paths.append(img_path)

# Stack vectors into a matrix and add to index
if image_vectors:
    image_vectors_np = np.array(image_vectors).astype('float32')
    index.add(image_vectors_np)
    print(f"Successfully added {index.ntotal} vectors to index")
else:
    print("Warning: No valid image vectors added to index")
    exit()

# ==========================================
# Part 3: Perform Retrieval
# ==========================================
def search_by_text(query_text, top_k=5):
    """
    Text-to-Image Search
    :param query_text: User input text description
    :param top_k: Return top-k matching images
    """
    query_vec = encode_query(query_text).astype('float32').reshape(1, -1)
    distances, indices = index.search(query_vec, top_k)

    results = []
    for i, idx in enumerate(indices[0]):
        if idx != -1 and idx < len(valid_image_paths):
            results.append((valid_image_paths[idx], distances[0][i]))
    return results

def search_by_image(query_image_path, top_k=5):
    """
    Image-to-Image Search
    """
    query_vec = encode_image(query_image_path)
    if query_vec is None:
        return []
    query_vec = query_vec.astype('float32').reshape(1, -1)
    distances, indices = index.search(query_vec, top_k)

    results = []
    for i, idx in enumerate(indices[0]):
        if idx != -1 and idx < len(valid_image_paths):
            results.append((valid_image_paths[idx], distances[0][i]))
    return results

# ==========================================
# Part 4: Test Retrieval Effectiveness (Example)
# ==========================================
if __name__ == "__main__":
    # Test text-to-image search
    query = "A forest in autumn" # Replace with your query
    print(f"\nText-to-Image Query: '{query}'")
    results = search_by_text(query, top_k=3)
    for path, score in results:
        print(f" Image: {path}, Similarity: {score:.4f}")

    # Test image-to-image search (assume a reference image exists)
    ref_image = "path/to/reference.jpg" # Replace with reference image path
    if os.path.exists(ref_image):
        print(f"\nImage-to-Image Query (reference image: {ref_image})")
        results = search_by_image(ref_image, top_k=3)
        for path, score in results:
            print(f" Image: {path}, Similarity: {score:.4f}")

4.3 Key Points and Optimization Ideas

  • Normalization: The L2 normalization in encode_image and encode_query is crucial. Without it, Faiss’s inner product index returns raw dot products, which depend on vector length and cannot be compared as similarities. After normalization, the dot product is equivalent to cosine similarity.
  • Batching: When encoding a large number of images offline, process them in fixed-size batches to gain speed without exhausting memory. processor(images=image_list, return_tensors="pt") accepts a list of images and returns one batched tensor.
  • Similarity Threshold: In real applications, it’s recommended to set a similarity threshold and filter out images below it, which reduces interference from irrelevant results. Calibrate the threshold on your own data and per retrieval mode: CLIP text-to-image cosine scores are typically much lower (often around 0.2 to 0.35 for good matches) than image-to-image scores, so a single value such as 0.7 only makes sense for image-to-image search.
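
The batching and threshold points can be sketched with two small helpers. The names `batched` and `filter_by_threshold`, the batch size, and the threshold value are hypothetical; in the retrieval script each batch of paths would be opened and passed to the processor as a list, which is left out here so the sketch runs without a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def filter_by_threshold(results, min_score):
    """Keep (path, score) pairs whose score reaches `min_score`."""
    return [(path, score) for path, score in results if score >= min_score]

# Split 7 image paths into batches of at most 3.
paths = [f"img_{i}.jpg" for i in range(7)]
batches = list(batched(paths, batch_size=3))
print([len(batch) for batch in batches])

# Drop weak matches; 0.2 is an illustrative value, not a recommendation.
hits = [("a.jpg", 0.31), ("b.jpg", 0.24), ("c.jpg", 0.12)]
print(filter_by_threshold(hits, min_score=0.2))
```

Keeping the threshold as a parameter rather than a constant makes it easy to use different values for text-to-image and image-to-image search.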

Best Practice: When your image library exceeds 100,000 images, Faiss’s IndexFlatIP (brute force search) becomes slow. Consider upgrading to more advanced index structures like IndexIVFFlat (Inverted File Index) or IndexHNSWFlat (Hierarchical Navigable Small World graph). These indexes significantly improve retrieval speed but may slightly reduce recall.

5. Practice 2: OCR + VLM for Multimodal PDF RAG Pipeline (Enterprise Scenario)

The CLIP retrieval system from Section 4 can only handle pure image scenarios. In enterprise applications, we face PDF documents containing mixed text and images. Here, we need to combine OCR (Optical Character Recognition) and VLM (Vision-Language Model) to achieve true multimodal RAG.

5.1 Overall Architecture Design

PDF Document
|
+-- Text extraction (using PyMuPDF, etc.)
|
+-- Image extraction (parse image blocks from PDF)
| |
| +-- OCR extracts text from images (PaddleOCR)
| |
| +-- VLM generates semantic labels for images (Qwen-VL / InternVL)
|
+-- Build unified vector database
| |
| +-- Text chunk embedding
| +-- (Image OCR text + Image semantic label) combined into "image text summary" then embedding
|
+-- Retrieval stage (RRF fusion)
|
+-- Generation stage: LLM generates answers with image references

5.2 Extracting Image Blocks from PDF and Processing with OCR/VLM

Due to space constraints, we provide core ideas and pseudocode snippets. For actual implementation, refer to the specific library documentation.

# Note: The following code is a sketch. The OCR call follows the PaddleOCR 2.x API,
# and vlm_model.generate() is a placeholder for whatever VLM wrapper you use
# (e.g., Qwen-VL loaded via transformers); adapt both to your library versions.
import fitz # PyMuPDF, for PDF operations
from paddleocr import PaddleOCR
from PIL import Image

# Initialize the OCR engine once; creating it for every image is slow
ocr = PaddleOCR(use_angle_cls=True, lang='ch')

# 1. Extract images from PDF
def extract_images_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    images_info = []
    for page_num, page in enumerate(doc):
        # Get all image references on the page
        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image_path = f"page_{page_num}_img_{img_index}.{image_ext}"
            with open(image_path, "wb") as f:
                f.write(image_bytes)
            # Bounding boxes of this image on the page (one per placement)
            rects = page.get_image_rects(xref)
            images_info.append({
                "image_path": image_path,
                "page_num": page_num,
                "position_on_page": [tuple(r) for r in rects]
            })
    doc.close()
    return images_info

# 2. Process images with OCR and VLM
def process_image_with_ocr_vlm(image_path, page_num, vlm_model):
    """
    Use OCR to extract text from the image, use VLM to generate semantic labels for the image
    """
    result = ocr.ocr(image_path, cls=True)

    # Extract OCR text (an image without detected text yields None)
    ocr_text = ""
    for line in result:
        if line is None:
            continue
        for word_info in line:
            ocr_text += word_info[1][0] + " "
    # OCR text can be used as the image's "text summary"

    # Use VLM to generate semantic labels
    # Assume vlm_model is a model that can accept an image and return a description
    image = Image.open(image_path)
    prompt = "Describe the main content of this image in one sentence"
    vlm_description = vlm_model.generate(prompt, image=image)
    # vlm_description e.g.: "A red car"

    # Combine OCR text and VLM description as the image's "final text representation"
    # Note: Both OCR text and VLM description are text, usable for BM25 retrieval
    # page_num is 0-based (from enumerate), so add 1 for the human-readable page
    combined_text = f"Image location: Page {page_num + 1}\nImage content description: {vlm_description}\nText in image: {ocr_text}"

    return {
        "ocr_text": ocr_text,
        "vlm_description": vlm_description,
        "combined_text": combined_text
    }

# 3. Build unified index
# Text chunks -> embedding
# Image combined_text -> embedding
# Use the same embedding model to encode and store in Faiss
# ... (refer to Section 4 for specific code)

# 4. Retrieval stage: RRF fusion
# ... (omitted; refer to Section 2 principles)

# 5. Generation stage: LLM generates answers
def generate_answer_with_images(query, retrieved_results, llm):
    """
    retrieved_results contains text fragments and image information
    """
    # Build context including image references
    context = ""
    for result in retrieved_results:
        if result.type == "text":
            context += f"Text extracted from document page {result.page}: {result.content}\n"
        elif result.type == "image":
            context += f"Image extracted from document page {result.page}, described as: {result.combined_text}\n"
            # Key: Mark image existence in context so LLM can reference later
            context += f"[Image reference: {result.image_path}]\n"

    # Let LLM generate answer, expecting it to cite images
    prompt = f"""Below is the relevant context retrieved from the knowledge base.

Please answer the user's question based on the context.
If there are images in the context, reference them in your answer, e.g., "As shown in Figure X".

Context:
{context}

User question: {query}

Answer:"""
    response = llm.generate(prompt)
    return response
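
The generation step reads attributes such as result.type, result.page, and result.image_path without saying where they come from. One possible shape for those records is sketched below. The class name `RetrievedItem`, the helper `build_context`, and the "Figure N" numbering are assumptions for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RetrievedItem:
    """One retrieval hit; field names mirror the attributes used in the pipeline sketch."""
    type: str                        # "text" or "image"
    page: int
    content: str = ""                # chunk text (text hits)
    combined_text: str = ""          # OCR text + VLM description (image hits)
    image_path: Optional[str] = None

def build_context(items: List[RetrievedItem]) -> str:
    """Render hits into a context string, numbering images as Figure 1, 2, ..."""
    lines = []
    figure_no = 0
    for item in items:
        if item.type == "text":
            lines.append(f"[Page {item.page}] {item.content}")
        elif item.type == "image":
            figure_no += 1
            lines.append(
                f"[Figure {figure_no}, page {item.page}, file {item.image_path}] "
                f"{item.combined_text}"
            )
    return "\n".join(lines)

# Hypothetical retrieval output: one text chunk and one chart image.
hits = [
    RetrievedItem(type="text", page=12, content="Gross margin peaked in Q3."),
    RetrievedItem(
        type="image",
        page=12,
        combined_text="Bar chart of quarterly gross margin",
        image_path="page_11_img_0.png",
    ),
]
context = build_context(hits)
print(context)
```

Giving each image a stable figure number in the context makes it straightforward to map a citation like "Figure 1" in the generated answer back to the image file.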

5.3 Key Considerations

  • Image Location Recording: When extracting images, you must record their exact location in the document (page number, region). This way, when generating the answer, you know which page and which image it refers to.
  • OCR vs VLM Division of Labor: OCR excels at extracting exact text from images (e.g., table numbers, nameplates, formulas), while VLM excels at semantic understanding (e.g., “This is a bar chart showing the company’s Q3 revenue”). Combining the two complements each other.
  • Performance: Preprocess images before sending them to the VLM. For example, resize large images so they fit within 512x512 pixels; otherwise, VLM processing will be very slow. Run OCR on the original resolution, since downscaling hurts text recognition. Also, cache VLM generation results to avoid repeated calls.
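
The resizing and caching advice can be sketched without any imaging or model dependency. `fit_within` only computes the target size (the actual resize would be done with an imaging library), and `DescriptionCache` wraps any callable that turns image bytes into a description. Both names, and the `fake_vlm` stand-in, are hypothetical.

```python
import hashlib

def fit_within(width, height, max_side=512):
    """Return (w, h) scaled so the longer side is at most max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

class DescriptionCache:
    """In-memory cache of VLM descriptions keyed by the SHA-256 of the image bytes."""

    def __init__(self, describe_fn):
        self._describe_fn = describe_fn  # callable: bytes -> str (the expensive VLM call)
        self._store = {}
        self.misses = 0

    def describe(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._describe_fn(image_bytes)
        return self._store[key]

def fake_vlm(image_bytes):
    """Stand-in for a real VLM call so the sketch runs without a model."""
    return f"image with {len(image_bytes)} bytes"

cache = DescriptionCache(fake_vlm)
cache.describe(b"abc")
cache.describe(b"abc")  # served from the cache, no second VLM call
print(fit_within(2048, 1024), cache.misses)
```

Keying the cache on a content hash rather than the file path means the same logo or chart embedded on many pages is described only once. A persistent store (a file or database) would be needed to keep results across runs.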

6. Advanced Tips: RRF Fusion Retrieval Optimization and Multimodal Embedding Fine-Tuning

6.1 Choosing the k Value in RRF

The choice of k in RRF affects how strongly top ranks dominate the fused result. The larger the k value, the flatter the curve 1/(rank + k): the gap between rank 1 and rank 10 shrinks, so documents that rank reasonably well in several lists are favored. The smaller the k value, the more weight the very top ranks carry, so a document ranked first by a single source can dominate the fusion.

The empirical recommendation is k=60. RRF uses only ranks, so the different score scales of vector retrieval (CLIP) and sparse retrieval (BM25 over OCR text) do not need to be normalized. It is still worth tuning k via grid search (e.g., try k=30, 60, 100) and evaluating recall on a validation set. A “trick”: if your OCR text is very short (only a few words per image), BM25 rankings are noisy; increasing k slightly (e.g., k=100) reduces the weight of any single top rank, which limits the damage a spurious BM25 hit can do.
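
The effect of k, and the grid search over it, can be illustrated on a toy validation set. Everything below is made up for demonstration: the ranks were chosen by hand so that the winner depends on k, and `accuracy_at_1` is a simplified stand-in for a real recall metric.

```python
def rrf_score(ranks, k):
    """RRF score of one document given its 1-based rank in each result list."""
    return sum(1.0 / (rank + k) for rank in ranks)

def top_doc(doc_ranks, k):
    """Return the document ID with the highest RRF score."""
    return max(doc_ranks, key=lambda doc_id: rrf_score(doc_ranks[doc_id], k))

# Toy validation set: per query, each candidate's (vector rank, BM25 rank)
# and the ID of the document judged correct. Values are illustrative only.
validation = [
    ({"x": (1, 50), "y": (5, 5)}, "y"),
    ({"p": (2, 3), "q": (1, 40)}, "p"),
    ({"m": (1, 12), "n": (6, 6)}, "m"),
]

def accuracy_at_1(k):
    """Fraction of validation queries whose top fused document is the correct one."""
    hits = sum(top_doc(doc_ranks, k) == gold for doc_ranks, gold in validation)
    return hits / len(validation)

# Grid search over candidate k values.
results = {k: accuracy_at_1(k) for k in (1, 30, 60, 100)}
best_k = max(results, key=results.get)
print(results, best_k)
```

In the first query, a tiny k lets document x win on the strength of a single first place, while larger k values prefer y, which is consistently ranked fifth. The third query shows the opposite pull: too large a k lets the consistently mediocre n overtake m. On real data, replace the toy set with your own labeled queries and a metric such as Recall@k.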

6.2 Domain-Specific Fine-Tuning of Multimodal Embeddings

The generic CLIP model may perform poorly in specific domains (e.g., medical imaging, technical drawings) because CLIP’s training data mainly consists of natural scenes and everyday objects. To improve retrieval recall in specialized fields, we can fine-tune CLIP, known as Multimodal Embedding Fine-Tuning.

Recommended Method: LoRA Fine-Tuning
Since the CLIP model is fairly large (roughly 150M parameters for ViT-B/32 and over 400M for ViT-L/14), full fine-tuning is costly. LoRA is an efficient fine-tuning method that freezes the original model and inserts a small number of trainable low-rank matrices into the model layers. You can use LoRA to fine-tune the image encoder, the text encoder, or both.

Fine-Tuning Data Preparation:

  • Collect domain-specific text-image pairs. For example, for the medical domain, collect X-ray images + diagnosis report descriptions; for the industrial domain, collect part drawings + technical parameter descriptions.
  • At least a few hundred pairs are needed.

Fine-Tuning Steps (Simplified):

  1. Load the pre-trained CLIP model.
  2. Use a LoRA library (e.g., PEFT) to configure LoRA (r=8, alpha=16).
  3. Fine-tune using a contrastive loss function, optimizing to maximize similarity between correct text-image pairs and minimize similarity with all negative pairs.
  4. Save the fine-tuned LoRA weights, loading them during inference.
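
The contrastive objective in step 3 can be written out in plain Python to show what is being optimized. This sketch computes the symmetric cross-entropy (InfoNCE) loss that CLIP-style training uses over a batch similarity matrix; it is not a training loop. The temperature of 0.07 matches CLIP's initial value, and the similarity matrices are made-up examples.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss for an N x N similarity matrix.

    sim[i][j] is the cosine similarity between image i and text j;
    matching pairs sit on the diagonal, all other entries are negatives.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image -> text: each row should put its probability mass on the diagonal.
    loss_i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same on the transposed matrix.
    cols = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax_row(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Well-aligned batch: high similarity on the diagonal.
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
# Misaligned batch: the first two pairs are swapped.
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.9]]
print(round(clip_contrastive_loss(aligned), 4), round(clip_contrastive_loss(shuffled), 4))
```

The loss is low when each image is most similar to its own text and high when it is more similar to another pair's text, which is exactly the "pull positives together, push in-batch negatives apart" behavior described in step 3. In real fine-tuning the same loss would be computed on tensors with automatic differentiation.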

6.3 Introducing Multimodal Knowledge Graph RAG

If you want to further enhance the reasoning capability of RAG, consider Multimodal Knowledge Graph RAG. The idea is as follows:

  1. Extract entities (e.g., “Company A”, “Product B”) and relationships (e.g., “produces”) from multimodal documents.
  2. Store the identified entities (possibly from OCR text or VLM) in a graph database (e.g., Neo4j).
  3. During retrieval, perform multi-hop reasoning through the knowledge graph to find deeply related document nodes.
  4. Finally, rank these nodes as candidate results.

This approach allows the RAG system to understand complex logical relationships, not just semantic similarity.
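
The multi-hop step can be illustrated with an in-memory graph instead of a graph database such as Neo4j (a deliberate simplification). The triples, relation names, and function names below are hypothetical; in a real system the triples would come from entity and relation extraction over OCR text and VLM output.

```python
from collections import deque

# Hypothetical (head, relation, tail) triples extracted from documents.
triples = [
    ("Product A", "contains", "Part B"),
    ("Part B", "supplied_by", "Supplier C"),
    ("Company A", "produces", "Product A"),
]

def build_graph(triples):
    """Build a directed adjacency map: head -> list of (relation, tail)."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
    return graph

def multi_hop(graph, start, max_hops=2):
    """Breadth-first search returning {entity: path} for entities within max_hops."""
    paths = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(paths[node]) == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor in graph.get(node, []):
            if neighbor not in paths:
                paths[neighbor] = paths[node] + [(node, relation, neighbor)]
                queue.append(neighbor)
    return paths

graph = build_graph(triples)
paths = multi_hop(graph, "Product A", max_hops=2)
print(paths["Supplier C"])
```

Starting from "Product A", the search reaches "Supplier C" in two hops via "Part B", and the returned path doubles as an explanation that can be placed in the LLM context. The hop limit keeps the candidate set small; the documents attached to the reached entities would then be ranked as in step 4.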

7. Pitfalls to Avoid: 10 Common Traps When Deploying Multimodal RAG

  1. Insufficient Image Resolution: OCR accuracy drops sharply for low-resolution images (<150 DPI). Preprocess by upscaling PDF scans or using super-resolution models.

  2. Table Recognition Requires Separate Models: Standard OCR struggles to handle table structure correctly. Use specialized table recognition models (e.g., TableMASTER, CascadeTabNet) to extract structured table data.

  3. Large Image Memory Overflow: VLM processing of high-resolution large images can easily cause OOM. Standard practice is to preprocess input images so the longest side does not exceed 512 pixels, maintaining the aspect ratio.

  4. CLIP Struggles with Artistic Fonts: CLIP’s image encoder reads non-standard fonts (e.g., handwriting, artistic fonts) unreliably. For stylized text, first use OCR to convert it to standard text and index that text.

  5. VLM Inference Latency is High: Multimodal large models are usually large and slow to infer. Use inference frameworks (e.g., vLLM) or quantized models (e.g., AWQ, GPTQ).

  6. Text-Image Alignment Errors: In multi-page PDFs, images can be split across two pages, causing retrieved images to mismatch with text descriptions. Always record the exact position (page number, bbox) of each image in the original document.

  7. OCR Text Unrelated to Image Content: Some images (e.g., decorative backgrounds) may produce meaningless OCR strings. Set filtering rules, e.g., only keep images with more than 3 lines of text.

  8. Mismatched Vector Dimensions: After fine-tuning, do not change the output vector dimension of CLIP, or it will mismatch the Faiss index dimension.

  9. Evaluation Issues: Traditional Recall@k metrics cannot measure the accuracy of generated answers that include images. Use EvalScope or multimodal QA pairs to evaluate whether the generated answer cites the correct images and whether the text description matches the image.

  10. Multilingual Support: PaddleOCR supports Chinese, but CLIP may have poorer understanding of non-English text. For multilingual support, consider multilingual CLIP versions or AltCLIP.

8. Summary and Outlook: From Single Document to the Future of Multimodal Agents

8.1 Key Points Recap

Through this article, we have fully covered the complete process of multimodal RAG from principles to practice. The core can be summarized in one formula:

Multimodal RAG = CLIP/Multimodal Embedding + RRF Fusion + VLM Generation

  1. Text-Image Hybrid Retrieval: Relies on CLIP dual-tower models to achieve text-to-image and image-to-image search, solving the “semantic alignment” problem.
  2. RRF Fusion: Compensates for vector retrieval’s inability to handle exact matches by fusing BM25, improving robustness for queries with precise vocabulary.
  3. Visually Augmented Generation: Through OCR to extract text from images and VLM to generate semantic labels, the LLM can understand and reference image content.

8.2 Future Directions

  1. Multimodal Agents: Future RAG systems will not be passive responders but active “observers.” The model can autonomously decide when to call image-to-image search to “find similar images,” when to call text retrieval to “find semantically relevant paragraphs,” or even call a camera to “look at the real world” when uncertain. This will be the fusion of multimodal RAG and agents.
  2. End-to-End Multimodal Native Models: Models like GPT-4o and Gemini have broken the “retrieve then generate” paradigm; they can directly understand multimodal inputs and even perform multimodal reasoning. In the future, RAG systems may no longer need an explicit retrieval stage but instead allow the model to directly “recall” or “reason” from knowledge.
  3. Multimodal Reasoning with Knowledge Graphs: As mentioned in Section 6, incorporating multimodal knowledge graphs into RAG allows the system to understand complex logical chains. For example: from an image of “Product A” → identify “Part B” → find “Supplier C of Part B” from the knowledge graph → generate the answer “This part is supplied by Supplier C.”

8.3 Next Steps

Don’t stay at the theoretical level. I suggest you get hands-on immediately with the following specific tasks to solidify your learning:

  1. Find a PDF annual report, PPT, or product manual with mixed text and images.
  2. Use PaddleOCR and Qwen-VL (or GPT-4o) to construct text/semantic labels for images.
  3. Embed both text and “image labels” uniformly and store in Faiss.
  4. Ask the system a question that requires referencing a chart to answer, and observe whether the system correctly retrieves and generates an answer with citations.

When you actually implement your first multimodal RAG Pipeline, you’ll find it’s not just a technical improvement, but a breakthrough in the dimension of “knowledge” itself. Good luck!