Bid Farewell to Retrieval Hallucinations! Step-by-Step Guide to Building an Enterprise-Grade RAG Data Pipeline (with One-Click Docker Deployment)

Pain Point: Using Generic Embedding Models for RAG, Getting Irrelevant Answers in Vertical Domains?

Have you ever encountered this:

You use OpenAI’s text-embedding-3-small for medical document retrieval, ask “what are the complications of diabetes,” and get back “symptoms of a cold.”
You use a generic BGE model on legal contracts, and it confuses “liability for breach of contract” with “force majeure.”
You spend days setting up LangChain + ChromaDB, but once you feed it real business data, the results are disastrous.

The core reason isn’t that the model isn’t good enough, but that your RAG data pipeline isn’t built correctly.

A production-grade RAG system goes far beyond the three steps of “split text → vectorize → store in vector database.” You need:

Multi-format file extraction (PDF/Word/PPT/Excel/HTML…)
Intelligent chunking strategies (not just cutting by word count)
Domain-specific Embedding model (BGE-M3 deployed locally)
Dual-write storage architecture (MySQL metadata + Milvus vector database)
Special table handling (the biggest blind spot of traditional solutions)

By the end of this article, you will have:

✅ A complete, runnable RAG data processing Pipeline
✅ Docker Compose one-click startup of the full-stack environment
✅ Automatic extraction capability for 9 file formats
✅ BGE-M3 1024-dimension local vectorization
✅ Triple storage dual-write: Milvus + MySQL + MinIO

Pre-flight Checklist

Before you start, confirm your environment meets the following requirements:

Hardware Requirements

Component	Minimum	Recommended	Notes
CPU	4 cores	8+ cores	Required for PaddleOCR and BGE-M3 inference
RAM	16 GB	32+ GB	PaddleOCR-VL loading requires ~4-6GB
GPU VRAM	None (CPU can run)	≥12 GB	Needed for BGE-M3 fine-tuning, inference can run on CPU
Disk	20 GB free	50+ GB	Milvus data + model files + raw files

💡 Tip: No GPU? No problem. This project supports CPU mode for BGE-M3 inference, just slower. If you only have CPU, set EMBEDDING_DEVICE to "cpu".

Software Environment

Software	Version Requirement	Validation Command
Python	3.10.x	`python --version`
Docker	≥24.0	`docker --version`
Docker Compose	≥2.20	`docker compose version`
Git	Latest	`git --version`

# Quick environment check
echo "=== Python ===" && python --version
echo "=== Docker ===" && docker --version
echo "=== Docker Compose ===" && docker compose version

Architecture Overview

First, check the overall architecture to understand what we’re building:


Figure 1: Enterprise RAG Offline Data Processing Architecture

See internal article: “RAG Offline: Multi-source Heterogeneous Data Cleaning and Deduplication Strategies” — cleaning, deduplication, and quality gates before ingestion

Step 1: Docker One-Click Startup of Full-Stack Environment

We use Docker Compose to start three storage services simultaneously:

docker-compose.yml

version: '3.8'

services:
  milvus:
    image: milvusdb/milvus:v2.4-latest
    container_name: rag-milvus
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
      - "9091:9091"
    environment:
      ETCD_USE_EMBED: "true"
      ETCD_DATA_DIR: "/var/lib/milvus/etcd"
      COMMON_STORAGETYPE: "local"
    volumes:
      - milvus_data:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3

  mysql:
    image: mysql:8.0
    container_name: rag-mysql
    ports:
      - "3306:3306"
    environment:
      MYSQL_ROOT_PASSWORD: "rag_password_2024"
      MYSQL_DATABASE: "rag_base_multimodal"
      MYSQL_CHARSET: "utf8mb4"
    command:
      - --character-set-server=utf8mb4
      - --collation-server=utf8mb4_unicode_ci
      - --max_connections=500
      - --innodb_buffer_pool_size=512M
    volumes:
      - mysql_data:/var/lib/mysql
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost", "-prag_password_2024"]
      interval: 30s
      timeout: 10s
      retries: 3

  minio:
    image: minio/minio:latest
    container_name: rag-minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: "rag_minio_admin"
      MINIO_ROOT_PASSWORD: "rag_minio_password_2024"
    command: server /data --console-address ":9001"
    volumes:
      - minio_data:/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  milvus_data:
  mysql_data:
  minio_data:

Start Services

# Navigate to project directory
cd /path/to/rag-base-data

# Start all services in background
docker compose up -d

# Check service status (wait for healthy state)
docker compose ps

# View logs (for troubleshooting)
docker compose logs -f milvus
docker compose logs -f mysql

Expected output (healthy indicates readiness):

NAME         IMAGE                          STATUS                     PORTS
rag-milvus   milvusdb/milvus:v2.4-latest    Up 2 minutes (healthy)     0.0.0.0:19530->19530/tcp, 0.0.0.0:9091->9091/tcp
rag-mysql    mysql:8.0                      Up 2 minutes (healthy)     0.0.0.0:3306->3306/tcp
rag-minio    minio/minio:latest             Up 2 minutes (healthy)     0.0.0.0:9000->9000/tcp, 0.0.0.0:9001->9001/tcp
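
If you would rather check readiness from a script than read the docker compose ps output, a minimal sketch like the one below probes the three published ports. The helper names (port_open, check_services) are made up for this illustration and are not part of the project. An open port only shows that the container is listening; it does not replace the health checks defined in the Compose file.

```python
import socket

# Host ports published by the docker-compose.yml above.
SERVICES = {"milvus": 19530, "mysql": 3306, "minio": 9000}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost", services: dict = SERVICES) -> dict:
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name:<8} {'reachable' if ok else 'NOT reachable'}")
```

Run it after docker compose up -d; any service reported as not reachable is the one whose logs you should look at first.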

Step 2: Clone Project and Install Dependencies

# Enter the project directory (clone the repository first if you have not already)
cd rag-base-data

# Create virtual environment (Python 3.10 recommended)
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r requirements.txt

Core Dependencies

Package	Version	Purpose
`llama-index`	≥0.11.0	RAG framework core
`pymilvus`	≥2.3.0	Milvus vector database client
`paddleocr[doc-parser]`	≥3.3.0	PDF OCR engine (image-based)
`PyMuPDF`	≥1.23.0	Fast PDF text extraction
`pdfplumber`	≥0.10.0	Precise PDF table extraction
`sqlalchemy`	≥2.0.0	MySQL ORM
`pymysql`	≥1.1.0	MySQL driver
`sentence-transformers`	≥2.7.0	BGE-M3 Embedding model

⚠️ Note: The PaddleOCR package is large (~500MB) and installation may take a while. If you don’t need PDF OCR functionality, you can remove paddleocr-related dependencies from requirements.txt.

Step 3: Configure Environment Variables

The project uses Pydantic Settings for configuration, supporting environment variable overrides. Create a .env file:

# .env file (place in project root)

# Milvus configuration
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION_NAME=rag_documents
MILVUS_DIMENSION=1024

# MySQL configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=root
MYSQL_PASSWORD=rag_password_2024
MYSQL_DATABASE=rag_base_multimodal

# MinIO configuration
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=rag_minio_admin
MINIO_SECRET_KEY=rag_minio_password_2024
MINIO_BUCKET=knowledge-bucket

# Embedding configuration
EMBEDDING_MODEL_PATH=./data_base/llm_models/bge-m3
EMBEDDING_DEVICE=cpu
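
To make the "environment variable overrides" idea concrete, here is a small standard-library sketch of the same pattern. It is a stand-in, not the project's actual settings class: the Settings dataclass and load_settings function are hypothetical, only four of the keys above are mirrored, and unlike Pydantic Settings it reads variables that are already in the environment rather than parsing the .env file itself.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical, trimmed-down mirror of the .env keys above."""
    milvus_host: str
    milvus_port: int
    milvus_dimension: int
    embedding_device: str


def load_settings(env=os.environ) -> Settings:
    """Read settings from a mapping, falling back to the tutorial's values."""
    return Settings(
        milvus_host=env.get("MILVUS_HOST", "localhost"),
        milvus_port=int(env.get("MILVUS_PORT", "19530")),
        milvus_dimension=int(env.get("MILVUS_DIMENSION", "1024")),
        embedding_device=env.get("EMBEDDING_DEVICE", "cpu"),
    )


if __name__ == "__main__":
    # Values exported in the shell win over the defaults.
    print(load_settings())
```

Converting the port and dimension to int at load time means a typo such as MILVUS_PORT=1953O fails immediately instead of surfacing later as a confusing connection error.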

See internal article: “RAG in Production: Deployment and Performance Monitoring Practices” — monitoring and capacity planning after Pipeline goes live

Step 4: Download BGE-M3 Model

BGE-M3 is a multilingual embedding model developed by BAAI, supporting three retrieval modes: Dense, Sparse, and ColBERT.

# Method 1: Use HuggingFace CLI to download (recommended)
pip install huggingface_hub
huggingface-cli download BAAI/bge-m3 \
    --local-dir ./data_base/llm_models/bge-m3

# Method 2: Use git lfs
git lfs install
git clone https://huggingface.co/BAAI/bge-m3 ./data_base/llm_models/bge-m3

After download, the model directory structure looks like:

data_base/llm_models/bge-m3/
├── config.json              # Model configuration (hidden layer dimensions, etc.)
├── pytorch_model.bin        # Model weights (~2.2GB)
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json
├── special_tokens_map.json
├── sentence_bert_config.json # Sentence-Transformers configuration
├── modules.json             # Module definitions
├── 1_Pooling/
│   └── config.json          # Pooling layer configuration (CLS pooling)
├── colbert_linear.pt        # ColBERT linear layer weights
└── sparse_linear.pt         # Sparse weights

💡 Why BGE-M3?

Supports 100+ languages (excellent Chinese performance)

Outputs 1024-dimensional vectors (high precision)

Maximum context length of 8192 tokens (friendly for long documents)

Top-tier performance on MTEB multilingual leaderboard

Step 5: Throw a PDF In and See the Full Data Flow

Now let’s run it for real! Prepare a test PDF file:

# Copy your test file into the data directory under the name used below
cp your-test-file.pdf data_base/data/raw/test001.pdf

# Run the full Pipeline (dual-write ingestion mode)
python main.py \
    --file data_base/data/raw/test001.pdf \
    --ingest \
    --business-tag "AI Technology Foundation" \
    --chunk-profile default

CLI Parameter Description

Parameter	Description	Example
`--file`	Single file path	`/path/to/doc.pdf`
`--dir`	Directory path (batch)	`/path/to/data/`
`--ingest`	Enable dual-write ingestion mode	(flag)
`--business-tag`	Business tag (data isolation)	`"Medical Knowledge Base"`
`--chunk-profile`	Chunking strategy	`default` / `source_first` / `precision`
`--force-vl`	Force PDF to use OCR	(flag)
`--extract-only`	Extract only, no ingestion	(flag)

Sample Run Log Output

2026-04-23 00:30:35 | INFO  | Log system initialized | Log directory: .../logs
2026-04-23 00:30:36 | INFO  | LlamaRAGPipeline initialized
2026-04-23 00:30:36 | INFO  | CustomRAGLoader initialized: force_vl=False
2026-04-23 00:30:37 | INFO  | Extracting file: test001.pdf (format: PDF)
2026-04-23 00:30:38 | INFO  | PDF type detected: text_pdf (using PyMuPDF fast mode)
2026-04-23 00:30:39 | INFO  | Extraction complete: 3 text blocks, 2 tables, 1 image
2026-04-23 00:30:40 | INFO  | TableParseTransform: 2 tables validated successfully
2026-04-23 00:30:40 | INFO  | SymbolMapTransform: Replaced 5 symbols (±,≥,≤...)
2026-04-23 00:30:41 | INFO  | ChunkRouter initialized: profile=default, default strategy=parent_child
2026-04-23 00:30:42 | INFO  | Chunk routing completed: 6 documents → 28 chunks (parent=5, child=18, table=5)
2026-04-23 00:30:43 | INFO  | Embedding model built: ./data_base/llm_models/bge-m3
2026-04-23 00:31:05 | INFO  | Milvus inserted 28 vectors
2026-04-23 00:31:06 | INFO  | MySQL wrote 28 records
2026-04-23 00:31:06 | INFO  | Execution result: {"status": "success", "mysql_count": 28, "milvus_count": 28}

From the log, you can see the complete processing flow:

Extraction Phase: PDF identified as text-based → PyMuPDF extraction → got 3 text + 2 tables + 1 image
Cleaning Phase: Table validation passed + symbol mapping replacement
Chunking Phase: Parent-child chunking strategy → 6 documents became 28 chunks (5 parent + 18 child + 5 table)
Ingestion Phase: BGE-M3 vectorization → 28 vectors written to Milvus + 28 records written to MySQL

Step 6: Verify Milvus Vector Writes

After successful writing, let’s verify that vector retrieval works correctly:

from data_base.storage.milvus_store import milvus_store
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Initialize
milvus_store.init()
milvus_store.create_collection()

# Build Embedding model
embed_model = HuggingFaceEmbedding(
    model_name="./data_base/llm_models/bge-m3",
    device="cpu",
)

# Generate query vector
query = "How does the attention mechanism in Transformers work?"
query_vector = embed_model.get_query_embedding(query)

# Execute search
results = milvus_store.search(
    query_vector=query_vector,
    limit=5,
    business_tag="AI Technology Foundation",
)

# Print results
for i, r in enumerate(results):
    print(f"[{i+1}] score={r['distance']:.4f} | doc_id={r['doc_id']}")

Expected output:

[1] score=0.8923 | doc_id=1723847192837647001
[2] score=0.8541 | doc_id=1723847192837647002
[3] score=0.8217 | doc_id=1723847192837647003
[4] score=0.7892 | doc_id=1723847192837647004
[5] score=0.7534 | doc_id=1723847192837647005

🔑 The closer the score is to 1, the higher the similarity. The collection uses the COSINE metric, so the value Milvus returns in the distance field is a cosine similarity in the range [-1, 1]. For similar texts, it’s typically in the [0.7, 1.0] range.
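
If the score still feels abstract, this is the computation behind it, written in plain Python for two small vectors. It is only an illustration of the formula (dot product divided by the product of the vector lengths); Milvus computes the same quantity over the 1024-dimensional BGE-M3 vectors.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
    print(cosine_similarity([1.0, 0.0], [-3.0, 0.0]))  # opposite direction
```

Because the length of each vector is divided out, only direction matters: a vector and a scaled copy of itself score 1.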

See internal article: “RAG Evaluation: Full-chain Metrics Design and Effectiveness Assessment System” — how to measure whether retrieval and generation are “hallucinating”

Core Code Explanation

Main Entry: main.py

The project entry is very clean, using argparse to provide a CLI interface:

import argparse

from data_base.extensions.llama_pipeline import LlamaRAGPipeline


def main():
    parser = argparse.ArgumentParser(description="RAG Data Layer Pipeline")
    parser.add_argument("--file", type=str, help="Single file path")
    parser.add_argument("--dir", type=str, help="Directory path")
    parser.add_argument("--ingest", action="store_true", help="Dual-write ingestion")
    parser.add_argument("--business-tag", type=str, default="", help="Business tag")
    parser.add_argument("--chunk-profile", type=str, default="default",
                        choices=["default", "source_first", "precision"])
    parser.add_argument("--force-vl", action="store_true", help="Force PDF to use OCR")
    # args.collection is read below, so the parser must define it. The flag name and
    # its default (MILVUS_COLLECTION_NAME from .env) are assumptions.
    parser.add_argument("--collection", type=str, default="rag_documents",
                        help="Milvus collection name")

    args = parser.parse_args()
    if not args.file and not args.dir:
        parser.error("one of --file or --dir is required")

    pipeline = LlamaRAGPipeline(
        collection_name=args.collection,
        force_vl=args.force_vl,
    )

    result = pipeline.run_dual_write_ingest(
        file_path=args.file,
        directory=args.dir,
        business_tag=args.business_tag,
        chunk_profile=args.chunk_profile,
    )
    print(result)


if __name__ == "__main__":
    main()

Pipeline Core Flow

LlamaRAGPipeline is the overall orchestrator of data processing. Its internal flow:

class LlamaRAGPipeline:
    def run_dual_write_ingest(self, file_path, directory, 
                               business_tag="", chunk_profile="default"):
        # Step 1: File extraction (automatic routing for 9 formats)
        loader = CustomRAGLoader(force_vl=self.force_vl)
        documents = loader.load_data(file_path=file_path)
        
        # Step 2: Transform cleaning (5 processors chained)
        transforms = [
            TableParseTransform(),    # Table validation
            SymbolMapTransform(),     # Symbol mapping ±→positive/negative tolerance
            OCRCorrectTransform(),    # OCR error correction
            DataCleanTransform(),     # Noise cleaning
            DataNormalizeTransform(), # Normalization
        ]
        for t in transforms:
            documents = t(documents)
        
        # Step 3: Chunk routing (auto-select strategy by content type)
        router = ChunkRouter(profile=get_chunk_profile(chunk_profile))
        chunks = router.route(documents)
        
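        # Step 4: Dual-write ingestion (MySQL + Milvus + MinIO)
        # Store the routed chunks from Step 3, not the pre-chunking documents
        # (the sample log shows 28 chunks written, not 6 documents)
        dual_write = DualWriteService()
        result = dual_write.store_documents(
            documents=chunks,
            business_tag=business_tag,
            embedding_fn=self._build_embed_model(),
            chunk_profile_name=chunk_profile,
        )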
        return result
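
The transform chain in Step 2 is just a list of callables, each taking a list of documents and returning a new list. The sketch below shows that pattern with two toy transforms. The class bodies, the Document dataclass, and the symbol wording are illustrative assumptions; the project's real TableParseTransform, SymbolMapTransform, and the others are not reproduced here.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Document:
    """Simplified stand-in for the project's Document object."""
    text: str
    metadata: dict = field(default_factory=dict)


class SymbolMapTransform:
    """Replace symbols with plain-text equivalents (mapping is illustrative)."""
    SYMBOLS = {
        "±": " plus or minus ",
        "≥": " greater than or equal to ",
        "≤": " less than or equal to ",
    }

    def __call__(self, documents):
        out = []
        for doc in documents:
            text = doc.text
            for symbol, words in self.SYMBOLS.items():
                text = text.replace(symbol, words)
            out.append(replace(doc, text=text))
        return out


class WhitespaceCleanTransform:
    """Collapse runs of whitespace into single spaces."""

    def __call__(self, documents):
        return [replace(doc, text=" ".join(doc.text.split())) for doc in documents]


def apply_transforms(documents, transforms):
    """Feed the output of each transform into the next one, in order."""
    for transform in transforms:
        documents = transform(documents)
    return documents


if __name__ == "__main__":
    docs = [Document("tolerance ±0.5,  load ≥ 10")]
    print(apply_transforms(docs, [SymbolMapTransform(), WhitespaceCleanTransform()]))
```

Order matters in a chain like this: the symbol replacement introduces extra spaces, so the whitespace cleanup runs after it.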

Supported File Formats

The project includes extractors for 9 formats, covering the majority of enterprise file types:

Format	Extractor	Special Handling
PDF	PdfExtractor	Automatically detect text/image type, dual engine switching
Word (.docx/.doc)	WordExtractor	Style preservation, heading hierarchy recognition
PPT (.pptx/.ppt)	PptExtractor	Slide-by-slide extraction
Excel (.xlsx/.xls/.csv)	ExcelExtractor	Multiple sheet support
HTML	HtmlExtractor	Content extraction, tag filtering
TXT	TxtExtractor	Encoding auto-detection
ZIP/RAR	ArchiveExtractor	Recursive extraction after decompression
Image	ImageExtractor	MLX-VLM multimodal understanding
Email	EmailExtractor	Header analysis + body extraction

Each extractor’s output is unified into a standard Document object, so subsequent Transforms and ChunkRouter don’t need to care about the source format.
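
One common way to get that "unified output" behavior is a registry keyed by file extension. The sketch below is an assumption about the general shape, not the project's code: register, extract, and the single .txt extractor are hypothetical names, and the real extractors do far more than read text.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Simplified stand-in for the standard Document object."""
    text: str
    metadata: dict = field(default_factory=dict)


# Maps a lowercase file extension to the function that handles it.
_REGISTRY = {}


def register(*extensions):
    """Decorator that registers an extractor for one or more extensions."""
    def decorator(fn):
        for ext in extensions:
            _REGISTRY[ext.lower()] = fn
        return fn
    return decorator


@register(".txt")
def extract_txt(path: Path) -> list:
    text = path.read_text(encoding="utf-8", errors="replace")
    return [Document(text, {"source": path.name, "format": "txt"})]


def extract(path: Path) -> list:
    """Route a file to its extractor based on the extension."""
    extractor = _REGISTRY.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"Unsupported file format: {path.suffix or path.name}")
    return extractor(path)
```

With this shape, supporting a new format means writing one function and decorating it; nothing downstream of extract has to change.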

Troubleshooting and Performance Tuning

Issue 1: CUDA Out of Memory (GPU VRAM insufficient)

Symptom: Error at runtime RuntimeError: CUDA out of memory

Solution:

# Option A: Switch to CPU mode (simplest)
# Set in .env:
EMBEDDING_DEVICE=cpu

# Option B: Reduce batch size
# Modify in settings.py:
EMBEDDING_BATCH_SIZE: int = 8   # from 32 down to 8

# Option C: Enable gradient checkpointing (during fine-tuning)
# Add to training script:
model.gradient_checkpointing_enable()

Issue 2: Milvus Connection Failed

Symptom: ConnectionError: Failed to connect to Milvus

Troubleshooting steps:

# 1. Confirm containers are running
docker compose ps

# 2. Confirm port is accessible
curl http://localhost:9091/healthz

# 3. Check Milvus logs
docker compose logs milvus --tail=50

# 4. Confirm something is listening on port 19530 (if it is, check firewall rules next)
netstat -an | grep 19530

Issue 3: PaddleOCR Installation Fails or Loads Slowly

Symptom: pip install paddleocr errors, or first load takes over 5 minutes

Solution:

# If PDF OCR functionality is not needed, you can skip
# In PdfExtractor, set use_ocr=False

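# Or install PaddlePaddle first, then PaddleOCR without its automatic dependencies
# (quote the specifiers so the shell does not treat ">" as a redirect or expand "[...]")
pip install "paddlepaddle>=3.2.1"
pip install "paddleocr[doc-parser]" --no-deps
# Manually install other dependencies of paddleocr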

Issue 4: MySQL Character Encoding Garbled

Symptom: Chinese characters display as ???? after storing in MySQL

Solution: Ensure the following three points:

MySQL container startup command includes --character-set-server=utf8mb4
SQLAlchemy connection URL includes ?charset=utf8mb4
Database and table both use utf8mb4 charset
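
For the second point, this is roughly what the connection URL looks like when built by hand. The mysql_url helper is a hypothetical convenience for this article; the URL format itself (mysql+pymysql://user:password@host:port/database?charset=utf8mb4) is the standard SQLAlchemy form for the PyMySQL driver.

```python
from urllib.parse import quote_plus


def mysql_url(user: str, password: str, host: str, port: int,
              database: str, charset: str = "utf8mb4") -> str:
    """Build a SQLAlchemy URL for PyMySQL with an explicit charset."""
    # quote_plus escapes characters such as "@" or "/" that would
    # otherwise break URL parsing when they appear in credentials.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?charset={charset}"
    )


if __name__ == "__main__":
    print(mysql_url("root", "rag_password_2024", "localhost", 3306,
                    "rag_base_multimodal"))
```

Pass the resulting string to sqlalchemy.create_engine. If the charset parameter is missing, the client connection can fall back to a different encoding even when the server and tables are utf8mb4.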

Performance Tuning Reference Table

Parameter	Default	Lower for low VRAM	Higher for high VRAM	Impact
`EMBEDDING_BATCH_SIZE`	32	8	64	Batch vectorization speed
`PARENT_CHUNK_SIZE`	1024	512	2048	Parent chunk size
`CHILD_CHUNK_SIZE`	128	64	256	Child chunk granularity
`MILVUS HNSW.M`	16	8	32	Index recall vs memory
`MILVUS HNSW.efConstruction`	200	100	500	Index build speed vs quality
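
The two HNSW rows translate into an index-parameter dictionary of the shape pymilvus accepts when creating an index. The sketch below only builds that dictionary from the values in the table; the profile names and the hnsw_index_params helper are invented for illustration, and how the project actually passes these values to Milvus is not shown in this article.

```python
# M and efConstruction values taken from the tuning table above.
HNSW_PROFILES = {
    "low_memory": {"M": 8, "efConstruction": 100},
    "default": {"M": 16, "efConstruction": 200},
    "high_recall": {"M": 32, "efConstruction": 500},
}


def hnsw_index_params(profile: str = "default", metric_type: str = "COSINE") -> dict:
    """Return an HNSW index-parameter dict for the chosen profile."""
    if profile not in HNSW_PROFILES:
        raise ValueError(f"Unknown profile: {profile!r}")
    return {
        "index_type": "HNSW",
        "metric_type": metric_type,
        "params": dict(HNSW_PROFILES[profile]),  # copy so callers can edit safely
    }


if __name__ == "__main__":
    print(hnsw_index_params("low_memory"))
```

A larger M means more graph links per vector (better recall, more memory), while a larger efConstruction means a more thorough and slower index build.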

FAQ

Q: Can I run without a GPU?

A: Yes. This project fully supports CPU mode for BGE-M3 inference. Simply set EMBEDDING_DEVICE to cpu in your .env file. As a rough guide, expect CPU throughput to be around 1/10 to 1/20 of a GPU’s, depending on your hardware, which is still sufficient for small amounts of data. If you have a large dataset (tens of thousands of documents), consider renting a cloud GPU (e.g., AutoDL, Alibaba Cloud PAI).

Q: What file formats are supported?

A: Currently supports 9 formats: PDF, Word (.docx/.doc), PPT (.pptx/.ppt), Excel (.xlsx/.xls/.csv), HTML, TXT, ZIP/RAR archives, images, emails. If you need additional formats (e.g., Markdown, ePub), simply add a new Extractor class and register it with the ExtractorDispatcher.

Q: How do I switch the Embedding model?

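A: Modify the EMBEDDING_MODEL_PATH in the .env file to point to the new model directory. The project is based on the Sentence-Transformers interface and is compatible with locally stored models published in that format (e.g., BAAI/bge-large-zh-v1.5, intfloat/e5-mistral-7b-instruct). API-only models such as OpenAI’s text-embedding-3-small cannot be loaded this way. Note that the output dimension of the new model must match the MILVUS_DIMENSION configuration, and a Milvus collection created with a different dimension has to be re-created and re-embedded.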

Q: What are the differences between the three `--chunk-profile` options?

default: Parent-child chunking — balances retrieval accuracy and context completeness, suitable for most scenarios
source_first: No chunking, source tracing — keeps original document integrity, suitable for legal/contract/financial reporting scenarios
precision: Precise fine-grained — finer chunking granularity, suitable for FAQ/customer service Q&A scenarios

For detailed comparison, read the 3rd article in this series, “How to Chunk for RAG Without Losing Context?”
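
As a quick taste of the default profile, here is a minimal parent-child chunking sketch. It is a simplification under stated assumptions: sizes are counted in characters, splits ignore sentence boundaries, and the Chunk dataclass and parent_child_chunks function are hypothetical. The default sizes reuse the PARENT_CHUNK_SIZE and CHILD_CHUNK_SIZE values from the tuning table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a parent chunk


def parent_child_chunks(text: str, parent_size: int = 1024,
                        child_size: int = 128) -> list:
    """Split text into parent chunks, then split each parent into children."""
    if parent_size <= 0 or child_size <= 0:
        raise ValueError("chunk sizes must be positive")
    chunks = []
    for p_idx, p_start in enumerate(range(0, len(text), parent_size)):
        parent_text = text[p_start:p_start + parent_size]
        parent_id = f"p{p_idx}"
        chunks.append(Chunk(parent_id, parent_text))
        for c_idx, c_start in enumerate(range(0, len(parent_text), child_size)):
            child_text = parent_text[c_start:c_start + child_size]
            chunks.append(Chunk(f"{parent_id}-c{c_idx}", child_text, parent_id))
    return chunks


if __name__ == "__main__":
    for chunk in parent_child_chunks("x" * 300, parent_size=200, child_size=80):
        print(chunk.chunk_id, len(chunk.text), chunk.parent_id)
```

The usual idea is that the small child chunks are embedded and searched for precision, while parent_id lets you hand the larger parent chunk to the LLM so the answer has enough surrounding context.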

Q: When should I use `--force-vl`?

A: When your PDF is a scanned document or image-based PDF (text embedded in images), regular text extractors cannot read the content. In this case, add --force-vl to force PaddleOCR-VL visual recognition. The cost is 5-10x slower speed and higher memory usage.

Q: What is the relationship between BGE-M3 and the MTEB leaderboard?

A: MTEB (Massive Text Embedding Benchmark) is a widely used benchmark for evaluating embedding models. BGE-M3 has ranked among the top models on the MTEB multilingual leaderboard and performs especially well on Chinese tasks, although leaderboard positions change over time. Choosing a highly ranked MTEB model gives you a good starting point, but fine-tuning in vertical domains is still necessary. This will be detailed in the 4th article of this series.

Q: What if RAG performance is poor without fine-tuning?

A: Generic models do underperform in vertical domains. Try these in priority order:

First, optimize the chunking strategy (switch to parent-child chunking)
Experiment with different top-k and similarity threshold values
Add keyword retrieval (BM25) for multi-path recall
Finally, consider domain fine-tuning (see the 4th tutorial)
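
For the multi-path recall step, you need some way to merge the vector results with the BM25 results. One widely used option is Reciprocal Rank Fusion (RRF), the topic of Chapter 7. The sketch below is a generic RRF implementation, not this project's code; the function name is hypothetical and k=60 is the constant commonly used with RRF, so treat it as a starting value to tune.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of doc IDs into one list of (doc_id, score).

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]  # e.g. vector search results, best first
    bm25_hits = ["b", "d", "a"]   # e.g. keyword search results, best first
    for doc_id, score in reciprocal_rank_fusion([dense_hits, bm25_hits]):
        print(doc_id, round(score, 5))
```

RRF only looks at ranks, not raw scores, which is why it works for combining cosine similarities and BM25 scores that live on completely different scales.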

Q: How do I handle batch import of large numbers of files?

A: Use the --dir parameter to specify a directory for batch processing:

python main.py \
    --dir /path/to/your/documents/ \
    --ingest \
    --business-tag "Enterprise Knowledge Base" \
    --chunk-profile default

The Pipeline will automatically traverse all supported file formats in the directory, processing them one by one.
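
Conceptually, the traversal boils down to walking the directory and keeping files whose extension is supported. The sketch below illustrates that with pathlib. It is an assumption about the behavior rather than the project's implementation: the function name is hypothetical, recursion into subdirectories is assumed, and the extension set only covers the extensions named explicitly in the format table (image and email extensions are left out because the article does not list them).

```python
from pathlib import Path

# Extensions named in the "Supported File Formats" table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt",
    ".xlsx", ".xls", ".csv", ".html", ".txt", ".zip", ".rar",
}


def iter_supported_files(directory, extensions=SUPPORTED_EXTENSIONS):
    """Yield supported files under directory, recursively, in sorted order."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(str(directory))
    for path in sorted(root.rglob("*")):
        # Compare lowercase suffixes so "REPORT.PDF" is picked up too.
        if path.is_file() and path.suffix.lower() in extensions:
            yield path


if __name__ == "__main__":
    for file_path in iter_supported_files("."):
        print(file_path)
```

Listing the files first is also a cheap dry run: you can see exactly what a batch import would pick up before spending time on embedding.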

Resource Download and Interaction

Final Words

What problems have you encountered while building your RAG system? Feel free to leave a comment, and I’ll respond to each one.

If you comment, these details help me give a useful answer:

What is your business scenario? (Medical/Legal/Finance/Education/Customer Service?)
What tech stack are you currently using?
Which part is the biggest headache? (Inaccurate extraction? Unreasonable chunking? Irrelevant retrieval?)

Next Article Preview: “Always Losing Tables in PDF Extraction? Practical Hybrid Solution with PyMuPDF + PaddleOCR-VL” — We’ll dive deep into the dual-engine architecture of PDF extraction and the MLX-VLM acceleration solution exclusive to Apple Silicon users. Stay tuned!

This article is part of the Enterprise RAG Data Pipeline Hands-on Topic (8 engineering practice articles, to be read alongside the RAG Full-Chain Theory Series).

Articles in This Topic

Chapter	Title
Chapter 1	Bid Farewell to Retrieval Hallucinations! Step-by-Step Guide to Building an Enterprise-Grade RAG Data Pipeline (with One-Click Docker Deployment) (this article)
Chapter 2	Always Losing Tables in PDF Extraction? Practical Hybrid Solution with PyMuPDF + PaddleOCR-VL (Including MLX Acceleration)
Chapter 3	How to Chunk for RAG Without Losing Context? 5 Strategies from Beginner to Production Grade (with Decision Tree)
Chapter 4	BGE-M3 Local Fine-Tuning: From Scratch to Production Deployment (with Full Code)
Chapter 5	Milvus Production Environment Collection Design + HNSW Tuning Practical Guide
Chapter 6	Table 4-Level Vectorization: Let RAG Systems Truly Understand Structured Data
Chapter 7	RRF Multi-Fusion Ranking: The Secret Weapon to Improve RAG Retrieval Accuracy by 30%+
Chapter 8	MySQL+Milvus+MinIO Triple Storage Dual-Write Architecture: Building an Enterprise RAG Data Foundation

Internal Theoretical Extensions

The following articles are from the RAG Full-Chain Theory Series, helping you understand the concepts and methodologies this topic relies on:

“RAG Offline: Multi-source Heterogeneous Data Cleaning and Deduplication Strategies” — cleaning, deduplication, and quality gates before ingestion
“RAG in Production: Deployment and Performance Monitoring Practices” — monitoring and capacity planning after Pipeline goes live
“RAG Evaluation: Full-chain Metrics Design and Effectiveness Assessment System” — how to measure whether retrieval and generation are “hallucinating”