Pain Point: Using Generic Embedding Models for RAG, Getting Irrelevant Answers in Vertical Domains?
Have you ever encountered this:
- Using OpenAI’s text-embedding-3-small for medical document retrieval, asking “what are the complications of diabetes,” getting “symptoms of a cold”
- Using a generic BGE model for legal contracts, it confuses “liability for breach of contract” with “force majeure”
- Spent days setting up LangChain + ChromaDB, but when you throw real business data into it, the results are disastrous
The core reason isn’t that the model isn’t good enough, but that your RAG data pipeline isn’t built correctly.
A production-grade RAG system goes far beyond the three steps of “split text → vectorize → store in vector database.” You need:
- Multi-format file extraction (PDF/Word/PPT/Excel/HTML…)
- Intelligent chunking strategies (not just cutting by word count)
- Domain-specific Embedding model (BGE-M3 deployed locally)
- Dual-write storage architecture (MySQL metadata + Milvus vector database)
- Special table handling (the biggest blind spot of traditional solutions)
By the end of this article, you will have:
- ✅ A complete, runnable RAG data processing Pipeline
- ✅ Docker Compose one-click startup of the full-stack environment
- ✅ Automatic extraction capability for 9 file formats
- ✅ BGE-M3 1024-dimension local vectorization
- ✅ Triple storage dual-write: Milvus + MySQL + MinIO
Pre-flight Checklist
Before you start, confirm your environment meets the following requirements:
Hardware Requirements
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU | 4 cores | 8+ cores | Required for PaddleOCR and BGE-M3 inference |
| RAM | 16 GB | 32+ GB | PaddleOCR-VL loading requires ~4-6GB |
| GPU VRAM | None (CPU can run) | ≥12 GB | Needed for BGE-M3 fine-tuning; inference can run on CPU |
| Disk | 20 GB free | 50+ GB | Milvus data + model files + raw files |
💡 Tip: No GPU? No problem. This project supports CPU mode for BGE-M3 inference; it is just slower. If you only have a CPU, set EMBEDDING_DEVICE to "cpu".
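As a rough illustration of the fallback described in the tip, here is a minimal sketch of how a device setting could be resolved. Only the EMBEDDING_DEVICE name comes from this article; the function `resolve_device` and its `cuda_available` argument are hypothetical, and in a real setup that flag would typically come from something like `torch.cuda.is_available()`.

```python
import os
from typing import Mapping, Optional


def resolve_device(env: Optional[Mapping[str, str]] = None,
                   cuda_available: bool = False) -> str:
    """Pick the inference device: an explicit EMBEDDING_DEVICE wins, else fall back."""
    env = os.environ if env is None else env
    requested = env.get("EMBEDDING_DEVICE", "").strip().lower()
    if requested == "cpu":
        return "cpu"
    if requested.startswith("cuda"):
        # Fall back to CPU rather than crash when no GPU is visible.
        return requested if cuda_available else "cpu"
    # Nothing (or something unrecognized) requested: use the GPU only if present.
    return "cuda" if cuda_available else "cpu"
```

Falling back to CPU without raising keeps a laptop run working with the same .env file that a GPU server uses.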
Software Environment
| Software | Version Requirement | Validation Command |
|---|---|---|
| Python | 3.10.x | python --version |
| Docker | ≥24.0 | docker --version |
| Docker Compose | ≥2.20 | docker compose version |
| Git | Latest | git --version |
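If you would rather script the check than eyeball the output of each validation command, the comparison can be sketched as below. The helper names `parse_version` and `meets_minimum` are made up for this example; you would feed them the text printed by the commands in the table.

```python
import re
from typing import Tuple


def parse_version(output: str) -> Tuple[int, ...]:
    """Extract the first dotted version number from a tool's version output."""
    match = re.search(r"(\d+(?:\.\d+)+)", output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(output: str, minimum: Tuple[int, ...]) -> bool:
    """True when the parsed version is at least `minimum` (tuple comparison)."""
    return parse_version(output) >= minimum
```

For example, `meets_minimum("Docker version 24.0.7, build afdd53b", (24, 0))` passes, while a Compose output reporting v2.17.3 fails against `(2, 20)`. For Python, compare `parse_version(...)[:2]` with `(3, 10)`, since the requirement is an exact minor version rather than a minimum.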
Architecture Overview
First, check the overall architecture to understand what we’re building:
Figure 1: Enterprise RAG Offline Data Processing Architecture
See internal article: “RAG Offline: Multi-source Heterogeneous Data Cleaning and Deduplication Strategies” — cleaning, deduplication, and quality gates before ingestion
Step 1: Docker One-Click Startup of Full-Stack Environment
We use Docker Compose to start three storage services simultaneously:
docker-compose.yml
Start Services
Expected output (healthy indicates readiness):
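Besides reading the container status, you can probe readiness from Python before kicking off the pipeline. The sketch below only checks that each TCP port accepts connections, which is a weaker signal than a Docker health check. The ports are the usual defaults for Milvus (19530), MySQL (3306), and MinIO (9000); treat them as assumptions and adjust them to your own docker-compose.yml.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services: dict, attempts: int = 30, delay: float = 2.0) -> dict:
    """Poll each (host, port) until it accepts connections or attempts run out."""
    status = {name: False for name in services}
    for _ in range(attempts):
        for name, (host, port) in services.items():
            if not status[name]:
                status[name] = port_open(host, port)
        if all(status.values()):
            break
        time.sleep(delay)
    return status


# Assumed default ports; change these to match your compose file.
SERVICES = {
    "milvus": ("127.0.0.1", 19530),
    "mysql": ("127.0.0.1", 3306),
    "minio": ("127.0.0.1", 9000),
}
```

Calling `wait_for_services(SERVICES)` returns a dict such as `{"milvus": True, "mysql": True, "minio": False}`, which makes it obvious which container is still starting.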
Step 2: Clone Project and Install Dependencies
Core Dependencies
| Package | Version | Purpose |
|---|---|---|
| llama-index | ≥0.11.0 | RAG framework core |
| pymilvus | ≥2.3.0 | Milvus vector database client |
| paddleocr[doc-parser] | ≥3.3.0 | PDF OCR engine (image-based) |
| PyMuPDF | ≥1.23.0 | Fast PDF text extraction |
| pdfplumber | ≥0.10.0 | Precise PDF table extraction |
| sqlalchemy | ≥2.0.0 | MySQL ORM |
| pymysql | ≥1.1.0 | MySQL driver |
| sentence-transformers | ≥2.7.0 | BGE-M3 Embedding model |
⚠️ Note: The PaddleOCR package is large (~500MB) and installation may take a while. If you don’t need PDF OCR functionality, you can remove paddleocr-related dependencies from requirements.txt.
Step 3: Configure Environment Variables
The project uses Pydantic Settings for configuration, supporting environment variable overrides. Create a .env file:
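To make the override behaviour concrete, here is a standard-library approximation of it. This is not Pydantic Settings itself and not the project's actual configuration class. The four variable names appear elsewhere in this article; the model path default is a placeholder, and `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Variable names come from this article; the path default is a placeholder.
    EMBEDDING_MODEL_PATH: str = "./models/bge-m3"  # hypothetical location
    EMBEDDING_DEVICE: str = "cpu"
    EMBEDDING_BATCH_SIZE: int = 32
    MILVUS_DIMENSION: int = 1024


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings, letting environment variables override the defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for item in fields(Settings):
        if item.name in env:
            # Coerce the raw string to the declared field type (str or int here).
            caster = int if item.type in (int, "int") else str
            overrides[item.name] = caster(env[item.name])
    return Settings(**overrides)
```

Pydantic Settings gives you the same precedence (environment over defaults) plus .env loading and validation, so a sketch like this is only useful for understanding what happens when a variable is set.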
See internal article: “RAG in Production: Deployment and Performance Monitoring Practices” — monitoring and capacity planning after Pipeline goes live
Step 4: Download BGE-M3 Model
BGE-M3 is a multilingual embedding model developed by BAAI, supporting three retrieval modes: Dense, Sparse, and ColBERT.
After download, the model directory structure looks like:
💡 Why BGE-M3?
- Supports 100+ languages (excellent Chinese performance)
- Outputs 1024-dimensional vectors (high precision)
- Maximum context length of 8192 tokens (friendly for long documents)
- Top-tier performance on MTEB multilingual leaderboard
Step 5: Throw a PDF In and See the Full Data Flow
Now let’s run it for real! Prepare a test PDF file:
CLI Parameter Description
| Parameter | Description | Example |
|---|---|---|
| --file | Single file path | /path/to/doc.pdf |
| --dir | Directory path (batch) | /path/to/data/ |
| --ingest | Enable dual-write ingestion mode | (flag) |
| --business-tag | Business tag (data isolation) | "Medical Knowledge Base" |
| --chunk-profile | Chunking strategy | default / source_first / precision |
| --force-vl | Force PDF to use OCR | (flag) |
| --extract-only | Extract only, no ingestion | (flag) |
Sample Run Log Output
From the log, you can see the complete processing flow:
- Extraction Phase: PDF identified as text-based → PyMuPDF extraction → got 3 text + 2 tables + 1 image
- Cleaning Phase: Table validation passed + symbol mapping replacement
- Chunking Phase: Parent-child chunking strategy → 6 documents became 28 chunks (5 parent + 18 child + 5 table)
- Ingestion Phase: BGE-M3 vectorization → 28 vectors written to Milvus + 28 records written to MySQL
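The ingestion phase writes each chunk to two places, so it helps to think about what happens when the second write fails. The sketch below uses in-memory stand-ins instead of real MySQL and Milvus clients. The write order and the compensating delete are assumptions made for illustration and are not taken from the project's code; `InMemoryStore` and `dual_write` are hypothetical names.

```python
from typing import Dict, List


class InMemoryStore:
    """Stand-in for a real backend (MySQL or Milvus), used only for illustration."""

    def __init__(self, fail_on_write: bool = False) -> None:
        self.rows: Dict[str, object] = {}
        self.fail_on_write = fail_on_write

    def write(self, key: str, value: object) -> None:
        if self.fail_on_write:
            raise IOError("simulated backend failure")
        self.rows[key] = value

    def delete(self, key: str) -> None:
        self.rows.pop(key, None)


def dual_write(chunk_id: str, text: str, vector: List[float],
               metadata_store: InMemoryStore, vector_store: InMemoryStore) -> bool:
    """Write metadata first, then the vector; undo the metadata row on failure."""
    metadata_store.write(chunk_id, {"text": text})
    try:
        vector_store.write(chunk_id, vector)
    except Exception:
        # Compensate so the two stores never disagree about which chunks exist.
        metadata_store.delete(chunk_id)
        return False
    return True
```

With matching counts on both sides (28 vectors and 28 records in the sample run), a quick consistency check after ingestion is simply comparing the number of IDs in each store.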
Step 6: Verify Milvus Vector Writes
After successful writing, let’s verify that vector retrieval works correctly:
Expected output:
🔑 The closer the score is to 1, the higher the similarity. Milvus uses COSINE (cosine similarity), range [-1, 1]. For similar texts, it’s typically in the [0.7, 1.0] range.
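To see where that range comes from, here is the cosine formula in plain Python. It is a teaching sketch, not how Milvus computes scores internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1. Because the length is divided out, scaling a vector does not change its score.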
See internal article: “RAG Evaluation: Full-chain Metrics Design and Effectiveness Assessment System” — how to measure whether retrieval and generation are “hallucinating”
Core Code Explanation
Main Entry: main.py
The project entry is very clean, using argparse to provide a CLI interface:
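A minimal sketch of what such an argparse entry point could look like, built only from the flags in the CLI parameter table. The mutually exclusive groups (--file vs --dir, --ingest vs --extract-only) are an assumption for this example, and the real main.py may wire things differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the CLI flags described in this article."""
    parser = argparse.ArgumentParser(description="RAG data processing pipeline")
    # Assumption: exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", help="Single file path")
    source.add_argument("--dir", help="Directory path (batch)")
    # Assumption: ingesting and extract-only are treated as alternatives.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--ingest", action="store_true", help="Enable dual-write ingestion")
    mode.add_argument("--extract-only", action="store_true", help="Extract only, no ingestion")
    parser.add_argument("--business-tag", default=None, help="Business tag (data isolation)")
    parser.add_argument("--chunk-profile", default="default",
                        choices=["default", "source_first", "precision"],
                        help="Chunking strategy")
    parser.add_argument("--force-vl", action="store_true", help="Force PDF to use OCR")
    return parser
```

Declaring `choices` for --chunk-profile means a typo such as `--chunk-profile precise` fails at parse time instead of halfway through a long run.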
Pipeline Core Flow
LlamaRAGPipeline is the overall orchestrator of data processing. Its internal flow:
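The orchestration idea can be sketched as a list of stages run in order. This is not the LlamaRAGPipeline source: `SimplePipeline`, `Document`, `clean`, and `chunk` are hypothetical stand-ins, and extraction and ingestion are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Document:
    text: str
    doc_type: str = "text"  # e.g. "text", "table", "image"
    metadata: dict = field(default_factory=dict)


Stage = Callable[[List[Document]], List[Document]]


class SimplePipeline:
    """Runs stages in order and records how many documents leave each one."""

    def __init__(self, stages: List[Tuple[str, Stage]]) -> None:
        self.stages = stages
        self.log: List[str] = []

    def run(self, docs: List[Document]) -> List[Document]:
        for name, stage in self.stages:
            docs = stage(docs)
            self.log.append(f"{name}: {len(docs)} documents")
        return docs


def clean(docs: List[Document]) -> List[Document]:
    """Drop empty documents and trim surrounding whitespace."""
    return [Document(d.text.strip(), d.doc_type, d.metadata)
            for d in docs if d.text.strip()]


def chunk(docs: List[Document], size: int = 20) -> List[Document]:
    """Split each document into fixed-size pieces (placeholder for a real chunker)."""
    out: List[Document] = []
    for d in docs:
        out.extend(Document(d.text[i:i + size], d.doc_type, d.metadata)
                   for i in range(0, len(d.text), size))
    return out
```

Keeping every stage to the same signature (a list of documents in, a list of documents out) is what lets you reorder or swap stages without touching the orchestrator.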
Supported File Formats
The project includes extractors for 9 formats, covering the majority of enterprise file types:
| Format | Extractor | Special Handling |
|---|---|---|
| PDF | PdfExtractor | Automatically detect text/image type, dual engine switching |
| Word (.docx/.doc) | WordExtractor | Style preservation, heading hierarchy recognition |
| PPT (.pptx/.ppt) | PptExtractor | Slide-by-slide extraction |
| Excel (.xlsx/.xls/.csv) | ExcelExtractor | Multiple sheet support |
| HTML | HtmlExtractor | Content extraction, tag filtering |
| TXT | TxtExtractor | Encoding auto-detection |
| ZIP/RAR | ArchiveExtractor | Recursive extraction after decompression |
| Image | ImageExtractor | MLX-VLM multimodal understanding |
| Email | EmailExtractor | Header analysis + body extraction |
Each extractor’s output is unified into a standard Document object, so subsequent Transforms and ChunkRouter don’t need to care about the source format.
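One way to picture that unification is a registry keyed by file extension. The class below borrows the ExtractorDispatcher name mentioned in the FAQ, but its methods and the dict it returns are invented for this sketch; the project's real interface may look different.

```python
from pathlib import Path
from typing import Callable, Dict


class ExtractorDispatcher:
    """Maps file extensions to extractor callables (hypothetical interface)."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[Path], dict]] = {}

    def register(self, *extensions: str):
        """Decorator that registers an extractor for one or more extensions."""
        def decorator(func):
            for ext in extensions:
                self._registry[ext.lower()] = func
            return func
        return decorator

    def extract(self, path: str) -> dict:
        suffix = Path(path).suffix.lower()
        if suffix not in self._registry:
            raise ValueError(f"unsupported format: {suffix or path}")
        return self._registry[suffix](Path(path))


dispatcher = ExtractorDispatcher()


@dispatcher.register(".txt", ".md")
def extract_plain_text(path: Path) -> dict:
    # Every extractor returns the same shape, so later stages ignore the source format.
    return {"text": path.read_text(encoding="utf-8"), "source": path.name, "format": "text"}
```

Adding a format then means writing one function and decorating it; nothing downstream changes.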
Troubleshooting and Performance Tuning
Issue 1: CUDA Out of Memory (GPU VRAM insufficient)
Symptom: The run fails with RuntimeError: CUDA out of memory
Solution:
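Lowering EMBEDDING_BATCH_SIZE by hand is the simplest fix. If you want the code to adapt on its own, one possible approach is to halve the batch whenever an out-of-memory error appears. This is a sketch under assumptions: `encode_with_backoff` is a hypothetical helper, and `encode` stands for whatever function turns a batch of texts into vectors in your setup.

```python
from typing import Callable, List, Sequence


def encode_with_backoff(
    texts: Sequence[str],
    encode: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 32,
    min_batch_size: int = 1,
) -> List[List[float]]:
    """Encode in batches, halving the batch size whenever CUDA runs out of memory."""
    vectors: List[List[float]] = []
    start = 0
    while start < len(texts):
        batch = texts[start:start + batch_size]
        try:
            vectors.extend(encode(batch))
        except RuntimeError as exc:
            # Re-raise anything that is not an OOM, or if we cannot shrink further.
            if "out of memory" not in str(exc).lower() or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return vectors
```

With a real PyTorch model you would usually also release cached memory before retrying, and if even a batch of one fails, switching EMBEDDING_DEVICE to cpu is the remaining option.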
Issue 2: Milvus Connection Failed
Symptom: ConnectionError: Failed to connect to Milvus
Troubleshooting steps:
Issue 3: PaddleOCR Installation Fails or Loads Slowly
Symptom: pip install paddleocr fails, or the first model load takes more than 5 minutes
Solution:
Issue 4: MySQL Character Encoding Garbled
Symptom: Chinese characters display as ???? after storing in MySQL
Solution: Ensure the following three points:
- MySQL container startup command includes --character-set-server=utf8mb4
- SQLAlchemy connection URL includes ?charset=utf8mb4
- Database and table both use the utf8mb4 charset
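For the second point, the connection URL can be assembled like this. The helper name `build_mysql_url` and the credentials in the usage note are placeholders; the `mysql+pymysql://` scheme and the `charset` query parameter are the parts that matter.

```python
from urllib.parse import quote_plus


def build_mysql_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a SQLAlchemy URL for the PyMySQL driver with utf8mb4 enabled."""
    # quote_plus protects special characters such as "@" in the password.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?charset=utf8mb4"
    )
```

For example, `build_mysql_url("rag", "p@ss", "localhost", 3306, "rag_db")` yields a URL whose password is escaped as `p%40ss`, which you can pass straight to SQLAlchemy's `create_engine`.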
Performance Tuning Reference Table
| Parameter | Default | Lower for low VRAM | Higher for high VRAM | Impact |
|---|---|---|---|---|
| EMBEDDING_BATCH_SIZE | 32 | 8 | 64 | Batch vectorization speed |
| PARENT_CHUNK_SIZE | 1024 | 512 | 2048 | Parent chunk size |
| CHILD_CHUNK_SIZE | 128 | 64 | 256 | Child chunk granularity |
| MILVUS HNSW.M | 16 | 8 | 32 | Index recall vs memory |
| MILVUS HNSW.efConstruction | 200 | 100 | 500 | Index build speed vs quality |
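The two HNSW rows map onto the index parameters you pass when creating a Milvus index. The sketch below builds that dict from the three columns of the table. The profile names are made up for this example, and the COSINE metric follows the verification step earlier in the article.

```python
# Values copied from the tuning table; the profile names are hypothetical.
HNSW_PROFILES = {
    "low_memory": {"M": 8, "efConstruction": 100},
    "default": {"M": 16, "efConstruction": 200},
    "high_recall": {"M": 32, "efConstruction": 500},
}


def hnsw_index_params(profile: str = "default", metric_type: str = "COSINE") -> dict:
    """Return an index-parameter dict in the shape pymilvus accepts for create_index."""
    if profile not in HNSW_PROFILES:
        raise KeyError(f"unknown profile: {profile}")
    return {
        "index_type": "HNSW",
        "metric_type": metric_type,
        "params": dict(HNSW_PROFILES[profile]),
    }
```

A larger M means more graph links per vector (better recall, more memory), while a larger efConstruction means a slower but higher-quality index build. Changing either one requires rebuilding the index.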
FAQ
Q: Can I run without a GPU?
A: Yes. This project fully supports CPU mode for BGE-M3 inference. Simply set EMBEDDING_DEVICE to cpu in your .env file. CPU mode is about 1/10 to 1/20 the speed of GPU, but sufficient for small amounts of data. If you have a large dataset (tens of thousands of documents), consider renting a cloud GPU (e.g., AutoDL, Alibaba Cloud PAI).
Q: What file formats are supported?
A: Currently supports 9 formats: PDF, Word (.docx/.doc), PPT (.pptx/.ppt), Excel (.xlsx/.xls/.csv), HTML, TXT, ZIP/RAR archives, images, emails. If you need additional formats (e.g., Markdown, ePub), simply add a new Extractor class and register it with the ExtractorDispatcher.
Q: How do I switch the Embedding model?
A: Modify EMBEDDING_MODEL_PATH in the .env file to point to the new model directory. The project is built on the Sentence-Transformers interface, so it works with locally loadable models in that format (e.g., e5-mistral-7b-instruct). API-only models such as OpenAI’s text-embedding-3-small cannot be loaded from a model path and would need a separate API client. Note that the new model’s output dimension must match the MILVUS_DIMENSION configuration, and existing vectors must be re-embedded after the switch.
Q: What are the differences between the three --chunk-profile options?
A:
- default: Parent-child chunking. Balances retrieval accuracy and context completeness; suitable for most scenarios.
- source_first: No chunking, source tracing. Keeps original document integrity; suitable for legal/contract/financial reporting scenarios.
- precision: Precise fine-grained. Finer chunking granularity; suitable for FAQ/customer service Q&A scenarios.
For detailed comparison, read the 3rd article in this series, “How to Chunk for RAG Without Losing Context?”
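To give a feel for the default profile, here is a deliberately naive parent-child splitter. It cuts on fixed character windows, whereas a production chunker would more likely respect sentence or token boundaries. The sizes reuse the PARENT_CHUNK_SIZE and CHILD_CHUNK_SIZE defaults from the tuning table, and treating them as character counts is an assumption. `Chunk`, `split_fixed`, and `parent_child_chunks` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str]  # None for parent chunks


def split_fixed(text: str, size: int) -> List[str]:
    """Cut text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def parent_child_chunks(text: str, parent_size: int = 1024,
                        child_size: int = 128) -> List[Chunk]:
    """Emit each parent chunk followed by the child chunks cut from it."""
    chunks: List[Chunk] = []
    for p_idx, parent_text in enumerate(split_fixed(text, parent_size)):
        parent_id = f"p{p_idx}"
        chunks.append(Chunk(parent_id, parent_text, None))
        for c_idx, child_text in enumerate(split_fixed(parent_text, child_size)):
            chunks.append(Chunk(f"{parent_id}-c{c_idx}", child_text, parent_id))
    return chunks
```

The idea is that small child chunks are what you search against, and `parent_id` lets you hand the larger parent chunk to the LLM so the answer has enough surrounding context.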
Q: When should I use --force-vl?
A: When your PDF is a scanned document or image-based PDF (text embedded in images), regular text extractors cannot read the content. In this case, add --force-vl to force PaddleOCR-VL visual recognition. The cost is 5-10x slower speed and higher memory usage.
Q: What is the relationship between BGE-M3 and the MTEB leaderboard?
A: MTEB (Massive Text Embedding Benchmark) is one of the most widely used benchmarks for evaluating embedding models. BGE-M3 ranks among the top on the MTEB multilingual leaderboard and does especially well on Chinese tasks. Choosing a top-ranked MTEB model gives you a good starting point, but vertical domains usually still call for fine-tuning, which is detailed in the 4th article of this series.
Q: What if RAG performance is poor without fine-tuning?
A: Generic models do underperform in vertical domains. Try these in priority order:
- First, optimize the chunking strategy (switch to parent-child chunking)
- Experiment with different top-k and similarity threshold values
- Add keyword retrieval (BM25) for multi-path recall
- Finally, consider domain fine-tuning (see the 4th tutorial)
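For the multi-path recall step, a common way to merge a vector ranking with a BM25 ranking is Reciprocal Rank Fusion, the technique covered in Chapter 7 of this topic. Below is a minimal sketch. The constant k=60 is the value commonly used in the literature, not a setting taken from this project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

RRF only looks at ranks, so you never have to reconcile cosine scores with BM25 scores. A document that both retrievers place near the top beats one that only a single retriever ranks first.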
Q: How do I handle batch import of large numbers of files?
A: Use the --dir parameter to specify a directory for batch processing:
The Pipeline will automatically traverse the directory and process every file in a supported format, one by one.
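The traversal itself is simple to picture. The sketch below walks a directory and keeps only files with a supported extension. The extension set is taken from the format list in this article (image and email extensions are left out because the article does not name them), and recursing into subdirectories is an assumption. `iter_supported_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterator

# Extensions named in this article's format list; extend as needed.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt",
    ".xlsx", ".xls", ".csv", ".html", ".txt", ".zip", ".rar",
}


def iter_supported_files(root: str) -> Iterator[Path]:
    """Yield supported files under root (recursively) in sorted order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            yield path
```

Sorting the paths makes batch runs reproducible, which helps when a run is interrupted and you want to resume from a known file.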
Resource Download and Interaction
Final Words
What problems have you encountered while building your RAG system? Feel free to leave a comment, and I’ll respond to each one.
Common problem areas:
- What is your business scenario? (Medical/Legal/Finance/Education/Customer Service?)
- What tech stack are you currently using?
- Which part is the biggest headache? (Inaccurate extraction? Unreasonable chunking? Irrelevant retrieval?)
Next Article Preview: “Always Losing Tables in PDF Extraction? Practical Hybrid Solution with PyMuPDF + PaddleOCR-VL” — We’ll dive deep into the dual-engine architecture of PDF extraction and the MLX-VLM acceleration solution exclusive to Apple Silicon users. Stay tuned!
Topic Navigation and Internal Extensions
This article is part of the Enterprise RAG Data Pipeline Hands-on Topic (8 engineering practice articles, to be read alongside the RAG Full-Chain Theory Series).
Articles in This Topic
| Chapter | Title |
|---|---|
| Chapter 1 | (This article) |
| Chapter 2 | Always Losing Tables in PDF Extraction? Practical Hybrid Solution with PyMuPDF + PaddleOCR-VL (Including MLX Acceleration) |
| Chapter 3 | How to Chunk for RAG Without Losing Context? 5 Strategies from Beginner to Production Grade (with Decision Tree) |
| Chapter 4 | BGE-M3 Local Fine-Tuning: From Scratch to Production Deployment (with Full Code) |
| Chapter 5 | Milvus Production Environment Collection Design + HNSW Tuning Practical Guide |
| Chapter 6 | Table 4-Level Vectorization: Let RAG Systems Truly Understand Structured Data |
| Chapter 7 | RRF Multi-Fusion Ranking: The Secret Weapon to Improve RAG Retrieval Accuracy by 30%+ |
| Chapter 8 | MySQL+Milvus+MinIO Triple Storage Dual-Write Architecture: Building an Enterprise RAG Data Foundation |
Internal Theoretical Extensions
The following articles are from the RAG Full-Chain Theory Series, helping you understand the concepts and methodologies this topic relies on:
- “RAG Offline: Multi-source Heterogeneous Data Cleaning and Deduplication Strategies” — cleaning, deduplication, and quality gates before ingestion
- “RAG in Production: Deployment and Performance Monitoring Practices” — monitoring and capacity planning after Pipeline goes live
- “RAG Evaluation: Full-chain Metrics Design and Effectiveness Assessment System” — how to measure whether retrieval and generation are “hallucinating”