Pain point: Should chunk_size be 500 or 1000? Why are results still poor after tuning it N times?
Chunking is the most underestimated step in a RAG system.
Many people think chunking is just “cutting by character count,” then spend a lot of time tuning the embedding model, switching vector databases, and optimizing prompts, while ignoring that chunk quality sets the ceiling on retrieval quality.
A Real Failure Case
Suppose you have a technical document about the “Transformer Attention Mechanism” and split it at a fixed size of 500 characters:
| Chunk | Content | Issue |
|---|---|---|
| Chunk 1 | “Transformer is a deep learning architecture based on…” | ✅ Complete introductory section |
| Chunk 2 | “…The calculation process of the attention mechanism is as follows:\n1. Query, Key, Value…” | ⚠️ Formula truncated |
| Chunk 3 | “…√d_k)V\n3. Parallel computation of multi-head attention\n4. Residual connection…” | ❌ Formula cut in the middle, semantic break |
| Chunk 4 | “…In practical applications, the following issues need attention…” | ❌ Missing preceding context |
When a user asks, “How do I solve vanishing gradients in a Transformer?”:
- Chunk 4 contains the answer but lacks the “Transformer” context
- Vector retrieval may instead match irrelevant content about vanishing gradients from other documents
- The LLM ends up with a fragmented, incomplete context
Core trade-off: smaller chunks give more precise retrieval but incomplete context; larger chunks give complete context but lower retrieval precision. Parent-child chunking is designed to resolve this tension by retrieving on small chunks and returning large ones.
Panoramic Comparison of 5 Strategies
Figure 1: Applicable scenarios for five chunking strategies
Quantitative Comparison of Key Metrics
| Strategy | Storage Expansion | Implementation Difficulty | Recall | Context Preservation |
|---|---|---|---|---|
| No Chunking | 1x | ⭐ | 40% | 100% |
| Fixed Size | 2~3x | ⭐⭐ | 72% | 45% |
| Parent-Child | ~2.5x | ⭐⭐⭐⭐ | 91% | 98% |
| Semantic | 2~4x | ⭐⭐⭐ | 78% | 85% |
| Table 4-Level | 5~8x | ⭐⭐⭐⭐⭐ | 96%* | N/A* |
* Evaluated on table data only
Strategy 1: NoChunkStrategy — No Chunking
The simplest strategy: keep the original text as is, ingest as a single chunk.
When to use? When your business requirement is “find this document” rather than “find this paragraph.” For example, a lawyer searching a contract database needs the complete contract file.
See also within the site: 《RAG Offline Part: Metadata Enhancement and Knowledge Graph Fusion Preprocessing》 — Relationship between chunk boundaries, document structure, and metadata
Strategy 2: FixedSizeStrategy — Fixed Size Chunking
The most commonly used beginner strategy. The key design is boundary detection — not a hard cut, but a look-back to find the nearest sentence boundary.
Boundary Detection Priority Logic
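The look-back behaviour can be sketched as follows. This is a minimal illustration rather than the project's code: the names `find_boundary` and `fixed_size_chunks`, the `lookback` window, and the priority order (paragraph break, line break, sentence end, clause break, whitespace, then hard cut) are all assumptions.

```python
from typing import List

# Assumed priority order, highest first. The project's actual order is not shown.
BOUNDARY_PRIORITY = [
    ("\n\n",),                                # 1. paragraph break
    ("\n",),                                  # 2. line break
    ("。", "！", "？", ". ", "! ", "? "),      # 3. sentence end
    ("；", "，", "; ", ", "),                  # 4. clause break
    (" ",),                                   # 5. whitespace
]


def find_boundary(text: str, start: int, hard_end: int, lookback: int = 100) -> int:
    """Return a cut position in (start, hard_end], preferring higher-priority markers."""
    if hard_end >= len(text):
        return len(text)
    window_start = max(start + 1, hard_end - lookback)
    for markers in BOUNDARY_PRIORITY:
        best = -1
        for marker in markers:
            # rfind looks backwards, so we get the marker closest to hard_end.
            pos = text.rfind(marker, window_start, hard_end)
            if pos != -1:
                best = max(best, pos + len(marker))
        if best != -1:
            return best
    return hard_end  # nothing found in the window: fall back to a hard cut


def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50,
                      lookback: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters, cut at boundaries."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        hard_end = min(start + chunk_size, len(text))
        end = find_boundary(text, start, hard_end, lookback)
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # always move forward
    return chunks
```

Because a higher-priority marker wins even when a lower-priority one sits closer to the hard limit, a small `lookback` keeps chunks near the target size while a large one favours cleaner boundaries.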
Strategy 3: ParentChildStrategy ⭐ — Parent-Child Chunking (Core Focus)
This is the core content of this article and the default strategy used in the project.
Intuitive Understanding
Full Source Code Walkthrough
Why is “Retrieve Child Chunk, Return Parent Chunk” the Best Practice?
| Dimension | Parent Only | Child Only | Parent-Child |
|---|---|---|---|
| Retrieval Precision | Low (too general) | High (precise) | High (child chunk retrieval) |
| Context Completeness | High (complete) | Low (fragmented) | High (returns parent chunk) |
| Vector Quality | Information diluted | Semantically focused | Each plays its role |
| Storage Overhead | 1x | ~8x | ~2.5x |
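To make the parent-child layout concrete, here is a minimal sketch of the ingest side. It is not the project's source: the `Chunk` dataclass, the helper names, the ID scheme, and the parent size of 512 are hypothetical. The child size (128), child overlap (20), and `min_chunk_size` (50) are taken from the values mentioned in the Pitfall Guide.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a parent chunk


def sliding_windows(text: str, size: int, overlap: int = 0) -> List[str]:
    """Cut text into windows of `size` characters, each sharing `overlap` with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows: List[str] = []
    for i in range(0, len(text), step):
        windows.append(text[i:i + size])
        if i + size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


def build_parent_child(text: str, parent_size: int = 512, child_size: int = 128,
                       child_overlap: int = 20,
                       min_chunk_size: int = 50) -> Tuple[List[Chunk], List[Chunk]]:
    """Return (parents, children); only children would be embedded and indexed."""
    parents: List[Chunk] = []
    children: List[Chunk] = []
    for p_idx, p_text in enumerate(sliding_windows(text, parent_size)):
        parent = Chunk(chunk_id=f"p{p_idx}", text=p_text)
        parents.append(parent)
        pieces = [c for c in sliding_windows(p_text, child_size, child_overlap)
                  if len(c.strip()) >= min_chunk_size]
        if not pieces:
            pieces = [p_text]  # very short parent: keep it as its own single child
        for c_idx, c_text in enumerate(pieces):
            children.append(Chunk(f"{parent.chunk_id}-c{c_idx}", c_text, parent.chunk_id))
    return parents, children
```

The sketch cuts at fixed offsets for brevity; a real implementation would likely combine it with boundary detection so that neither parents nor children end mid-sentence.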
Parent-Child Chunking Principle Diagram
Figure 2: Retrieve by child chunk, then follow parent_id back to the parent chunk before passing it to the LLM
Strategy 4: SemanticStrategy — Semantic Chunking
Segments the document based on natural paragraphs and heading hierarchy, rather than a fixed character count.
Suitable for: textbooks, API documentation, technical manuals, and other documents with clear heading hierarchies.
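As an illustration of heading-aware splitting, the sketch below assumes the document is Markdown with `#`-style headings and blank-line-separated paragraphs. The function name `semantic_chunks`, the `max_chars` limit, and the `heading_path` field are hypothetical and not taken from the project.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def semantic_chunks(markdown: str, max_chars: int = 800) -> List[Dict[str, str]]:
    """Group paragraphs under their heading path; never merge across headings."""
    chunks: List[Dict[str, str]] = []
    path: List[str] = []  # current heading stack, e.g. ["Transformer", "Attention"]
    buf: List[str] = []   # paragraphs collected for the current chunk

    def flush() -> None:
        if buf:
            chunks.append({"heading_path": " > ".join(path), "text": "\n\n".join(buf)})
            buf.clear()

    for block in re.split(r"\n\s*\n", markdown.strip()):
        lines = block.strip().split("\n")
        match = HEADING.match(lines[0])
        if match:
            flush()                       # a heading always starts a new chunk
            level = len(match.group(1))
            del path[level - 1:]          # drop headings at the same or deeper level
            path.append(match.group(2).strip())
            lines = lines[1:]
        body = "\n".join(lines).strip()
        if not body:
            continue
        if buf and sum(len(p) for p in buf) + len(body) > max_chars:
            flush()                       # section too long: split between paragraphs
        buf.append(body)
    flush()
    return chunks
```

Storing the heading path with each chunk keeps some document context attached to the text, which addresses the “missing preceding context” problem from the failure case.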
See also within the site: 《RAG Online Part: Retrieval Optimization — HyDE and Query Expansion Techniques》 — Coordination between chunk granularity, query rewriting, and expansion
Strategy 5: Table4LevelStrategy — Table 4-Level Vectorization
📌 This strategy will be detailed in Article 6: “RAG Always Ignores Table Data? 4-Level Granularity Vectorization Solution”.
ChunkRouter Automatic Router
With 5 strategies, who decides which document uses which? The answer is ChunkRouter.
Figure 3: ChunkRouter routing and Profile decision tree
Three Preset Profile Configurations
default (Recommended)
source_first (Traceability First)
All use NO_CHUNK, suitable for legal/contract/financial report scenarios.
precision (Precision Mode)
Text uses a smaller FixedSize(256), suitable for FAQ/customer service short Q&A.
Usage: --chunk-profile default
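The three profiles can be pictured as a lookup from content type to strategy. The sketch below is an assumption-laden illustration, not the real ChunkRouter: the `Strategy` enum, the `PROFILES` table, the `route` and `parse_profile` functions, the fallback to no chunking for unknown content types, and the table/image entries of the `precision` profile are all hypothetical. Only the text/table/image mapping of `default`, the all-NO_CHUNK `source_first`, the FixedSize(256) text setting, and the `--chunk-profile` flag come from the article.

```python
import argparse
from enum import Enum
from typing import Any, Dict, List, Tuple


class Strategy(Enum):
    NO_CHUNK = "no_chunk"
    FIXED_SIZE = "fixed_size"
    PARENT_CHILD = "parent_child"
    SEMANTIC = "semantic"
    TABLE_4_LEVEL = "table_4_level"


Route = Tuple[Strategy, Dict[str, Any]]

PROFILES: Dict[str, Dict[str, Route]] = {
    "default": {
        "text": (Strategy.PARENT_CHILD, {}),
        "table": (Strategy.TABLE_4_LEVEL, {}),
        "image": (Strategy.NO_CHUNK, {}),
    },
    "source_first": {
        "text": (Strategy.NO_CHUNK, {}),
        "table": (Strategy.NO_CHUNK, {}),
        "image": (Strategy.NO_CHUNK, {}),
    },
    "precision": {
        "text": (Strategy.FIXED_SIZE, {"chunk_size": 256}),
        # Assumption: tables and images behave as in the default profile.
        "table": (Strategy.TABLE_4_LEVEL, {}),
        "image": (Strategy.NO_CHUNK, {}),
    },
}


def route(content_type: str, profile: str = "default") -> Route:
    """Pick a strategy and its parameters for one document."""
    table = PROFILES[profile]  # raises KeyError for an unknown profile
    # Assumption: unknown content types are ingested whole.
    return table.get(content_type, (Strategy.NO_CHUNK, {}))


def parse_profile(argv: List[str]) -> str:
    """Read the --chunk-profile flag from a list of CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--chunk-profile", choices=sorted(PROFILES), default="default")
    return parser.parse_args(argv).chunk_profile
```

Keeping the profiles as plain data means a new profile can be added without touching the routing logic.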
Pitfall Guide
Pitfall 1: Overlap too large causing excessive repetition
Symptom: The LLM receives context containing a lot of repeated text
Solution: Set CHILD_CHUNK_OVERLAP to roughly 15% of child_size (about 20 for 128 characters), and do not exceed 30%
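A small guard can enforce this rule at configuration time. The helper below is a hypothetical sketch (the name `check_overlap` and the `max_ratio` parameter are not from the project). Note that 15% of 128 is 19.2, which the recommendation rounds to 20, and that 30% of 128 is 38.4.

```python
def check_overlap(child_size: int, overlap: int, max_ratio: float = 0.30) -> float:
    """Return overlap / child_size, raising if it exceeds max_ratio."""
    if child_size <= 0 or overlap < 0:
        raise ValueError("child_size must be positive and overlap non-negative")
    ratio = overlap / child_size
    if ratio > max_ratio:
        raise ValueError(
            f"overlap {overlap} is {ratio:.0%} of child_size {child_size}; "
            f"keep it at or below {max_ratio:.0%}"
        )
    return ratio
```

Calling `check_overlap(128, 20)` passes, while an overlap of 40 on a 128-character child is rejected.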
Pitfall 2: Child chunks too small causing noise surge
Symptom: 128-character child chunks include many meaningless fragments
Solution: Set a reasonable min_chunk_size (default 50) and filter out fragments shorter than it
Pitfall 3: Chinese sentence segmentation cuts at wrong positions
Symptom: A Chinese word is split in the middle (e.g., “注意力机” + “制”)
Solution: Add Chinese punctuation and word-boundary detection to _find_boundary, or use jieba word segmentation as an additional check
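One way to approach this is sketched below. It is not the project's `_find_boundary`: the function name `snap_boundary`, the punctuation sets, and the `word_ends` parameter are assumptions. The function first looks back for Chinese punctuation and, failing that, falls back to the nearest word end supplied by a segmenter. The word-end offsets could be collected from the end positions in the `(word, start, end)` tuples that `jieba.tokenize` yields, but the sketch takes them as a plain list so that it runs without any third-party dependency.

```python
from typing import Iterable, Optional

CJK_SENTENCE_END = "。！？；"
CJK_CLAUSE_BREAK = "，、："


def snap_boundary(text: str, hard_end: int, lookback: int = 30,
                  word_ends: Optional[Iterable[int]] = None) -> int:
    """Move a cut at hard_end back to punctuation or, failing that, to a word end."""
    if hard_end >= len(text):
        return len(text)
    lo = max(0, hard_end - lookback)
    # Prefer sentence-ending punctuation, then clause-level punctuation.
    for marks in (CJK_SENTENCE_END, CJK_CLAUSE_BREAK):
        pos = max(text.rfind(ch, lo, hard_end) for ch in marks)
        if pos != -1:
            return pos + 1  # cut just after the punctuation mark
    # No punctuation in the window: fall back to the closest word end, if known.
    if word_ends is not None:
        candidates = [e for e in word_ends if lo < e <= hard_end]
        if candidates:
            return max(candidates)
    return hard_end
```

With the text “我们使用注意力机制” and word ends at offsets 2, 4, and 9, a hard cut at offset 8 (which would split “注意力机” from “制”) is moved back to offset 4, so the whole term stays in the next chunk.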
FAQ
Q: Storage increases by 2.5x, is it worth it?
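A: In most production settings, yes. In the comparison table, recall improves from 72% (fixed size) to 91% (parent-child chunking), a gain of 19 percentage points: out of every 100 relevant results that should be found, roughly 19 more are actually retrieved. For a production system, this improvement usually outweighs the storage cost.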
Q: When should I use source_first (no chunking)?
A: When your downstream requirement is “locate the original source” rather than “extract answer fragments.” Typical scenarios: legal compliance review, audit trails, citation traceability.
Q: How is parent_id used at retrieval time with parent-child chunking?
A: After retrieving a child chunk, query Milvus or MySQL by its parent_id field to fetch the corresponding parent chunk. Then put the parent chunk (not the child chunk) into the LLM’s context. For the concrete implementation, see the layered fusion approach in Article 7: “RRF Multi-Channel Fusion Ranking”.
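The lookup step can be sketched independently of any particular store. In the example below, a plain dictionary stands in for the Milvus or MySQL query, and the function name `expand_to_parents`, the hit format, and the assumption that a higher score means a better match are all hypothetical.

```python
from typing import Dict, List, Mapping


def expand_to_parents(child_hits: List[Dict[str, object]],
                      parent_store: Mapping[str, str],
                      top_k: int = 3) -> List[str]:
    """Map child hits to unique parent texts, best-scoring parent first."""
    # Assumption: each hit has "parent_id" and "score", and higher score is better.
    ranked = sorted(child_hits, key=lambda h: h["score"], reverse=True)
    seen = set()
    contexts: List[str] = []
    for hit in ranked:
        parent_id = hit["parent_id"]
        if parent_id in seen:
            continue  # several children of one parent matched: include it once
        seen.add(parent_id)
        parent_text = parent_store.get(parent_id)  # stand-in for a DB lookup by parent_id
        if parent_text is not None:
            contexts.append(parent_text)
        if len(contexts) == top_k:
            break
    return contexts
```

Deduplicating by parent_id matters here: neighbouring children often match the same query, and without it the same parent would be sent to the LLM several times.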
Q: Can the 5 strategies be mixed in the same project?
A: Yes! That’s the original design intent of ChunkRouter. Documents with different content_type are automatically routed to different strategies — text goes to parent-child chunking, tables go to 4-level vectorization, images are not chunked.
Q: How do I validate chunking effectiveness?
A: Recommended methods:
- Visual inspection: Randomly sample 10 chunks, manually judge if boundaries are reasonable
- Retrieval test: Construct 20 queries covering different topics, check if top-5 results hit the correct chunks
- End-to-end evaluation: Use LLM to score generated answers (faithfulness, completeness)
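The retrieval test in the list above can be automated with a simple hit-rate calculation. This is a minimal sketch under stated assumptions: the function name `hit_rate_at_k`, the shape of the labelled queries, and the `retrieve` callable (standing in for your actual retriever) are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple


def hit_rate_at_k(labelled_queries: Sequence[Tuple[str, Set[str]]],
                  retrieve: Callable[[str, int], List[str]],
                  k: int = 5) -> float:
    """Fraction of queries whose top-k results contain at least one expected chunk ID."""
    if not labelled_queries:
        raise ValueError("need at least one labelled query")
    hits = 0
    for query, expected_ids in labelled_queries:
        returned = set(retrieve(query, k)[:k])
        if expected_ids & returned:
            hits += 1
    return hits / len(labelled_queries)
```

Running the same labelled query set before and after a chunking change gives a quick, repeatable comparison, which is more reliable than judging a handful of answers by eye.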
Topic Navigation and In-Site Extensions
This article belongs to the Enterprise-Level RAG Data Pipeline Practical Guide Series (8 engineering practice articles, to be read alongside the RAG Practical Full-Chain Theory Series).
Articles in This Series
| Article | Title |
|---|---|
| Article 1 | Say Goodbye to Retrieval Hallucinations! Build an Enterprise-Level RAG Data Pipeline Step-by-Step (with Docker One-Click Deployment) |
| Article 2 | PDF Extraction Always Loses Tables? Practical PyMuPDF + PaddleOCR-VL Hybrid Solution (with MLX Acceleration) |
| Article 3 | |
| Article 4 | BGE-M3 Local Fine-Tuning Practice: From Scratch to Production Deployment (with Complete Code) |
| Article 5 | Practical Guide to Milvus Production Collection Design + HNSW Tuning |
| Article 6 | Table 4-Level Vectorization Scheme: Let RAG Systems Truly Understand Structured Data |
| Article 7 | RRF Multi-Channel Fusion Ranking: The Secret Weapon to Boost RAG Retrieval Accuracy by 30%+ |
| Article 8 | MySQL+Milvus+MinIO Triple Store Dual-Write Architecture: Building an Enterprise-Level RAG Data Foundation |