Pain point: Should chunk_size be 500 or 1000? Why are the results still poor after countless rounds of tuning?

Chunking is the most underestimated step in a RAG system.

Many people think chunking is just “cutting by character count,” then spend a lot of time tuning the Embedding model, switching vector databases, and optimizing Prompts—while ignoring that chunk quality is the fundamental factor determining the retrieval ceiling.

A Real Failure Case

You have a technical document about the “Transformer Attention Mechanism”:

[Original Text - about 2000 characters]
Transformer is a deep learning architecture based on the self-attention mechanism...
The calculation process of the attention mechanism is as follows:
1. Generation of Query, Key, Value matrices
2. The formula for scaled dot-product attention: Attention(Q,K,V) = softmax(QK^T / √d_k)V
3. Parallel computation of multi-head attention
4. Residual connection and layer normalization...

In practical applications, the following issues need attention:
- Gradient vanishing can be mitigated by residual connections
- Position encoding is crucial for modeling sequence order
- A warmup strategy should be used during training...

If you use a fixed size of 500 characters:

| Chunk | Content | Issue |
|---|---|---|
| Chunk 1 | “Transformer is a deep learning architecture based on…” | ✅ Complete introductory section |
| Chunk 2 | “…The calculation process of the attention mechanism is as follows:\n1. Query, Key, Value…” | ⚠️ Formula truncated |
| Chunk 3 | “…√d_k)V\n3. Parallel computation of multi-head attention\n4. Residual connection…” | ❌ Formula cut in the middle, semantic break |
| Chunk 4 | “…In practical applications, the following issues need attention…” | ❌ Missing preceding context |

When a user asks, “How to solve gradient vanishing in Transformer?”:

  • Chunk 4 contains the answer but lacks the “Transformer” context
  • Vector retrieval might match irrelevant content mentioning “gradient vanishing” from other documents
  • The final LLM receives a fragmented, incomplete Context
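
To make the failure concrete, here is a minimal sketch of a naive fixed-size cut. The document text, the `naive_chunks` helper, and the exact offsets are made up for illustration; the only point is that a cut placed purely by character count can land inside the formula, so no single chunk contains it.

```python
FORMULA = "Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V"

def naive_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring sentence and formula boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical document: 480 characters of introduction, the formula, then more prose.
intro = ("Transformer is a deep learning architecture. " * 11)[:480]
outro = " In practice, residual connections mitigate gradient vanishing."
text = intro + FORMULA + outro

chunks = naive_chunks(text, 500)

# The formula starts at offset 480 and is longer than 20 characters,
# so the cut at offset 500 falls inside it.
formula_intact = any(FORMULA in chunk for chunk in chunks)
print(len(chunks), formula_intact)
```

Neither chunk can answer a question about the full formula on its own, which is the same situation as Chunk 2 and Chunk 3 in the example above.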

Core Contradiction: smaller chunks give more precise retrieval but incomplete context; larger chunks give complete context but lower retrieval precision. Parent-child chunking addresses this trade-off directly by retrieving on small chunks and returning large ones.


Panoramic Comparison of 5 Strategies

RAG Chunking Strategies: Comparison of 5 Approaches

Figure 1: Applicable scenarios for five chunking strategies

Key Metrics Quantitative Comparison

| Strategy | Storage Expansion | Implementation Difficulty | Recall | Context Preservation |
|---|---|---|---|---|
| No Chunking | 1x | | 40% | 100% |
| Fixed Size | 2~3x | ⭐⭐ | 72% | 45% |
| Parent-Child | ~2.5x | ⭐⭐⭐⭐ | 91% | 98% |
| Semantic | 2~4x | ⭐⭐⭐ | 78% | 85% |
| Table 4-Level | 5~8x | ⭐⭐⭐⭐⭐ | 96%* | N/A* |

* Evaluated on table data only


Strategy 1: NoChunkStrategy — No Chunking

The simplest strategy: keep the original text as is, ingest as a single chunk.

class NoChunkStrategy(BaseChunkStrategy):
    """No chunking: keep the original granularity.

    Suitable for legal contracts, financial reports, and traceability-first scenarios.
    """

    def chunk(self, doc: Document, rule: ContentTypeRule) -> list[dict]:
        # The whole document becomes a single chunk with no parent.
        return [{
            "content": doc.content,
            "chunk_role": "original",
            "chunk_seq": 0,
            "parent_id": "",
            "content_type": doc.content_type.value,
        }]

When to use? When your business requirement is “find this document” rather than “find this paragraph.” For example, a lawyer searching a contract database needs the complete contract file.


See also within the site: 《RAG Offline Part: Metadata Enhancement and Knowledge Graph Fusion Preprocessing》 — Relationship between chunk boundaries, document structure, and metadata

Strategy 2: FixedSizeStrategy — Fixed Size Chunking

The most commonly used beginner strategy. The key design is boundary detection — not a hard cut, but a look-back to find the nearest sentence boundary.

class FixedSizeStrategy(BaseChunkStrategy):
    """
    Fixed-size chunking with boundary detection.

    Parameters:
    - chunk_size: target character count (default 512)
    - chunk_overlap: overlap between adjacent chunks (default 50)
    - min_chunk_size: minimum allowed chunk length (default 50)
    """

    def chunk(self, doc: Document, rule: ContentTypeRule) -> list[dict]:
        text = doc.content
        size = rule.chunk_size
        overlap = rule.chunk_overlap
        min_size = rule.min_chunk_size

        # Short document: a single chunk is enough.
        if len(text) <= size:
            return [{
                "content": text,
                "chunk_role": "chunk",
                # ... (remaining fields omitted in the original listing)
            }]

        chunks = []
        start = 0
        seq = 0

        while start < len(text):
            end = start + size

            # Key step: look back from the hard cut for the best split point.
            if end < len(text):
                boundary = self._find_boundary(text[start:end])
                if boundary > min_size:
                    end = start + boundary

            chunk_text = text[start:end].strip()
            if len(chunk_text) >= min_size:
                chunks.append({
                    "content": chunk_text,
                    "chunk_role": "chunk",
                    "chunk_seq": seq,
                    "parent_id": "",
                    # ... (remaining fields omitted in the original listing)
                })
                seq += 1

            # Step back by `overlap` so adjacent chunks share some context.
            # chunk_overlap must stay well below chunk_size, or start may stop advancing.
            start = end - overlap if end < len(text) else end

        return chunks

    @staticmethod
    def _find_boundary(text: str) -> int:
        """Find the best split point, trying separators in priority order."""
        # Full-width (Chinese) and ASCII punctuation, strongest boundary first.
        for sep in ["\n\n", "。", "!", "?", ".", "!", "\n", ";", ";"]:
            pos = text.rfind(sep)  # last occurrence inside the window
            if pos > len(text) * 0.3:  # reject cuts in the first 30% of the window
                return pos + len(sep)
        return len(text)  # no acceptable separator: keep the hard cut

Boundary Detection Priority Logic

Input (512-character window): "...The core of the attention mechanism is scaled dot-product. The specific formula is Attention=softmax(QK^T/√d_k)V..."

Search process:
1. rfind("\n\n") (paragraph boundary) → not found, try the next separator
2. rfind("。") (period) → last occurrence in the window is at position 380
3. Check 380 > 512 * 0.3 ≈ 154 → yes, cut at 381 (just after the period)

If the last "。" had been at position 18 instead, the check 18 > 154 would fail and the search would move on to the next separator in the list, not to an earlier "。".
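
The priority search can be tried in isolation. The sketch below is a standalone variant of `_find_boundary`, not the project code: the `SEPARATORS` list is a shortened, English-only assumption and `min_ratio` is a name introduced here for the 30% threshold.

```python
# Illustrative separator list, strongest boundary first (assumption: English text only).
SEPARATORS = ["\n\n", ". ", "! ", "? ", "\n", "; "]

def find_boundary(window: str, min_ratio: float = 0.3) -> int:
    """Return the cut offset inside `window`, trying separators in priority order."""
    for sep in SEPARATORS:
        pos = window.rfind(sep)            # last occurrence of this separator
        if pos > len(window) * min_ratio:  # ignore cuts that fall too early
            return pos + len(sep)          # cut right after the separator
    return len(window)                     # nothing acceptable: hard cut

# A late sentence end is accepted; the early one after "Short" is never considered.
window = "Short. " + "a" * 60 + ". " + "b" * 20
cut = find_boundary(window)

# The only sentence end is in the first 30% of the window, so the hard cut is kept.
early_only = "Hi. " + "c" * 96
hard_cut = find_boundary(early_only)

print(cut, hard_cut)
```

Because `rfind` returns the last occurrence, each separator type gets exactly one candidate position per window. A rejected candidate hands over to the next separator type rather than to an earlier occurrence of the same one.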

Strategy 3: ParentChildStrategy ⭐ — Parent-Child Chunking (Core Focus)

This is the core content of this article and the default strategy used in the project.

Intuitive Understanding

Traditional approach (Fixed Size Chunking):
Q: "How to solve Transformer gradient vanishing?"
Returns a note (128 chars): "...residual connection can mitigate..."
← You don't know what was said before or after

Parent-Child Chunking approach:
Q: "How to solve Transformer gradient vanishing?"
First finds precise index card (128 chars, high keyword match)
Then retrieves the entire chapter content (1024 chars) based on the index card
← Both precise matching and complete context

Full Source Code Walkthrough

import uuid


class ParentChildStrategy(BaseChunkStrategy):
    """
    Parent-child chunking strategy: the default strategy in this project.

    Core idea:
    - Large parent chunk (1024 chars): a complete semantic unit, the context ultimately returned to the LLM
    - Small child chunk (128 chars): fine-grained and semantically focused, the target of vector retrieval

    Workflow:
    1. The document is split into parent chunks according to parent_size
    2. Each parent chunk is further split into child chunks according to child_size
    3. All chunks are stored in Milvus (each with its own vector)
    4. During retrieval: child chunk matched → find parent chunk via parent_id → return parent chunk to LLM

    Parameters (adjustable in settings.py):
    - PARENT_CHUNK_SIZE: 1024
    - PARENT_CHUNK_OVERLAP: 100
    - CHILD_CHUNK_SIZE: 128
    - CHILD_CHUNK_OVERLAP: 20
    """

    def chunk(self, doc: Document, rule: ContentTypeRule) -> list[dict]:
        text = doc.content
        parent_size = rule.parent_size        # 1024
        parent_overlap = rule.parent_overlap  # 100
        child_size = rule.child_size          # 128
        child_overlap = rule.child_overlap    # 20

        # Small document: treat the whole text as a single parent chunk.
        if len(text) <= parent_size:
            parent_id = uuid.uuid4().hex[:16]
            children = self._split_children(
                text, child_size, child_overlap, parent_id, rule.min_chunk_size
            )
            result = [{
                "content": text,
                "chunk_role": "parent",
                "chunk_seq": 0,
                "parent_id": parent_id,  # a parent carries its own id in this field
                "content_type": doc.content_type.value,
            }]
            result.extend(children)
            return result

        # Normal flow: emit parent chunks one by one.
        result = []
        parent_seq = 0
        start = 0

        while start < len(text):
            end = min(start + parent_size, len(text))
            parent_text = text[start:end].strip()

            # Skip a whitespace-only window instead of stopping,
            # so the text after it is still chunked.
            if not parent_text:
                start = end
                continue

            parent_id = uuid.uuid4().hex[:16]

            # Add the parent chunk.
            result.append({
                "content": parent_text,
                "chunk_role": "parent",
                "chunk_seq": parent_seq,
                "parent_id": parent_id,
                "content_type": doc.content_type.value,
            })

            # Split the current parent chunk into child chunks.
            children = self._split_children(
                parent_text, child_size, child_overlap,
                parent_id, rule.min_chunk_size
            )
            result.extend(children)

            parent_seq += 1
            # Step back by the overlap, except after the final window.
            start = end - parent_overlap if end < len(text) else end

        return result

    @staticmethod
    def _split_children(text, size, overlap, parent_id, min_size):
        """Split a parent chunk into child chunks, each carrying parent_id."""
        children = []
        start = 0
        seq = 0
        while start < len(text):
            end = min(start + size, len(text))
            chunk = text[start:end].strip()
            # Fragments shorter than min_size are dropped as noise.
            if len(chunk) >= min_size:
                children.append({
                    "content": chunk,
                    "chunk_role": "child",
                    "chunk_seq": seq,
                    "parent_id": parent_id,
                    "content_type": "text",
                })
                seq += 1
            start = end - overlap if end < len(text) else end
        return children

Why is “Retrieve Child Chunk, Return Parent Chunk” the Best Practice?

| Dimension | Parent Only | Child Only | Parent-Child |
|---|---|---|---|
| Retrieval Precision | Low (too general) | High (precise) | High (child chunk retrieval) |
| Context Completeness | High (complete) | Low (fragmented) | High (returns parent chunk) |
| Vector Quality | Information diluted | Semantically focused | Each plays its role |
| Storage Overhead | 1x | ~8x | ~2.5x |
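
One way to sanity-check the ~2.5x figure is to replay the size/overlap loops from ParentChildStrategy on lengths alone. This is a rough sketch under stated assumptions: it counts stored characters only (not vectors or index overhead), ignores the effect of `strip()`, and uses the default sizes from the docstring. The `windows` and `stored_chars` helpers are names introduced here.

```python
def windows(length: int, size: int, overlap: int) -> list[int]:
    """Lengths of the sliding windows produced by the size/overlap loop (requires overlap < size)."""
    lengths, start = [], 0
    while start < length:
        end = min(start + size, length)
        lengths.append(end - start)
        start = end - overlap if end < length else end
    return lengths

def stored_chars(doc_len: int, parent_size: int = 1024, parent_overlap: int = 100,
                 child_size: int = 128, child_overlap: int = 20, min_child: int = 50) -> int:
    """Total characters stored for one document: every parent plus every kept child."""
    total = 0
    for parent_len in windows(doc_len, parent_size, parent_overlap):
        total += parent_len
        # Children shorter than min_child are dropped, as in _split_children.
        total += sum(c for c in windows(parent_len, child_size, child_overlap) if c >= min_child)
    return total

doc_len = 100_000
ratio = stored_chars(doc_len) / doc_len
print(f"{ratio:.2f}")
```

With these defaults the ratio comes out at roughly 2.4, in line with the ~2.5x in the table. Larger overlaps push it up quickly, since both levels duplicate text.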

Parent-Child Chunking Principle Diagram

Parent-Child Chunking Working Principle

Figure 2: Retrieve on child chunks, follow parent_id back to the parent chunk, then pass the parent chunk to the LLM


Strategy 4: SemanticStrategy — Semantic Chunking

Segments the document based on natural paragraphs and heading hierarchy, rather than a fixed character count.

import re


class SemanticStrategy(BaseChunkStrategy):
    """
    Semantic chunking: respects the natural structure of the document.

    Supported heading formats:
    - Markdown: # ## ### ####
    - Chinese: 第一章、一、(一)、1.
    - Numeric: 1.1 1.2 2.3.1
    """

    HEADING_PATTERNS = [
        re.compile(r'^(#{1,6}\s)'),
        re.compile(r'^第[一二三四五六七八九十百]+[章节篇部]'),
        re.compile(r'^[一二三四五六七八九十]+[、..]'),
        re.compile(r'^\d+[、..\s]'),
        re.compile(r'^\d+\.\d+[\s]'),
    ]

    def chunk(self, doc: Document, rule: ContentTypeRule) -> list[dict]:
        text = doc.content
        max_size = rule.chunk_size
        min_size = rule.min_chunk_size

        # Blank lines separate paragraphs.
        paragraphs = re.split(r'\n\s*\n', text)
        chunks = []
        current = ""
        seq = 0

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            is_heading = any(p.match(para) for p in self.HEADING_PATTERNS)

            if is_heading and len(current) >= min_size:
                # A new heading closes the current chunk and starts the next one.
                chunks.append({
                    "content": current.strip(),
                    "chunk_role": "semantic_chunk",
                    # ... (remaining fields omitted in the original listing)
                })
                seq += 1
                current = para + "\n\n"
            elif len(current) + len(para) > max_size and len(current) >= min_size:
                # Adding this paragraph would exceed max_size: flush first.
                chunks.append({
                    "content": current.strip(),
                    "chunk_role": "semantic_chunk",
                    # ... (remaining fields omitted in the original listing)
                })
                seq += 1
                current = para + "\n\n"
            else:
                current += para + "\n\n"

        # Flush the tail if it is long enough to keep.
        if current.strip() and len(current.strip()) >= min_size:
            chunks.append({
                "content": current.strip(),
                "chunk_role": "semantic_chunk",
                # ... (remaining fields omitted in the original listing)
            })

        # Fallback: nothing qualified, so return the whole text as one chunk.
        return chunks if chunks else [{"content": text, "chunk_role": "semantic_chunk"}]

Suitable for: textbooks, API documentation, technical manuals, and other documents with clear heading hierarchies.
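
Heading detection is the part of this strategy most worth testing by hand. The sketch below runs a pattern list adapted from HEADING_PATTERNS over a few sample lines. The samples are invented, and the full-width period in two of the character classes is an assumption about the intended pattern.

```python
import re

HEADING_PATTERNS = [
    re.compile(r"^(#{1,6}\s)"),                          # Markdown: "## Title"
    re.compile(r"^第[一二三四五六七八九十百]+[章节篇部]"),  # Chinese chapter: "第三章"
    re.compile(r"^[一二三四五六七八九十]+[、..]"),          # Chinese ordinal: "一、"
    re.compile(r"^\d+[、..\s]"),                          # "1." or "1 "
    re.compile(r"^\d+\.\d+[\s]"),                         # "1.2 "
]

def is_heading(paragraph: str) -> bool:
    """True if the paragraph starts like one of the supported heading formats."""
    return any(p.match(paragraph) for p in HEADING_PATTERNS)

samples = [
    "## Attention Mechanism",
    "第三章 模型结构",
    "一、背景",
    "1.2 Scaled dot-product attention",
    "Transformer is a deep learning architecture.",
    "2024 saw wide adoption of the architecture.",  # ordinary prose, still matched
]

for line in samples:
    print(is_heading(line), line)
```

The last sample shows a limitation of purely lexical detection: any paragraph that starts with digits followed by whitespace is treated as a heading, so numbered list items and sentences beginning with a year trigger a split as well.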


See also within the site: 《RAG Online Part: Retrieval Optimization — HyDE and Query Expansion Techniques》 — Coordination between chunk granularity, query rewriting, and expansion

Strategy 5: Table4LevelStrategy — Table 4-Level Vectorization

📌 This strategy will be detailed in Article 6: “RAG Always Ignores Table Data? 4-Level Granularity Vectorization Solution”.

import uuid


class Table4LevelStrategy(BaseChunkStrategy):
    """
    Table 4-level vectorization: table → row → col → cell.
    Each granularity corresponds to a different query intent.
    """

    def chunk(self, doc: Document, rule: ContentTypeRule) -> list[dict]:
        from data_base.table.table_vectorizer import table_vectorizer
        headers, rows = table_vectorizer.parse_html_table(doc.raw_content or "")

        # No parsable header row: fall back to a single whole-table chunk.
        if not headers:
            return [{"content": doc.content, "chunk_role": "table_full"}]

        table_id = uuid.uuid4().hex[:16]
        vec_entries = table_vectorizer.vectorize_table(
            table_id=table_id, headers=headers, rows=rows,
        )

        # One chunk per vectorized entry; table_id ties them back to the table.
        chunks = []
        for entry in vec_entries:
            chunks.append({
                "content": entry["text_content"],
                "chunk_role": f"table_{entry['vector_level']}",
                "parent_id": table_id,
                "vector_level": entry["vector_level"],
                "associate_id": entry["associate_id"],
            })
        return chunks

ChunkRouter Automatic Router

With 5 strategies, who decides which document uses which? The answer is ChunkRouter.

ChunkRouter and Profile Selection

Figure 3: ChunkRouter routing and Profile decision tree


Three Preset Profile Configurations

default (Default Mode)

rules=[
    ContentTypeRule(content_type="text", strategy=PARENT_CHILD,
                    parent_size=1024, child_size=128),
    ContentTypeRule(content_type="table", strategy=TABLE_4LEVEL),
    ContentTypeRule(content_type="image", strategy=NO_CHUNK),
    ContentTypeRule(content_type="formula", strategy=NO_CHUNK),
]

source_first (Traceability First)

All use NO_CHUNK, suitable for legal/contract/financial report scenarios.

precision (Precision Mode)

Text uses a smaller FixedSize(256), suitable for FAQ/customer service short Q&A.

Usage: --chunk-profile default
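
The routing idea can be sketched as a plain lookup. This is not the project's ChunkRouter, whose code is not shown in this article: `PROFILES`, `route`, and the strategy name strings are hypothetical. The precision profile here assumes that only text changes while other content types keep the default mapping.

```python
# Hypothetical profile tables: content_type -> strategy name.
DEFAULT_RULES = {
    "text": "parent_child",
    "table": "table_4level",
    "image": "no_chunk",
    "formula": "no_chunk",
}

PROFILES = {
    "default": DEFAULT_RULES,
    # source_first: every content type is ingested unchunked.
    "source_first": {content_type: "no_chunk" for content_type in DEFAULT_RULES},
    # Assumption: precision mode only swaps the text strategy.
    "precision": {**DEFAULT_RULES, "text": "fixed_size"},
}

def route(profile: str, content_type: str) -> str:
    """Pick the strategy name for a document of the given content type."""
    rules = PROFILES[profile]
    # Assumption: unknown content types fall back to no chunking.
    return rules.get(content_type, "no_chunk")

print(route("default", "text"), route("source_first", "text"), route("precision", "text"))
```

Keeping the mapping as data rather than as branching code means a new profile is one more dictionary entry, which matches the idea of selecting behaviour with a single --chunk-profile flag.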


Pitfall Guide

Pitfall 1: Overlap too large causing excessive repetition

Phenomenon: LLM receives context with a lot of repeated text

Solution: Set CHILD_CHUNK_OVERLAP to about 15% of child_size (roughly 20 for 128 characters), and never more than 30%.

Pitfall 2: Child chunks too small causing noise surge

Phenomenon: 128-character child chunks contain many meaningless fragments

Solution: Set a reasonable min_chunk_size (default 50), filter out overly short fragments

Pitfall 3: Chinese sentence segmentation cuts at wrong positions

Phenomenon: A Chinese word split in the middle (e.g., “注意力机” + “制”)

Solution: Add Chinese punctuation and word-boundary detection to _find_boundary, or use jieba word segmentation to check that a cut does not fall inside a word.


FAQ

Q: Storage increases by 2.5x, is it worth it?

A: In most production settings, yes. Recall improves from 72% (fixed size) to 91% (parent-child chunking), a gain of 19 percentage points. If recall is measured as a per-query hit rate, that is roughly 19 more queries out of every 100 that retrieve the relevant content, which usually outweighs the extra storage cost.

Q: When should I use source_first (no chunking)?

A: When your downstream requirement is “locate the original source” rather than “extract answer fragments.” Typical scenarios: legal compliance review, audit trails, citation traceability.

Q: How to use parent_id in retrieval for parent-child chunking?

A: After retrieving a child chunk, query Milvus or MySQL using the parent_id field to get the corresponding parent chunk content. Then put the parent chunk (not the child chunk) into the LLM’s context. See specific implementation in Article 7: “RRF Multi-Channel Fusion Ranking” under the layered fusion approach.
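
The lookup step can be sketched without any database. In the snippet below, `parent_store` is a plain dictionary standing in for the Milvus or MySQL query, and `resolve_parents`, the hit format, and the scores are all illustrative assumptions. The part that carries over is the deduplication: several child hits often point at the same parent, and that parent should reach the LLM only once.

```python
def resolve_parents(child_hits: list[dict], parent_store: dict[str, str], top_k: int = 3) -> list[str]:
    """Map ranked child hits to unique parent texts, best-scoring child first."""
    seen: set[str] = set()
    parents: list[str] = []
    for hit in sorted(child_hits, key=lambda h: h["score"], reverse=True):
        parent_id = hit["parent_id"]
        if parent_id in seen or parent_id not in parent_store:
            continue  # already returned, or no parent stored under this id
        seen.add(parent_id)
        parents.append(parent_store[parent_id])
        if len(parents) == top_k:
            break
    return parents

# Hypothetical retrieval result: two children of p1 and one child of p2.
child_hits = [
    {"parent_id": "p1", "score": 0.78},
    {"parent_id": "p2", "score": 0.80},
    {"parent_id": "p1", "score": 0.91},
]
parent_store = {"p1": "parent chunk about residual connections", "p2": "parent chunk about warmup"}

context = resolve_parents(child_hits, parent_store)
print(context)
```

Only the parent texts in `context` go into the prompt. The child chunks have done their job once they have identified which parents to fetch.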

Q: Can the 5 strategies be mixed in the same project?

A: Yes! That’s the original design intent of ChunkRouter. Documents with different content_type are automatically routed to different strategies — text goes to parent-child chunking, tables go to 4-level vectorization, images are not chunked.

Q: How to validate chunking effectiveness?

A: Recommended methods:

  1. Visual inspection: Randomly sample 10 chunks, manually judge if boundaries are reasonable
  2. Retrieval test: Construct 20 queries covering different topics, check if top-5 results hit the correct chunks
  3. End-to-end evaluation: Use LLM to score generated answers (faithfulness, completeness)
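
The retrieval test in step 2 is easy to automate. Here is a minimal sketch, assuming each query is labelled with the id of the chunk that should be found and that `retrieve` returns ranked chunk ids. The function name, the labelled queries, and the fake retriever are all made up for illustration.

```python
from typing import Callable

def hit_rate_at_k(labelled_queries: list[tuple[str, str]],
                  retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    if not labelled_queries:
        raise ValueError("need at least one labelled query")
    hits = 0
    for query, expected_id in labelled_queries:
        if expected_id in retrieve(query)[:k]:
            hits += 1
    return hits / len(labelled_queries)

# Hypothetical labelled set and a fake retriever backed by a lookup table.
labelled = [
    ("gradient vanishing", "c4"),
    ("attention formula", "c2"),
    ("warmup strategy", "c9"),
]
fake_results = {
    "gradient vanishing": ["c4", "c1"],
    "attention formula": ["c3", "c2"],
    "warmup strategy": ["c1", "c5"],
}

rate_at_5 = hit_rate_at_k(labelled, lambda q: fake_results[q], k=5)
rate_at_1 = hit_rate_at_k(labelled, lambda q: fake_results[q], k=1)
print(rate_at_5, rate_at_1)
```

Running the same labelled set before and after a chunking change gives a like-for-like comparison, which is more reliable than judging a handful of answers by eye.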


Topic Navigation and In-Site Extensions

This article belongs to the Enterprise-Level RAG Data Pipeline Practical Guide Series (8 engineering practice articles, to be read alongside the RAG Practical Full-Chain Theory Series).

Articles in This Series

Article Title
Article 1 Say Goodbye to Retrieval Hallucinations! Build an Enterprise-Level RAG Data Pipeline Step-by-Step (with Docker One-Click Deployment)
Article 2 PDF Extraction Always Loses Tables? Practical PyMuPDF + PaddleOCR-VL Hybrid Solution (with MLX Acceleration)
Article 3 How to Chunk for RAG Without Losing Context? 5 Strategies from Beginner to Production-Ready (with Selection Decision Tree)
Article 4 BGE-M3 Local Fine-Tuning Practice: From Scratch to Production Deployment (with Complete Code)
Article 5 Practical Guide to Milvus Production Collection Design + HNSW Tuning
Article 6 Table 4-Level Vectorization Scheme: Let RAG Systems Truly Understand Structured Data
Article 7 RRF Multi-Channel Fusion Ranking: The Secret Weapon to Boost RAG Retrieval Accuracy by 30%+
Article 8 MySQL+Milvus+MinIO Triple Store Dual-Write Architecture: Building an Enterprise-Level RAG Data Foundation
