📊 Table of Contents

  1. Why Fine-tune BGE-M3?
  2. Core Capabilities of BGE-M3
  3. Fine-tuning Environment Setup Guide
  4. Data Preparation: Building a High-Quality Training Set
  5. Complete Fine-tuning Pipeline Implementation
  6. Model Evaluation and Optimization Strategies
  7. Production-Grade Deployment
  8. Common Issues and Solutions
  9. Summary and Next Steps

Why Fine-tune BGE-M3?

Pain Points in Real-World Scenarios

Imagine this scenario:

User Query: “How should diabetic patients adjust their insulin dosage?”

Generic BGE-M3 Retrieval Results:

  1. ❌ “Basic symptoms and diagnostic criteria of diabetes” (Relevance: 0.62)
  2. ❌ “Overview of insulin types and usage” (Relevance: 0.58)
  3. ✅ “Clinical guidelines for insulin dose adjustment in type 2 diabetes patients” (Relevance: 0.71)

Although the third result is correct, the two irrelevant entries score almost as high (0.62 and 0.58 versus 0.71), so the margin separating the right answer from the wrong ones is thin. A generic model lacks domain knowledge and does not fully capture what the clinical term “dosage adjustment” means.

Improvements Brought by Fine-tuning

Based on our measured data:

Metric | Generic BGE-M3 | Domain Fine-tuned | Relative Improvement
P@10 (Top-10 Precision) | 0.72 | 0.89 | +23.6%
MRR (Mean Reciprocal Rank) | 0.65 | 0.82 | +26.2%
NDCG@10 | 0.68 | 0.85 | +25.0%
Professional Term Recall | 0.61 | 0.91 | +49.2%
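
The improvement column is the relative gain over the generic baseline, that is (fine-tuned − baseline) / baseline. A minimal sketch that recomputes it from the scores in the table; the helper name `relative_gain` is ours and not part of any library:

```python
def relative_gain(baseline: float, tuned: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (tuned - baseline) / baseline * 100


# (baseline, fine-tuned) scores copied from the table above
scores = {
    "P@10": (0.72, 0.89),
    "MRR": (0.65, 0.82),
    "NDCG@10": (0.68, 0.85),
    "Professional Term Recall": (0.61, 0.91),
}

gains = {name: round(relative_gain(b, t), 1) for name, (b, t) in scores.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

Reporting the relative gain next to the absolute scores makes it harder to overstate a jump that starts from a low baseline.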

💡 Key Insight: For vertical domain RAG systems, domain adaptation matters more than model size. A well-fine-tuned 7B parameter model often outperforms an un-tuned 13B model.


Core Capabilities of BGE-M3

Three Core Features

graph TB
    subgraph BGE_M3["BGE-M3 Multi-Function Embedding Model"]
        direction TB

        subgraph Multi_Functionality["Multi-Functionality"]
            Dense["🔵 Dense Retrieval<br/>Dense vector retrieval"]
            Sparse["🟢 Sparse Retrieval<br/>Sparse lexical matching"]
            Colbert["🟡 ColBERT<br/>Multi-vector interaction"]
        end

        subgraph Multi_Linguality["Multi-Linguality"]
            Lang1["🇨🇳 Chinese"]
            Lang2["🇺🇸 English"]
            Lang3["🇯🇵 Japanese"]
            Lang4["+ 100+ languages"]
        end

        subgraph Multi_Granularity["Multi-Granularity"]
            Short["📝 Short sentences"]
            Medium["📄 Medium-length documents"]
            Long["📚 Long documents (8192 tokens)"]
        end
    end

    Input["Input text"] --> BGE_M3

    BGE_M3 --> Output1["Dense Vector (1024 dimensions)"]
    BGE_M3 --> Output2["Sparse Weights (per-token weights)"]
    BGE_M3 --> Output3["ColBERT Vectors (multi-vector)"]
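
To make the three outputs concrete, here is a rough sketch of how each one can be turned into a relevance score, using toy values rather than real model output. The function names and the combination weights are illustrative assumptions, not FlagEmbedding's API:

```python
import numpy as np


def dense_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    """Dense mode: inner product of two L2-normalized vectors."""
    return float(np.dot(q_vec, d_vec))


def sparse_score(q_weights: dict, d_weights: dict) -> float:
    """Sparse mode: sum of weight products over tokens both texts share."""
    shared = q_weights.keys() & d_weights.keys()
    return float(sum(q_weights[t] * d_weights[t] for t in shared))


def colbert_score(q_vecs: np.ndarray, d_vecs: np.ndarray) -> float:
    """Multi-vector mode: each query token takes its best-matching
    document token, then the maxima are averaged over query tokens."""
    sim = q_vecs @ d_vecs.T          # (query_tokens, doc_tokens)
    return float(sim.max(axis=1).mean())


# Toy inputs (2-dimensional for readability; real dense vectors have 1024 dims)
dense = dense_score(np.array([0.6, 0.8]), np.array([0.8, 0.6]))
sparse = sparse_score({"insulin": 0.5, "dosage": 0.4},
                      {"insulin": 0.6, "glucose": 0.3})
colbert = colbert_score(np.array([[1.0, 0.0], [0.0, 1.0]]),
                        np.array([[1.0, 0.0], [0.6, 0.8]]))

# Hypothetical weights for a hybrid score; tune them on your own validation set
hybrid = 0.4 * dense + 0.2 * sparse + 0.4 * colbert
print(dense, sparse, colbert, hybrid)
```

Dense scores capture paraphrase-level similarity, sparse scores reward exact term overlap (useful for drug names and other domain vocabulary), and the multi-vector score sits in between.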

Technical Specifications Comparison

Feature | BGE-M3 | OpenAI text-embedding-3 | E5-mistral
Vector Dimension | 1024 | 1536 (small) / 3072 (large) | 4096
Max Sequence Length | 8192 tokens | 8191 tokens | 4096 tokens (recommended)
Language Support | 100+ languages | 50+ languages | Multilingual
Retrieval Modes | Dense + Sparse + ColBERT | Dense only | Dense only
Open Source License | MIT | Commercial (closed-source) | MIT
Local Deployment | ✅ Supported | ❌ API call | ✅ Supported
VRAM Requirement | ~16GB (FP16) | N/A | ~24GB

⚠️ Note: BGE-M3’s long document support is one of its biggest advantages. Most open-source models only support 512 tokens, whereas BGE-M3 can handle up to 8192 tokens, which is highly beneficial for document chunking in RAG systems.
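
A longer context window means fewer, larger chunks. The sketch below splits a token sequence into overlapping windows that fit a token budget. It assumes the text is already tokenized; in a real pipeline you would count tokens with the model's own tokenizer, and the overlap value is an arbitrary illustration:

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int = 8192, overlap: int = 256) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens, each sharing
    `overlap` tokens with the previous window."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Small numbers so the behaviour is easy to inspect
demo_tokens = [f"t{i}" for i in range(20)]
demo_chunks = chunk_tokens(demo_tokens, max_tokens=8, overlap=2)
print([len(c) for c in demo_chunks])
```

With a 512-token model the same document would need many more chunks, and a passage that spans a chunk boundary is more likely to be split.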


See also: “RAG Offline: Embedding Model Selection and Domain Adaptation Fine-tuning” — Methodology for embedding selection and domain fine-tuning.

Fine-tuning Environment Setup Guide

Hardware Requirements

graph LR
    subgraph Hardware["Recommended Hardware Configuration"]
        direction LR

        GPU["GPU memory"]
        CPU["CPU cores"]
        RAM["System memory"]
        Storage["Storage"]
    end

    subgraph Minimum["Minimum Configuration"]
        Min_GPU["≥ 16GB VRAM<br/>(RTX 4090/RTX 3090)"]
        Min_CPU["≥ 8 cores"]
        Min_RAM["≥ 32GB DDR4"]
        Min_Storage["≥ 100GB SSD"]
    end

    subgraph Recommended["Recommended Configuration"]
        Rec_GPU["≥ 24GB VRAM<br/>(A6000/A100)"]
        Rec_CPU["≥ 16 cores"]
        Rec_RAM["≥ 64GB DDR4"]
        Rec_Storage["≥ 500GB NVMe SSD"]
    end

    Hardware --> Minimum
    Hardware --> Recommended

Software Environment Setup

1. Create Virtual Environment

# Create conda environment
conda create -n bge-m3-finetune python=3.10 -y
conda activate bge-m3-finetune

# Or use venv
python -m venv bge-m3-env
source bge-m3-env/bin/activate # Linux/Mac
# bge-m3-env\Scripts\activate # Windows

2. Install Dependencies

# Install PyTorch (choose based on your CUDA version)
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118

# Install FlagEmbedding framework
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .

# Install other dependencies
pip install datasets transformers accelerate peft bitsandbytes
pip install wandb tensorboard scikit-learn pandas

3. Verify Installation

# test_installation.py
from FlagEmbedding import BGEM3FlagModel
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU model: {torch.cuda.get_device_name(0)}")
    # total_memory is reported in bytes; convert to GB
    print(f"VRAM size: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

# Test model loading (downloads the weights on first run)
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
test_sentences = ["This is a test sentence"]
output = model.encode(test_sentences)
print(f"✅ Model loaded successfully! Output shape: {output['dense_vecs'].shape}")

Run the test:

python test_installation.py

Expected output:

PyTorch version: 2.1.0
CUDA available: True
GPU model: NVIDIA GeForce RTX 4090
VRAM size: 24.0 GB
✅ Model loaded successfully! Output shape: (1, 1024)

Data Preparation: Building a High-Quality Training Set

Data Format Requirements

BGE-M3 fine-tuning requires triplet data: (query, positive, negative)
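
As a sketch of what such a file can look like, the snippet below writes triplets as JSON Lines, one object per line with the keys `query`, `positive` and `negative` (the keys the fine-tuning script in this guide reads). The helper name `write_triplets` and the sample texts are our own; FlagEmbedding's bundled training scripts use a different schema, so check its documentation if you train with those instead.

```python
import json
import os
import tempfile
from typing import Iterable, Tuple


def write_triplets(triplets: Iterable[Tuple[str, str, str]], path: str) -> int:
    """Write (query, positive, negative) triplets as JSONL; return the row count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for query, positive, negative in triplets:
            record = {"query": query, "positive": positive, "negative": negative}
            # ensure_ascii=False keeps non-English text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


sample_triplets = [
    (
        "How should insulin dosage be adjusted?",
        "Adjust the basal insulin dose according to fasting glucose readings.",
        "Diagnostic criteria for diabetes include fasting plasma glucose.",
    ),
]
# Written to a temporary directory here; use ./data/train.jsonl in a real run
demo_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
rows_written = write_triplets(sample_triplets, demo_path)
print(f"Wrote {rows_written} triplet(s) to {demo_path}")
```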

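graph TB
    subgraph Data_Format["Training Data Format"]
        direction TB

        Query["Query<br/>The user's question or search terms"]

        Positive["Positive<br/>Document snippet highly relevant to the query"]

        Negative["Negative<br/>Document irrelevant or only weakly relevant to the query"]
    end

    Query --> |"Relevance ≥ 0.8"| Positive
    Query --> |"Relevance ≤ 0.3"| Negative

    subgraph Example["Example"]
        Ex_Q["💬 Query: 'How should insulin dosage be adjusted?'"]
        Ex_P["✅ Positive: 'Patients with type 2 diabetes should, based on blood glucose monitoring results...'"]
        Ex_N["❌ Negative: 'Diagnostic criteria for diabetes include fasting blood glucose...'"]
    end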

Data Collection Strategies

Method 1: Mining Training Data from RAG System Logs

# collect_training_data.py
"""
Mining high-quality training data from RAG system logs
"""
import json
from typing import Optional, Tuple

import pandas as pd


class TrainingDataCollector:
    def __init__(self, log_file_path: str):
        self.log_file_path = log_file_path
        self.data = []

    def parse_rag_logs(self) -> pd.DataFrame:
        """Parse query logs from RAG system (one JSON object per line)"""
        with open(self.log_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                try:
                    log_entry = json.loads(line.strip())
                except json.JSONDecodeError:
                    continue  # skip malformed lines
                if self._is_valid_query(log_entry):
                    triple = self._extract_triple(log_entry)
                    if triple:
                        self.data.append(triple)

        return pd.DataFrame(self.data, columns=['query', 'positive', 'negative'])

    def _is_valid_query(self, log_entry: dict) -> bool:
        """Validate log entry"""
        required_fields = ['query', 'retrieved_docs', 'user_feedback']
        return all(field in log_entry for field in required_fields)

    def _extract_triple(self, log_entry: dict) -> Optional[Tuple[str, str, str]]:
        """
        Extract training triple from log entry

        Strategy:
        - Query: user's original query
        - Positive: document the user clicked or rated positively
        - Negative: documents retrieved but not clicked
        """
        query = log_entry['query']
        retrieved_docs = log_entry['retrieved_docs']
        helpful_ids = set(log_entry['user_feedback'].get('helpful_docs', []))

        # Extract positives (documents explicitly marked as helpful)
        positives = [
            doc['content'] for doc in retrieved_docs
            if doc['doc_id'] in helpful_ids
        ]

        # Extract negatives (retrieved but no user interaction)
        negatives = [
            doc['content'] for doc in retrieved_docs
            if doc['doc_id'] not in helpful_ids
            and doc['rank'] > 3  # lower-ranked ones are more likely negatives
        ]

        if positives and negatives:
            return (query, positives[0], negatives[0])
        return None  # entry has no usable positive/negative pair


# Usage
collector = TrainingDataCollector('rag_query_logs.jsonl')
training_df = collector.parse_rag_logs()
print(f"Collected {len(training_df)} training samples")
print(training_df.head())
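
The collector assumes a particular log schema: each line carries `query`, a `retrieved_docs` list whose items have `doc_id`, `content` and `rank`, and a `user_feedback` object with a `helpful_docs` list. The sketch below shows one hypothetical entry in that shape and applies the same selection rule in a standalone function; your own logging format will likely differ, so adapt the field names.

```python
import json
from typing import Optional, Tuple

# One hypothetical log line in the schema the collector expects
log_line = json.dumps({
    "query": "How should insulin dosage be adjusted?",
    "retrieved_docs": [
        {"doc_id": "d1", "rank": 1, "content": "Guideline for insulin dose titration."},
        {"doc_id": "d2", "rank": 2, "content": "Overview of insulin types."},
        {"doc_id": "d5", "rank": 5, "content": "Diagnostic criteria for diabetes."},
    ],
    "user_feedback": {"helpful_docs": ["d1"]},
})


def extract_triple(entry: dict) -> Optional[Tuple[str, str, str]]:
    """Positive = first doc marked helpful; negative = first unhelpful doc ranked below 3."""
    helpful = set(entry["user_feedback"].get("helpful_docs", []))
    docs = entry["retrieved_docs"]
    positives = [d["content"] for d in docs if d["doc_id"] in helpful]
    negatives = [d["content"] for d in docs
                 if d["doc_id"] not in helpful and d["rank"] > 3]
    if positives and negatives:
        return (entry["query"], positives[0], negatives[0])
    return None


triple = extract_triple(json.loads(log_line))
print(triple)
```

Note that the rank-2 document is neither positive nor negative here: it was not marked helpful, but it ranked too high to be treated as a safe negative.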

Method 2: Synthetic Data Generation with LLM

# generate_synthetic_data.py
"""
Generate synthetic training data from query templates
(an LLM can be plugged in later for more natural queries)
Suitable for cold-start scenarios (no real user logs)
"""
import random
from string import Formatter
from typing import List, Tuple

import pandas as pd


class SyntheticDataGenerator:
    def __init__(self, documents: List[str], domain: str = "medical"):
        self.documents = documents
        self.domain = domain
        self.templates = self._load_templates()

    def _load_templates(self) -> List[str]:
        """Load query templates for different domains"""
        templates = {
            "medical": [
                "What should {disease} patients pay attention to when {action}?",
                "How to treat {symptom}?",
                "What are the side effects of {drug}?",
                "What are the preventive measures for {condition}?",
            ],
            "legal": [
                "According to {law}, how to handle {situation}?",
                "What is the statute of limitations for {case_type} cases?",
                "How to protect {right} when it is infringed?",
            ],
            # Must be a list (not a set) so random.choice can index into it
            "finance": [
                "How to calculate {financial_concept}?",
                "What is the risk level of {investment_type}?",
                "What investment strategy should be adopted under {market_condition}?",
            ],
        }
        return templates.get(self.domain, templates["medical"])

    def generate_queries(self, num_samples: int = 1000) -> List[Tuple[str, str, str]]:
        """
        Generate synthetic training data

        Returns: [(query, positive, negative), ...]
        """
        training_data = []

        for _ in range(num_samples):
            # Randomly select a document as the positive base
            positive_doc = random.choice(self.documents)

            # Generate query
            query = self._generate_query_from_doc(positive_doc)

            # Select negative (randomly pick another document; needs at least 2 distinct documents)
            negative_doc = random.choice([d for d in self.documents if d != positive_doc])

            # Truncate both documents to 512 characters
            training_data.append((query, positive_doc[:512], negative_doc[:512]))

        return training_data

    def _generate_query_from_doc(self, document: str) -> str:
        """Generate a natural language query from document content"""
        # Here you could integrate an LLM for more natural queries
        # For simplicity, use template approach
        template = random.choice(self.templates)

        # Extract keywords from document (in practice use NER or keyword extraction)
        words = document.split()[:10]
        keywords = ' '.join(words[:3])

        # Fill every placeholder in the template ({disease}, {action}, ...) with the keywords
        field_names = [name for _, name, _, _ in Formatter().parse(template) if name]
        return template.format(**{name: keywords for name in field_names})


# Usage
documents = load_your_documents()  # Placeholder: supply your own loader returning List[str]
generator = SyntheticDataGenerator(documents, domain="medical")
synthetic_data = generator.generate_queries(num_samples=500)

synthetic_df = pd.DataFrame(
    synthetic_data,
    columns=['query', 'positive', 'negative']
)
print(f"Generated {len(synthetic_df)} synthetic training samples")
synthetic_df.to_csv('synthetic_training_data.csv', index=False)

Method 3: Manual Annotation (High quality but high cost)

# annotation_interface.py
"""
Manual annotation tool interface description
"""
annotation_guide = """

## Training Data Annotation Guide
### Annotation Task
For each query, judge the relevance level of the document snippet:

### Relevance Level Definitions
- **⭐⭐⭐ Highly Relevant (Positive)**: Document directly answers the query
- **⭐⭐ Somewhat Relevant**: Document contains related information but not complete
- **⭐ Not Relevant (Negative)**: Document is unrelated or misleading

### Annotation Examples
**Query**: "How to adjust insulin dosage?"

**Document A**: "Type 2 diabetes patients should monitor fasting blood glucose weekly..."
→ Rating: ⭐⭐⭐ (Highly relevant - direct answer)

**Document B**: "Diagnostic criteria for diabetes include..."
→ Rating: ⭐ (Not relevant - different topic)

**Document C**: "Insulin types include rapid-acting, short-acting, intermediate-acting..."
→ Rating: ⭐⭐ (Somewhat relevant - background information)

### Quality Control
- Each sample independently annotated by at least 2 people
- Consistency requirement: Cohen's Kappa ≥ 0.6
- Disputes resolved by expert arbitration
"""

print(annotation_guide)
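
The guide's consistency requirement (Cohen's Kappa ≥ 0.6) can be checked with a few lines of code. This is a minimal sketch of the standard two-annotator formula, kappa = (p_o − p_e) / (1 − p_e); the ratings are made-up star levels, not real annotation data:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of samples")
    n = len(rater_a)
    # p_o: fraction of samples where the two annotators agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected if each annotator labeled at random
    # according to their own label frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Star ratings (3 = highly relevant, 2 = somewhat, 1 = not relevant) from two annotators
annotator_1 = [3, 3, 2, 1, 1, 3, 2, 1, 3, 1]
annotator_2 = [3, 3, 2, 1, 2, 3, 2, 1, 3, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}  (passes threshold: {kappa >= 0.6})")
```

If scikit-learn is already installed, `sklearn.metrics.cohen_kappa_score` computes the same statistic.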

Data Quality Checks

# data_quality_check.py
"""
Training data quality check script
"""
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def check_data_quality(df: pd.DataFrame) -> dict:
    """
    Check training data quality
    Returns a quality report dictionary
    """
    report = {}

    # 1. Basic statistics (lengths are in characters)
    report['total_samples'] = len(df)
    report['avg_query_length'] = df['query'].str.len().mean()
    report['avg_positive_length'] = df['positive'].str.len().mean()
    report['avg_negative_length'] = df['negative'].str.len().mean()

    # 2. Check duplicates
    duplicates = df.duplicated(subset=['query', 'positive']).sum()
    report['duplicate_rate'] = duplicates / len(df) * 100

    # 3. Check positive-negative similarity (ensure negatives are indeed not similar)
    # Fit on both columns so positives and negatives share one vocabulary
    vectorizer = TfidfVectorizer(max_features=1000)
    vectorizer.fit(pd.concat([df['positive'], df['negative']]))

    pos_vectors = vectorizer.transform(df['positive'])
    neg_vectors = vectorizer.transform(df['negative'])

    # Row-wise cosine similarity between each positive and its paired negative
    similarities = []
    for i in range(len(df)):
        sim = cosine_similarity(
            pos_vectors[i:i+1],
            neg_vectors[i:i+1]
        )[0][0]
        similarities.append(sim)

    report['avg_pos_neg_similarity'] = np.mean(similarities)
    report['max_pos_neg_similarity'] = np.max(similarities)

    # 4. Quality score (start at 100 and subtract a penalty per failed check)
    quality_score = 100
    if report['duplicate_rate'] > 5:
        quality_score -= 20
        print("⚠️ Warning: High duplicate rate (>5%)")

    if report['avg_pos_neg_similarity'] > 0.3:
        quality_score -= 30
        print("⚠️ Warning: Positives and negatives are too similar")

    if report['total_samples'] < 1000:
        quality_score -= 20
        print("⚠️ Warning: Insufficient training samples (<1000)")

    report['quality_score'] = quality_score

    return report


# Usage
df = pd.read_csv('training_data.csv')
quality_report = check_data_quality(df)

print("\n=== Data Quality Report ===")
for key, value in quality_report.items():
    if isinstance(value, float):
        print(f"{key}: {value:.2f}")
    else:
        print(f"{key}: {value}")

if quality_report['quality_score'] >= 80:
    print("\n✅ Data quality acceptable, ready for training!")
else:
    print("\n❌ Data quality insufficient, needs cleaning before training")

Complete Fine-tuning Pipeline Implementation

Fine-tuning Architecture Overview

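flowchart TB
    subgraph Preparation["Stage 1: Data Preparation"]
        A1[Raw documents] --> A2[Data collection]
        A2 --> A3[Quality control]
        A3 --> A4[Format conversion]
        A4 --> A5[Train/validation split]
    end

    subgraph FineTuning["Stage 2: Model Fine-tuning"]
        B1[Load pretrained model] --> B2[Configure training parameters]
        B2 --> B3[Set up LoRA adapter]
        B3 --> B4[Run training loop]
        B4 --> B5[Save checkpoints]
    end

    subgraph Evaluation["Stage 3: Evaluation and Optimization"]
        C1[Load best model] --> C2[Evaluate on validation set]
        C2 --> C3[Analyze error cases]
        C3 --> C4[Hyperparameter tuning]
        C4 --> C5[Export final model]
    end

    subgraph Deployment["Stage 4: Production Deployment"]
        D1[Model quantization] --> D2[Service wrapping]
        D2 --> D3[Performance testing]
        D3 --> D4[A/B testing]
        D4 --> D5[Full rollout]
    end

    Preparation --> FineTuning
    FineTuning --> Evaluation
    Evaluation --> Deployment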
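
The last step of Stage 1, the train/validation split, can be sketched as a seeded shuffle followed by a cut. The function name, the 10% validation ratio and the seed are illustrative choices, not values taken from the pipeline above:

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_valid_split(samples: Sequence[T], valid_ratio: float = 0.1,
                      seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed, then hold out valid_ratio of the samples."""
    if not 0 < valid_ratio < 1:
        raise ValueError("valid_ratio must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # local RNG: does not touch global state
    n_valid = max(1, int(len(shuffled) * valid_ratio))
    return shuffled[n_valid:], shuffled[:n_valid]


all_samples = list(range(100))              # stand-ins for triplet records
train_set, valid_set = train_valid_split(all_samples)
print(len(train_set), len(valid_set))
```

If the same query appears in several triplets, consider grouping by query before splitting so that no query ends up on both sides.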

Complete Fine-tuning Code Implementation

# finetune_bge_m3.py
"""
BGE-M3 fine-tuning script (LoRA by default, full fine-tuning with --use_lora False)
Trains the dense embedding with a triplet contrastive loss; the sparse and
ColBERT heads are not updated by this script
"""

import os
import json
import logging
from dataclasses import dataclass, fields
from typing import Dict, List

import torch
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer,
    AutoModel,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    default_data_collator,
)
from peft import LoraConfig, get_peft_model, TaskType

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class FinetuneConfig:
    """Fine-tuning configuration class (every field is also a command-line flag)"""
    # Model config
    model_name_or_path: str = "BAAI/bge-m3"
    output_dir: str = "./bge-m3-finetuned"

    # Data config
    train_file: str = "./data/train.jsonl"
    valid_file: str = "./data/valid.jsonl"
    max_seq_length: int = 512

    # Training config
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 4
    per_device_eval_batch_size: int = 8
    learning_rate: float = 2e-5
    warmup_ratio: float = 0.1
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0
    gradient_accumulation_steps: int = 4

    # LoRA config
    use_lora: bool = True
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05

    # Other config
    save_strategy: str = "steps"
    save_steps: int = 100
    eval_steps: int = 100
    logging_steps: int = 10
    fp16: bool = True
    bf16: bool = False


class TripletDataset(Dataset):
    """Triplet training dataset"""

    def __init__(
        self,
        data_path: str,
        tokenizer: AutoTokenizer,
        max_length: int = 512
    ):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.data = self._load_data(data_path)

    def _load_data(self, data_path: str) -> List[Dict]:
        """Load JSONL format training data"""
        data = []
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                if line.strip():  # ignore blank lines
                    data.append(json.loads(line))
        logger.info(f"Loaded {len(data)} training samples from {data_path}")
        return data

    def __len__(self):
        return len(self.data)

    def _tokenize(self, text: str, prefix: str) -> Dict[str, torch.Tensor]:
        """Tokenize one text and prefix the keys (query_ / pos_ / neg_)"""
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            f'{prefix}_input_ids': encoding['input_ids'].squeeze(0),
            f'{prefix}_attention_mask': encoding['attention_mask'].squeeze(0),
        }

    def __getitem__(self, idx):
        item = self.data[idx]
        features = {}
        features.update(self._tokenize(item['query'], 'query'))
        features.update(self._tokenize(item['positive'], 'pos'))
        features.update(self._tokenize(item['negative'], 'neg'))
        return features


class BGETrainer(Trainer):
    """Custom Trainer implementing contrastive learning loss"""

    temperature = 0.05

    @staticmethod
    def _embed(model, input_ids, attention_mask):
        """Dense embedding: L2-normalized hidden state of the first ([CLS]) token"""
        hidden = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return torch.nn.functional.normalize(hidden[:, 0], dim=-1)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        """
        Compute InfoNCE loss (contrastive learning loss)

        Loss function:
        L = -log(exp(sim(q,p)/τ) / [exp(sim(q,p)/τ) + exp(sim(q,n)/τ)])
        """
        # The encoder only accepts input_ids/attention_mask, so run it once per text
        query_emb = self._embed(model, inputs['query_input_ids'], inputs['query_attention_mask'])
        pos_emb = self._embed(model, inputs['pos_input_ids'], inputs['pos_attention_mask'])
        neg_emb = self._embed(model, inputs['neg_input_ids'], inputs['neg_attention_mask'])

        # Compute similarities, each of shape (batch_size,)
        pos_sim = torch.cosine_similarity(query_emb, pos_emb, dim=-1) / self.temperature
        neg_sim = torch.cosine_similarity(query_emb, neg_emb, dim=-1) / self.temperature

        # InfoNCE loss: the positive sits at index 0 of every row
        logits = torch.stack([pos_sim, neg_sim], dim=1)  # (batch_size, 2)
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)

        loss = torch.nn.functional.cross_entropy(logits, labels)

        outputs = {'query_embedding': query_emb, 'pos_embedding': pos_emb, 'neg_embedding': neg_emb}
        return (loss, outputs) if return_outputs else loss

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        """Report the same contrastive loss on the validation set (logged as eval_loss)"""
        inputs = self._prepare_inputs(inputs)
        with torch.no_grad():
            loss = self.compute_loss(model, inputs)
        return loss.detach(), None, None


def setup_model_for_finetuning(config: FinetuneConfig):
    """
    Set up model for fine-tuning
    Supports both full fine-tuning and LoRA fine-tuning
    """
    logger.info(f"Loading pretrained model: {config.model_name_or_path}")

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(config.model_name_or_path)
    model = AutoModel.from_pretrained(
        config.model_name_or_path,
        trust_remote_code=True
    )

    # Configure LoRA (if enabled)
    if config.use_lora:
        logger.info("Configuring LoRA adapter...")
        lora_config = LoraConfig(
            task_type=TaskType.FEATURE_EXTRACTION,
            r=config.lora_r,
            lora_alpha=config.lora_alpha,
            lora_dropout=config.lora_dropout,
            target_modules=["query", "key", "value"],  # attention projections
            bias="none"
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()

    return model, tokenizer


def main():
    """Main training function"""
    # Parse command-line flags (e.g. --train_file, --use_lora True) into the config
    (config,) = HfArgumentParser(FinetuneConfig).parse_args_into_dataclasses()

    # Log the device (the Trainer moves the model and batches itself)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    logger.info(f"Using device: {device}")

    # Set up model
    model, tokenizer = setup_model_for_finetuning(config)

    # Prepare datasets
    train_dataset = TripletDataset(
        config.train_file,
        tokenizer,
        max_length=config.max_seq_length
    )

    valid_dataset = TripletDataset(
        config.valid_file,
        tokenizer,
        max_length=config.max_seq_length
    ) if os.path.exists(config.valid_file) else None
    has_valid = valid_dataset is not None

    # The evaluation-strategy argument was renamed across transformers releases
    arg_names = {f.name for f in fields(TrainingArguments)}
    eval_key = "eval_strategy" if "eval_strategy" in arg_names else "evaluation_strategy"

    # Training arguments
    training_args = TrainingArguments(
        output_dir=config.output_dir,
        num_train_epochs=config.num_train_epochs,
        per_device_train_batch_size=config.per_device_train_batch_size,
        per_device_eval_batch_size=config.per_device_eval_batch_size,
        learning_rate=config.learning_rate,
        warmup_ratio=config.warmup_ratio,
        weight_decay=config.weight_decay,
        max_grad_norm=config.max_grad_norm,
        gradient_accumulation_steps=config.gradient_accumulation_steps,

        save_strategy=config.save_strategy,
        save_steps=config.save_steps,
        eval_steps=config.eval_steps,
        logging_steps=config.logging_steps,

        fp16=config.fp16,
        bf16=config.bf16,

        # Picking the best checkpoint requires evaluation on the save schedule
        load_best_model_at_end=has_valid,
        metric_for_best_model="eval_loss",
        greater_is_better=False,

        # Keep the query_/pos_/neg_ columns; the Trainer would otherwise drop them
        remove_unused_columns=False,

        dataloader_pin_memory=True,
        dataloader_num_workers=4,

        report_to="tensorboard",
        logging_dir=f"{config.output_dir}/logs",
        **{eval_key: config.save_strategy if has_valid else "no"},
    )

    # Initialize Trainer (inputs are already padded, so the default collator just stacks them)
    trainer = BGETrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        data_collator=default_data_collator,
    )

    # Start training
    logger.info("🚀 Starting fine-tuning...")
    train_result = trainer.train()

    # Save final model
    logger.info("Saving final model...")
    final_dir = f"{config.output_dir}/final"
    if config.use_lora:
        # Merge the LoRA weights into the base model so the result loads like a regular checkpoint
        trainer.model.merge_and_unload().save_pretrained(final_dir)
    else:
        trainer.save_model(final_dir)
    tokenizer.save_pretrained(final_dir)

    # Output training metrics
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)

    logger.info("✅ Fine-tuning completed!")


if __name__ == "__main__":
    main()
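
To build intuition for the loss and its temperature, the following standalone sketch evaluates the same two-way InfoNCE formula on plain numbers. The similarity values are illustrative, and `info_nce` is our own helper rather than part of the script:

```python
import math


def info_nce(pos_sim: float, neg_sim: float, temperature: float = 0.05) -> float:
    """-log( exp(pos/τ) / (exp(pos/τ) + exp(neg/τ)) ), computed stably."""
    pos = pos_sim / temperature
    neg = neg_sim / temperature
    m = max(pos, neg)  # subtract the max before exponentiating to avoid overflow
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos


# Positive only slightly ahead of the negative: the loss is still noticeable
close_loss = info_nce(0.71, 0.62)
# Positive far ahead: the loss is close to zero
clear_loss = info_nce(0.90, 0.20)
# Positive and negative tied: the loss equals ln(2)
tied_loss = info_nce(0.50, 0.50)

print(f"close: {close_loss:.4f}  clear: {clear_loss:.6f}  tied: {tied_loss:.4f}")
```

A small temperature such as 0.05 magnifies similarity gaps: a difference of 0.09 in cosine similarity becomes a gap of 1.8 in the logits, so the model is pushed hard to separate near-miss negatives.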

Running the Fine-tuning

# 1. Prepare data directory
mkdir -p ./data ./bge-m3-finetuned

# 2. Place training data in ./data directory
# train.jsonl and valid.jsonl

# 3. Run fine-tuning
python finetune_bge_m3.py \
--model_name_or_path BAAI/bge-m3 \
--train_file ./data/train.jsonl \
--valid_file ./data/valid.jsonl \
--output_dir ./bge-m3-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--learning_rate 2e-5 \
--use_lora True \
--fp16 True

# 4. Monitor training process
tensorboard --logdir=./bge-m3-finetuned/logs
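
Before launching a run it helps to estimate how many optimizer steps the settings above produce. The sketch below does that arithmetic; the dataset size of 5,000 triplets is a made-up example, and the Trainer's own rounding may differ from this estimate by a few steps:

```python
import math


def training_schedule(num_samples: int, per_device_batch_size: int = 4,
                      grad_accum_steps: int = 4, num_devices: int = 1,
                      epochs: int = 3, warmup_ratio: float = 0.1) -> dict:
    """Approximate optimizer-step counts for a given configuration."""
    # Samples consumed per optimizer update
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


schedule = training_schedule(num_samples=5000)
print(schedule)
```

With the defaults used here, checkpoints every 100 steps give roughly nine evaluation points over the run, which is enough to see a validation-loss trend.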

Training Monitoring Dashboard

During training, you can monitor progress via TensorBoard:

# Start TensorBoard
tensorboard --logdir=./bge-m3-finetuned/logs --port=6006

# Access in browser at http://localhost:6006

Key metrics to monitor:

  • Training Loss: Should steadily decrease
  • Validation Loss: Should decrease along with training loss; if it starts rising while training loss keeps falling, the model is likely overfitting
  • Learning Rate: Should follow the warmup schedule
  • GPU Utilization: Target >80%
  • GPU Memory: Monitor for OOM
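
The validation-loss rule can be automated with a patience check: stop when the most recent evaluations have all failed to beat the best earlier value. This is a minimal sketch with made-up loss values; the `patience` of 3 is an arbitrary choice, and transformers also ships an `EarlyStoppingCallback` that serves the same purpose inside the Trainer.

```python
from typing import Sequence


def should_stop(val_losses: Sequence[float], patience: int = 3,
                min_delta: float = 0.0) -> bool:
    """True if none of the last `patience` evaluations improved on the
    best loss seen before them by more than min_delta."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)


# Loss bottoms out at 0.60, then climbs for three evaluations in a row
overfitting_run = [0.90, 0.70, 0.60, 0.61, 0.63, 0.66]
# Loss is still improving at every evaluation
healthy_run = [0.90, 0.70, 0.60, 0.55]

print(should_stop(overfitting_run), should_stop(healthy_run))
```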

See also: “RAG Evaluation: Full-Pipeline Metrics Design and Effectiveness Evaluation Framework” — How to use metrics to verify gains before and after fine-tuning.

Model Evaluation and Optimization Strategies

Evaluation Metrics Framework

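graph TB
    subgraph Metrics["Evaluation Metrics Framework"]
        direction TB

        subgraph RetrievalMetrics["Retrieval Quality Metrics"]
            P_at_K["Precision@K<br/>Top-K precision"]
            Recall_at_K["Recall@K<br/>Recall"]
            MRR["MRR<br/>Mean Reciprocal Rank"]
            NDCG["NDCG@K<br/>Normalized Discounted Cumulative Gain"]
        end

        subgraph EmbeddingQuality["Embedding Quality Metrics"]
            IntraCluster["Intra-cluster compactness"]
            InterCluster["Inter-cluster separation"]
            SemanticSpace["Uniformity of the semantic space"]
        end

        subgraph BusinessMetrics["Business Metrics"]
            CTR["Click-through rate"]
            UserSatisfaction["User satisfaction"]
            Latency["Inference latency"]
        end
    end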

Evaluation Code Implementation

# evaluate_model.py
"""
BGE-M3 fine-tuned model evaluation script
Supports multiple evaluation metrics and visual analysis
"""
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
from FlagEmbedding import BGEM3FlagModel
from sklearn.metrics import ndcg_score
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import time


class ModelEvaluator:
    """Model evaluator"""
    
    def __init__(self, model_path: str, use_fp16: bool = True):
        self.model = BGEM3FlagModel(model_path, use_fp16=use_fp16)
        self.results = defaultdict(list)
    
    def encode_documents(self, documents: List[str]) -> np.ndarray:
        """Encode a collection of documents"""
        output = self.model.encode(
            documents,
            batch_size=32,
            max_length=512,
            return_dense=True,
            return_sparse=False,
            return_colbert_vecs=False
        )
        return output['dense_vecs']
    
    def compute_retrieval_metrics(
        self,
        queries: List[str],
        relevant_docs: Dict[str, List[str]],
        corpus: List[str],
        k_values: List[int] = [1, 5, 10, 20]
    ) -> Dict:
        """
        Compute retrieval metrics
        
        Parameters:
            queries: list of queries
            relevant_docs: {query_id: [relevant_doc_ids]}
            corpus: document corpus
            k_values: list of k values to evaluate
        """
        start_time = time.time()
        
        # Encode all content
        query_embeddings = self.encode_documents(queries)
        corpus_embeddings = self.encode_documents(corpus)
        
        # Compute similarity matrix
        similarity_matrix = np.dot(query_embeddings, corpus_embeddings.T)
        
        metrics = {}
        
        for k in k_values:
            precisions = []
            recalls = []
            rr_list = []  # Reciprocal Ranks
            ndcg_scores = []
            
            for idx, query in enumerate(queries):
                # Get Top-K results
                scores = similarity_matrix[idx]
                top_k_indices = np.argsort(scores)[::-1][:k]
                
                # Get relevant document set
                query_key = f"q_{idx}"
                relevant_set = set(relevant_docs.get(query_key, []))
                
                # Precision@K
                retrieved_relevant = sum(
                    1 for i in top_k_indices if f"doc_{i}" in relevant_set
                )
                precision = retrieved_relevant / k
                precisions.append(precision)
                
                # Recall@K
                recall = retrieved_relevant / len(relevant_set) if relevant_set else 0
                recalls.append(recall)
                
                # MRR
                for rank, i in enumerate(top_k_indices, 1):
                    if f"doc_{i}" in relevant_set:
                        rr_list.append(1.0 / rank)
                        break
                else:
                    rr_list.append(0)
                
                # NDCG@K
                relevance = [1 if f"doc_{i}" in relevant_set else 0 for i in top_k_indices]
                ideal_relevance = sorted(relevance, reverse=True)