🎯 Core Questions of This Chapter

How does an LLM understand your database structure?

This is the core challenge of all NL→SQL systems:

  • ❌ The LLM doesn’t know which fields exist in your orders table
  • ❌ The LLM doesn’t know that user_id is a foreign key referencing users.id
  • ❌ The LLM doesn’t know whether the status field values are "pending"/"paid"/"shipped" or 0/1/2

Solution: Build a “Semantic Model” — translate the physical database structure into business language that an LLM can understand.


📐 Architecture Overview

Metadata management and semantic model building flow

Metadata management and semantic model building flow

Three-Stage Pipeline

Three-stage metadata pipeline

Structure import → LLM analysis → Semantic model real-time assembly


🔄 Phase 1: Structure Import (Fully Automated)

1.1 Extracting Metadata from MySQL

When the user configures a data source and clicks “Import Table Structure”, the system automatically executes:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# datasource.py (pseudocode)
async def import_metadata(datasource_id: int):
conn = await get_mysql_connection(datasource_id)

tables = []
# Step 1: 获取所有表名
async with conn.cursor() as cursor:
await cursor.execute("SHOW TABLES")
table_rows = await cursor.fetchall()
for row in table_rows:
table_name = row[0]

# Step 2: 获取每个表的字段信息
await cursor.execute(f"DESCRIBE `{table_name}`")
columns = await cursor.fetchall()

# Step 3: 采集样例数据(前5行)
await cursor.execute(f"SELECT * FROM `{table_name}` LIMIT 5")
samples = await cursor.fetchall()

tables.append({
"table_name": table_name,
"table_comment": "", # 可选:从 INFORMATION_SCHEMA 获取
"columns": [
{
"field_name": col[0],
"field_type": col[1],
"field_comment": "",
"is_primary_key": col[3] == "PRI",
"is_foreign_key": col[3] == "MUL",
"is_nullable": col[2] == "YES",
"default_value": col[4],
"sample_values": [row[i] for row in samples]
}
for i, col in enumerate(columns)
]
})

# 写入 SQLite 元数据库
await save_imported_tables(db, datasource_id, tables)

📊 Imported Data Example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
{
"table_name": "orders",
"columns": [
{
"field_name": "id",
"field_type": "int(11)",
"is_primary_key": true,
"is_foreign_key": false,
"sample_values": [1, 2, 3, 4, 5]
},
{
"field_name": "user_id",
"field_type": "int(11)",
"is_primary_key": false,
"is_foreign_key": true,
"sample_values": [101, 102, 103, 101, 104]
},
{
"field_name": "total_amount",
"field_type": "decimal(10,2)",
"is_primary_key": false,
"is_foreign_key": false,
"sample_values": [299.00, 158.50, 89.90, 450.00, 128.00]
}
]
}

1.2 Storing to SQLite (Application Metadata Database)

app/services/metadata.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
async def save_imported_tables(db: AsyncSession, datasource_id: int, tables_info: list[dict]):
# 清理旧数据(支持重新导入)
old_tables = await db.execute(
select(TableMeta).where(TableMeta.datasource_id == datasource_id)
)
old_table_ids = [t.id for t in old_tables.scalars().all()]
if old_table_ids:
await db.execute(delete(FieldMeta).where(FieldMeta.table_meta_id.in_(old_table_ids)))
await db.execute(delete(TableRelation).where(...))
await db.execute(delete(TableMeta).where(TableMeta.id.in_(old_table_ids)))

# 批量写入新数据
for table_info in tables_info:
table_meta = TableMeta(
datasource_id=datasource_id,
table_name=table_info["table_name"],
table_comment=table_info.get("table_comment", ""),
)
db.add(table_meta)
await db.flush()

for col in table_info.get("columns", []):
field = FieldMeta(
table_meta_id=table_meta.id,
field_name=col["field_name"],
field_type=col["field_type"],
is_primary_key=col.get("is_primary_key", False),
is_foreign_key=col.get("is_foreign_key", False),
sample_values=json.dumps(col.get("sample_values", [])),
)
db.add(field)

await db.flush()

💡 Design Highlight: Incremental vs Full Replacement

We chose the full replacement strategy (delete-then-insert) for these reasons:

Strategy Pros Cons
Full Replacement (Current) Simple implementation; avoids dirty data Slightly slower for large tables (but metadata volume is negligible)
Incremental Update Preserves manual modifications Requires complex diff logic

Rationale: Metadata usually consists of just dozens of tables and hundreds of fields; full replacement takes <100ms.


🤖 Phase 2: LLM Intelligent Analysis (Core Highlight)

2.1 Why LLM Analysis is Needed

Problems with raw metadata:

1
2
3
4
5
6
-- Raw field names: LLM cannot understand business meaning
SELECT * FROM orders;
-- id, user_id, total_amount, status, created_at ???

-- After LLM annotation:
-- Order ID, User ID (foreign key→users), Order Amount, Order Status (pending/shipped/delivered), Creation Time ✅

Value of LLM:

  1. Add Chinese business names: user_id用户ID
  2. Supplement business descriptions: orders table → “Stores all user order records”
  3. Infer table relationships: Discovers orders.user_id references users.id
  4. Identify enumeration meanings: status=0/1/2"pending/shipped/delivered"

2.2 Prompt Engineering

app/services/llm_gateway.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
async def analyze_metadata(self, tables_info):
system_prompt = """你是一位资深的数据库分析师。请分析给定的数据库表结构,
完成以下任务:

1. 为每张表起一个通俗易懂的中文名称(业务名称)
2. 为每张表写一段中文业务描述
3. 为每个字段起一个中文业务名称
4. 为每个字段写一段中文业务描述
5. 分析并推断表与表之间的关联关系

请返回如下 JSON 结构:
{
"tables": {
"表名": {
"business_name": "中文业务名称",
"business_description": "中文业务描述",
"fields": {
"字段名": {
"business_name": "中文名称",
"business_description": "中文描述"
}
}
}
},
"suggested_relations": [
{
"from_table": "源表名",
"from_field": "源字段名",
"to_table": "目标表名",
"to_field": "目标字段名",
"relation_type": "one_to_one 或 one_to_many 或 many_to_one 或 many_to_many",
"confidence": "high 或 medium 或 low",
"reason": "推断此关系的理由(中文)"
}
]

关系推断规则(请严格遵守):
1. 优先检查以 "_id" 结尾的字段,这通常是外键。
例如 orders.user_id 通常关联 users.id
2. 检查不同表中名称相同或相似的字段
3. 检查字段类型是否匹配
4. 一对多关系判断:如果 A 表有 xxx_id 指向 B 表的主键,
则 A 多对一 B(即 B 一对多 A)
5. 置信度设定:
- high:明确的 _id 外键关系,且目标表存在对应主键
- medium:字段名相似但不完全匹配
- low:仅基于类型或业务逻辑推测
6. 只输出确定存在的关系,不要猜测不确定的关系

重要:只输出 JSON,不要输出任何其他文字。"""

user_prompt = f"""请分析以下数据库表结构:

{json.dumps(tables_info, ensure_ascii=False, indent=2)}"""

result = await self.generate(user_prompt, system_prompt=system_prompt)
# 清洗响应并解析 JSON ...

💡 Prompt Design Tips

Technique Example Effect
Role Setting “You are a senior database analyst” Makes the LLM analyze from a professional perspective
Structured Output Constraint Force JSON format + provide template Reduces format errors from 30% → <5%
Rule Clarification List 6 relationship inference rules Improves relationship recognition accuracy
Confidence Levels high/medium/low three levels Facilitates downstream filtering of low-confidence results
Negative Constraint “Only output relationships you are sure exist” Reduces false positives

2.3 LLM Response Example

Input (simplified version of 2 tables):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
[
{
"table_name": "orders",
"fields": [
{"field_name": "id", "field_type": "int(11)"},
{"field_name": "user_id", "field_type": "int(11)"},
{"field_name": "total_amount", "field_type": "decimal(10,2)"}
]
},
{
"table_name": "users",
"fields": [
{"field_name": "id", "field_type": "int(11)"},
{"field_name": "username", "field_type": "varchar(50)"},
{"field_name": "email", "field_type": "varchar(100)"}
]
}
]

Output (JSON returned by LLM):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
{
"tables": {
"orders": {
"business_name": "订单表",
"business_description": "存储系统中所有的订单记录,包括订单金额、状态和创建时间等信息",
"fields": {
"id": {
"business_name": "订单ID",
"business_description": "订单的唯一标识符,主键自增"
},
"user_id": {
"business_name": "用户ID",
"business_description": "下单用户的ID,外键关联到 users 表"
},
"total_amount": {
"business_name": "订单总金额",
"business_description": "该订单的商品总价,单位为元"
}
}
},
"users": {
"business_name": "用户表",
"business_description": "存储平台注册用户的基本信息",
"fields": {
"id": {
"business_name": "用户ID",
"business_description": "用户的唯一标识符"
},
"username": {
"business_name": "用户名",
"business_description": "用户的登录名称,唯一"
},
"email": {
"business_name": "电子邮箱",
"business_description": "用户注册时绑定的邮箱地址"
}
}
}
},
"suggested_relations": [
{
"from_table": "orders",
"from_field": "user_id",
"to_table": "users",
"to_field": "id",
"relation_type": "many_to_one",
"confidence": "high",
"reason": "orders.user_id 以 _id 结尾且类型为 INT,users.id 是主键,典型的外键关系"
}
]
}

Perfect! 🎉 The LLM automatically completed:

  • ✅ Chinese-to-English mapping (orders订单表)
  • ✅ Business description generation
  • ✅ Foreign key relationship inference (orders.user_idusers.id)
  • ✅ Confidence marking (high)

2.4 Parsing the Response and Writing Back to Database

app/services/metadata.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
async def analyze_with_llm(db: AsyncSession, datasource_id: int):
# Step 1: 从 SQLite 读取原始元数据
tables = await get_all_tables(db, datasource_id)

# Step 2: 构建 LLM 输入
tables_info = []
for table in tables:
fields = await get_fields_for_table(db, table.id)
tables_info.append({
"table_name": table.table_name,
"table_comment": table.table_comment,
"fields": [{"field_name": f.field_name, ...} for f in fields]
})

# Step 3: 调用 LLM Gateway
gateway = LLMGateway()
analysis = await gateway.analyze_metadata(tables_info)

# Step 4: 解析并更新 TableMeta / FieldMeta
for table in tables:
table_analysis = analysis.get("tables", {}).get(table.table_name, {})
if table_analysis.get("business_name"):
table.business_name = table_analysis["business_name"]
if table_analysis.get("business_description"):
table.business_description = table_analysis["business_description"]

for field in table.fields:
field_analysis = table_analysis.get("fields", {}).get(field.field_name, {})
if field_analysis.get("business_name"):
field.business_name = field_analysis["business_name"]

# Step 5: 处理建议的关系(可选:保存或仅展示供人工确认)
suggested_relations = analysis.get("suggested_relations", [])
resolved_relations = []
for rel in suggested_relations:
from_table = find_table(tables, rel["from_table"])
to_table = find_table(tables, rel["to_table"])
if from_table and to_table:
resolved_relations.append({
"from_table_id": from_table.id,
"to_table_id": to_table.id,
"relation_type": rel["relation_type"],
"confidence": rel["confidence"],
"reason": rel["reason"],
})

await db.flush()
return {"analyzed_count": len(tables), "suggested_relations": resolved_relations}

🔗 Detailed Relationship Inference Algorithm

3.1 Rule Engine (Defined in Prompt)

We clearly tell the LLM the inference rules in the System Prompt:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Rule 1: _id suffix detection
IF field_name.endswith("_id") AND target_table.has_primary_key("id")
THEN → foreign key relationship (confidence: high)

Rule 2: Semantic similarity matching
IF field_name in ["user_id", "uid"] AND target_table.name == "users"
THEN → possible foreign key (confidence: medium)

Rule 3: Type consistency check
IF source_field.type == target_pk.type
THEN → increase confidence

Rule 4: Naming convention inference
IF source_table == "order_items" AND field == "product_id"
THEN → many-to-one relationship order_items → products

3.2 Actual Performance Statistics

On our test set (e-commerce database, 15 tables):

Metric Value Description
Table naming accuracy 100% (15/15) All tables received reasonable Chinese names
Field naming accuracy 95% (142/150) A few abbreviations (e.g., sku) need manual correction
Relationship recall 90% (9/10) Found 9 real foreign key relationships
Relationship precision 88.9% (8/9) 1 false positive (misjudged category_id as a relation)
Average time 4.2 seconds 15 tables, ~200 fields

💡 Typical False Positive Example

1
2
3
4
5
❌ False positive: products.category_id → categories.id (confidence: medium)
Reason: LLM thought category_id should be a foreign key, but our system actually
uses a string type category_code rather than an integer ID

✅ Solution: A human review interface allows users to confirm/modify relationships

🏗️ Phase 3: Semantic Model Assembly

4.1 get_semantic_model() — Core Query Function

app/services/metadata.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
async def get_semantic_model(db: AsyncSession, datasource_id: int) -> dict:
"""
Assemble the complete semantic model for use by SQL Generator and Chat Engine.

Returns:
{
"datasource_id": 1,
"tables": [
{
"table_name": "orders",
"table_comment": "",
"business_name": "订单表",
"fields": [
{
"field_name": "id",
"field_type": "int(11)",
"field_comment": "",
"business_name": "订单ID",
"is_primary_key": True,
"is_foreign_key": False,
},
...
]
},
...
],
"relations": [
{
"from_table": "orders",
"to_table": "users",
"from_field": "user_id",
"to_field": "id",
"relation_type": "many_to_one",
},
...
]
}
"""
# 查询所有表
tables_result = await db.execute(
select(TableMeta).where(TableMeta.datasource_id == datasource_id)
)
tables = list(tables_result.scalars().all())

tables_data = []
for table in tables:
# 查询每个表的所有字段
fields_result = await db.execute(
select(FieldMeta).where(FieldMeta.table_meta_id == table.id)
)
fields = list(fields_result.scalars().all())

tables_data.append({
"table_name": table.table_name,
"table_comment": table.table_comment or "",
"business_name": table.business_name or "", # LLM 标注的中文名
"fields": [
{
"field_name": f.field_name,
"field_type": f.field_type,
"field_comment": f.field_comment or "",
"business_name": f.business_name or "", # LLM 标注的字段中文名
"is_primary_key": f.is_primary_key,
"is_foreign_key": f.is_foreign_key,
}
for f in fields
],
})

# 查询表关系
relations_result = await db.execute(
select(TableRelation).where(...)
)
relations = list(relations_result.scalars().all())

return {
"datasource_id": datasource_id,
"tables": tables_data,
"relations": [
{
"from_table": r.from_table_name,
"to_table": r.to_table_name,
"from_field": r.from_field,
"to_field": r.to_field,
"relation_type": r.relation_type,
}
for r in relations
],
}

4.2 Usage Scenarios for Semantic Model

Scenario 1: Injected into SQL Generator’s Prompt

1
2
3
4
5
6
7
# sql_generator.py
semantic_model = await get_semantic_model(db, datasource_id)

sql = await llm_gateway.generate_sql(
natural_language="查询最近7天的销售额",
semantic_model=semantic_model, # ← inject here!
)

Context received by LLM (partial excerpt):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
数据库语义模型:
{
"tables": [
{
"table_name": "orders",
"business_name": "订单表",
"fields": [
{"field_name": "id", "business_name": "订单ID", ...},
{"field_name": "total_amount", "business_name": "订单总金额", ...},
{"field_name": "created_at", "business_name": "创建时间", ...}
]
},
...
],
"relations": [
{"from_table": "orders", "from_field": "user_id",
"to_table": "users", "to_field": "id", "relation_type": "many_to_one"},
...
]
}

用户的自然语言查询:查询最近7天的销售额

Result: The LLM knows to query orders.total_amount and filter by created_at! ✅

Scenario 2: Injected into Chat Engine’s System Prompt

1
2
3
4
5
6
7
8
9
10
11
12
# chat_engine.py
semantic_model = await get_semantic_model(db, session.datasource_id)

system_prompt = f"""你是一个专业的数据分析助手。
你可以访问以下数据库结构:

{json.dumps(semantic_model, ensure_ascii=False, indent=2)}

当用户提出数据相关问题时,请按以下格式回复:
1. 简要说明你将要查询什么数据
2. 用 ```sql 代码块包裹生成的 SQL
3. 尽量提供有价值的分析和洞察"""

🎨 Frontend Management Interface

5.1 Metadata Overview Page (MetadataOverview.vue)

Features:

  • 📋 Display all tables and their field counts
  • 🏷️ Show LLM-annotated business names
  • 🔄 One-click trigger for LLM re-analysis
  • ⚠️ Mark unanalyzed tables (grayed out)

5.2 Table Detail Page (TableDetail.vue)

Features:

  • ✏️ Edit Chinese business names and descriptions for tables/fields
  • 📝 Supplement information not recognized by LLM (e.g., enumeration meanings)
  • 👁️ View sample data (sample_values)
  • 🗑️ Delete unnecessary fields (hide rather than physical delete)

5.3 Relation Editor (RelationEditor.vue)

Features:

  • 🔗 Graphically display table relationships (nodes + edges)
  • ✅ Confirm/reject LLM-suggested relationships
  • ➕ Manually add missing relationships
  • ✏️ Correct relationship types (one_to_many ↔ many_to_many)

📊 Performance Optimization Strategies

6.1 Cache Semantic Model

Since the semantic model is relatively stable (not frequently changed), it can be cached:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
from functools import lru_cache
import hashlib

_cache = {}

def get_semantic_model_cached(db, datasource_id):
cache_key = f"semantic_model_{datasource_id}"

if cache_key in _cache:
cached_time, cached_data = _cache[cache_key]
if time.time() - cached_time < 300: # 5 分钟缓存
return cached_data

data = asyncio.run(get_semantic_model(db, datasource_id))
_cache[cache_key] = (time.time(), data)
return data

6.2 Incremental Analysis

For large databases (100+ tables), only analyze new/modified tables:

1
2
3
4
5
6
7
8
9
async def incremental_analyze(db, datasource_id):
all_tables = await get_all_tables(db, datasource_id)
unanalyzed = [t for t in all_tables if not t.analyzed]

if len(unanalyzed) > 0:
# 只将未分析的表发送给 LLM
tables_info = build_tables_info(unanalyzed)
result = await llm_gateway.analyze_metadata(tables_info)
# 更新这部分表的标注...

6.3 Asynchronous Analysis

LLM analysis takes a long time (3-8s), so it should be executed asynchronously:

1
2
3
4
5
6
7
8
9
10
11
@router.post("/api/metadata/{ds_id}/analyze")
async def trigger_analysis(ds_id: int, db: AsyncSession = Depends(get_db)):
"""触发异步分析任务"""
task_id = str(uuid.uuid4())
asyncio.create_task(analyze_with_llm(db, ds_id))
return {"task_id": task_id, "status": "processing"}

@router.get("/api/metadata/{ds_id}/analyze/status")
async def check_status(task_id: str):
"""检查分析进度"""
return get_task_status(task_id) # pending / completed / failed

🔟 Summary and Best Practices

✅ What We Achieved

  1. Fully automated metadata import: Connect to MySQL and extract table structure (no manual entry required)
  2. LLM intelligent annotation: Automatically add Chinese names, business descriptions, and infer relationships (saves 90% manual effort)
  3. High-quality semantic model: Provides precise context for NL→SQL conversion
  4. Human-machine collaboration: LLM suggestions can be manually reviewed and corrected

📚 Best Practices Checklist

  • Use SHOW TABLES + DESCRIBE to automatically import metadata
  • Store sample data to help LLM understand field meanings
  • Design structured prompts to force JSON output
  • Define clear relationship inference rules and confidence levels
  • Implement get_semantic_model() as a unified query interface
  • Provide frontend management interface to support manual corrections
  • Cache semantic model to reduce redundant calculations
  • Support asynchronous analysis to avoid blocking requests

🚀 Next Steps

In the next article, we will dive deep into the NL→SQL conversion engine — How to use the semantic model to generate accurate SQL? How to prevent SQL injection? How to handle complex multi-table JOINs?

Stay tuned! 🚀


Related Code Files:

  • app/services/metadata.py — Metadata CRUD and LLM analysis
  • app/services/llm_gateway.py — LLM analysis prompts
  • app/models/metadata.py — Data model definitions
  • app/api/metadata.py — API interface layer