asyncdefsave_imported_tables(db: AsyncSession, datasource_id: int, tables_info: list[dict]): # 清理旧数据(支持重新导入) old_tables = await db.execute( select(TableMeta).where(TableMeta.datasource_id == datasource_id) ) old_table_ids = [t.idfor t in old_tables.scalars().all()] if old_table_ids: await db.execute(delete(FieldMeta).where(FieldMeta.table_meta_id.in_(old_table_ids))) await db.execute(delete(TableRelation).where(...)) await db.execute(delete(TableMeta).where(TableMeta.id.in_(old_table_ids)))
# 批量写入新数据 for table_info in tables_info: table_meta = TableMeta( datasource_id=datasource_id, table_name=table_info["table_name"], table_comment=table_info.get("table_comment", ""), ) db.add(table_meta) await db.flush()
for col in table_info.get("columns", []): field = FieldMeta( table_meta_id=table_meta.id, field_name=col["field_name"], field_type=col["field_type"], is_primary_key=col.get("is_primary_key", False), is_foreign_key=col.get("is_foreign_key", False), sample_values=json.dumps(col.get("sample_values", [])), ) db.add(field)
await db.flush()
💡 Design Highlight: Incremental vs Full Replacement
We chose the full replacement strategy (delete-then-insert) for these reasons:
Strategy
Pros
Cons
Full Replacement (Current)
Simple implementation; avoids dirty data
Slightly slower for large tables (but metadata volume is negligible)
Incremental Update
Preserves manual modifications
Requires complex diff logic
Rationale: Metadata usually consists of just dozens of tables and hundreds of fields; full replacement takes <100ms.
# Step 4: 解析并更新 TableMeta / FieldMeta for table in tables: table_analysis = analysis.get("tables", {}).get(table.table_name, {}) if table_analysis.get("business_name"): table.business_name = table_analysis["business_name"] if table_analysis.get("business_description"): table.business_description = table_analysis["business_description"]
for field in table.fields: field_analysis = table_analysis.get("fields", {}).get(field.field_name, {}) if field_analysis.get("business_name"): field.business_name = field_analysis["business_name"]
# Step 5: 处理建议的关系(可选:保存或仅展示供人工确认) suggested_relations = analysis.get("suggested_relations", []) resolved_relations = [] for rel in suggested_relations: from_table = find_table(tables, rel["from_table"]) to_table = find_table(tables, rel["to_table"]) if from_table and to_table: resolved_relations.append({ "from_table_id": from_table.id, "to_table_id": to_table.id, "relation_type": rel["relation_type"], "confidence": rel["confidence"], "reason": rel["reason"], })
We clearly tell the LLM the inference rules in the System Prompt:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Rule 1: _id suffix detection IF field_name.endswith("_id") AND target_table.has_primary_key("id") THEN → foreign key relationship (confidence: high)
Rule 2: Semantic similarity matching IF field_name in ["user_id", "uid"] AND target_table.name == "users" THEN → possible foreign key (confidence: medium)
Rule 3: Type consistency check IF source_field.type == target_pk.type THEN → increase confidence
Rule 4: Naming convention inference IF source_table == "order_items" AND field == "product_id" THEN → many-to-one relationship order_items → products
3.2 Actual Performance Statistics
On our test set (e-commerce database, 15 tables):
Metric
Value
Description
Table naming accuracy
100% (15/15)
All tables received reasonable Chinese names
Field naming accuracy
95% (142/150)
A few abbreviations (e.g., sku) need manual correction
Relationship recall
90% (9/10)
Found 9 real foreign key relationships
Relationship precision
88.9% (8/9)
1 false positive (misjudged category_id as a relation)
Average time
4.2 seconds
15 tables, ~200 fields
💡 Typical False Positive Example
1 2 3 4 5
❌ False positive: products.category_id → categories.id (confidence: medium) Reason: LLM thought category_id should be a foreign key, but our system actually uses a string type category_code rather than an integer ID
✅ Solution: A human review interface allows users to confirm/modify relationships
asyncdefget_semantic_model(db: AsyncSession, datasource_id: int) -> dict: """ Assemble the complete semantic model for use by SQL Generator and Chat Engine.
Fully automated metadata import: Connect to MySQL and extract table structure (no manual entry required)
LLM intelligent annotation: Automatically add Chinese names, business descriptions, and infer relationships (saves 90% manual effort)
High-quality semantic model: Provides precise context for NL→SQL conversion
Human-machine collaboration: LLM suggestions can be manually reviewed and corrected
📚 Best Practices Checklist
Use SHOW TABLES + DESCRIBE to automatically import metadata
Store sample data to help LLM understand field meanings
Design structured prompts to force JSON output
Define clear relationship inference rules and confidence levels
Implement get_semantic_model() as a unified query interface
Provide frontend management interface to support manual corrections
Cache semantic model to reduce redundant calculations
Support asynchronous analysis to avoid blocking requests
🚀 Next Steps
In the next article, we will dive deep into the NL→SQL conversion engine — How to use the semantic model to generate accurate SQL? How to prevent SQL injection? How to handle complex multi-table JOINs?
Stay tuned! 🚀
Related Code Files:
app/services/metadata.py — Metadata CRUD and LLM analysis