✅ Tells the LLM that categories table corresponds to the Chinese name “分类表”
✅ Tells the LLM that order_items.order_id relates to orders.id
✅ Tells the LLM that total_amount is of decimal type (can be SUMmed)
1.3 Optional: Formula Library (Formulas)
Predefined SQL templates for complex calculations:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
formulas = [ { "code": "yoy_growth", "name": "同比增长率", "description": "Calculate the growth rate compared to the same period last year", "template": "SELECT ... DATE_SUB(CURDATE(), INTERVAL 1 YEAR) ..." }, { "code": "top_n_ranking", "name": "TOP N 排名", "description": "Sort by a metric and take the top N rows", "template": "SELECT ... ORDER BY {metric} DESC LIMIT {n}" } ]
asyncdefgenerate_query( natural_language: str, semantic_model: dict, formulas: Optional[list[dict]] = None, ) -> str: """ Core Entry: converts natural language to SQL
Args: natural_language: User input natural language semantic_model: Database semantic model (from Metadata Service) formulas: Available formula list (optional)
Returns: Cleaned SQL string """ from app.services.llm_gateway import LLMGateway
gateway = LLMGateway()
# Assemble context context = semantic_model.copy() if formulas: context["available_formulas"] = [ { "code": f["code"], "name": f["name"], "description": f["description"] } for f in formulas ]
asyncdefgenerate_sql(self, natural_language: str, semantic_model: dict) -> str: # System Prompt: Define role and constraints system_prompt = """You are a professional SQL generation expert. Based on the user's natural language query and the database semantic model, generate a valid MySQL SQL query statement.
Rules: 1. Output only the SQL statement, do not output any explanatory text. 2. Use the correct table names and field names from the semantic model. 3. When a join query is needed, use JOIN based on the table relationships. 4. Add WHERE, GROUP BY, ORDER BY and other clauses as needed. 5. If the user does not specify an order, default to descending order by the primary key or time field."""
# User Prompt: Inject semantic model and user query user_prompt = f"""Database Semantic Model: {json.dumps(semantic_model, ensure_ascii=False, indent=2)}
User's Natural Language Query: {natural_language}
Please generate SQL:"""
result = awaitself.generate(user_prompt, system_prompt=system_prompt)
# Response cleaning (done inside LLM Gateway) result = result.strip()
# Remove DeepSeek's <think> tags think_match = re.search(r'</think\s*>', result, re.DOTALL) if think_match: result = result[think_match.end():].strip()
# Strip code block markers if result.startswith("```sql"): result = result[6:] if result.startswith("```"): result = result[3:] if result.endswith("```"): result = result[:-3]
return result.strip()
3.2 Prompt Design Strategy Analysis
Strategy
Example
Effect
Role Setting
“You are a professional SQL generation expert”
Makes LLM generate with a rigorous attitude
Negative Constraint
“Output only SQL, no explanation”
Reduces noise (from 80% → <10%)
Positive Guidance
“Use correct table and field names from the semantic model”
Improves field name accuracy (90%+)
JOIN Hint
“When needed, use JOIN based on table relationships”
Automatically utilizes relations information
Fallback Strategy
“Default to descending order by primary key or time”
Avoids unordered results
3.3 Real Case: Complex Multi-Table Query
User Input:
1
"Query the top 5 product categories by sales last month and their percentage"
[System] You are a professional SQL generation expert. Based on the user's natural language query and the database semantic model, generate a valid MySQL SQL query statement. Rules: 1. Output only the SQL statement, do not output any explanatory text. 2. Use the correct table names and field names from the semantic model. 3. When a join query is needed, use JOIN based on the table relationships. ...
<think> User wants to query the top 5 product categories by sales last month and their percentage.
Needs: 1. Time range: last month (2026-04-01 ~ 2026-04-30) 2. Tables involved: order_items, products, categories 3. Calculation logic: SUM(amount) grouped by category, then calculate percentage 4. Sorting: descending sales, take TOP 5
SQL structure: - FROM order_items - JOIN products ON product_id - JOIN categories ON category_id - WHERE created_at in last month range - GROUP BY category_id, c.name - ORDER BY total_sales DESC LIMIT 5 </think>
```sql SELECT c.name AS category_name, SUM(oi.amount) AS total_sales, ROUND(SUM(oi.amount) / (SELECT SUM(amount) FROM order_items oi2 JOIN orders o2 ON oi2.order_id = o2.id WHERE o2.created_at >= '2026-04-01' AND o2.created_at < '2026-05-01') * 100, 2) AS percentage FROM order_items oi JOIN products p ON oi.product_id = p.id JOIN categories c ON p.category_id = c.id JOIN orders o ON oi.order_id = o.id WHERE o.created_at >= '2026-04-01' AND o.created_at < '2026-05-01' GROUP BY c.id, c.name ORDER BY total_sales DESC LIMIT 5;
```sql SELECT c.name AS category_name, SUM(oi.amount) AS total_sales, ROUND(SUM(oi.amount) / (SELECT SUM(amount) FROM order_items oi2 JOIN orders o2 ON oi2.order_id = o2.id WHERE o2.created_at >= '2026-04-01' AND o2.created_at < '2026-05-01') * 100, 2) AS percentage FROM order_items oi JOIN products p ON oi.product_id = p.id JOIN categories c ON p.category_id = c.id JOIN orders o ON oi.order_id = o.id WHERE o.created_at >= '2026-04-01' AND o.created_at < '2026-05-01' GROUP BY c.id, c.name ORDER BY total_sales DESC LIMIT 5;
Perfect! 🎉 It includes:
✅ 4-table JOIN (correctly used the relations information)
✅ Time filter (understood “last month”)
✅ Grouped aggregation (GROUP BY + SUM)
✅ Percentage calculation (subquery)
✅ TOP N filter (LIMIT 5)
🧹 IV. Step 3: Response Cleaning (Already Implemented in LLM Gateway)
asyncdefexecute_query(sql: str, datasource_id: int) -> list[dict]: """ Asynchronously execute SQL and return a list of dictionaries
Args: sql: Validated SQL statement datasource_id: Target data source ID
Returns: Query result list [{column: value}, ...]
Raises: ValueError: Data source not found Exception: SQL execution error """ import aiomysql from app.database import async_session
# Step 1: Get data source connection info from SQLite asyncwith async_session() as session: ds = await get_datasource(session, datasource_id) ifnot ds: raise ValueError(f"DataSource {datasource_id} not found")
# 3. Store back to database (no LLM call at runtime) query_config["sql"] = sql widget.query_config = json.dumps(query_config, ensure_ascii=False) await db.flush()
return {"widget_id": widget_id, "sql": sql}
📊 VIII. Common Problems and Solutions
Q1: LLM Generated SQL Has Syntax Errors?
Solution A: Auto-fix (Recommended)
1 2 3 4 5 6 7 8 9
import sqlparse
defauto_fix_sql(sql: str) -> str: """Attempt to auto-fix common syntax errors""" try: parsed = sqlparse.parse(sql)[0] returnstr(parsed).strip() except: return sql # Return as is if cannot fix
for i inrange(max_retries): if validate_sql(sql): return sql
# Feed error info back to LLM for correction error_msg = "Generated SQL validation failed, please fix" sql = await generate_query(f"{nl}\n\nPrevious attempt: {sql}\nError: {error_msg}", model)
raise ValueError(f"Failed to generate valid SQL after {max_retries} retries")
Q2: How to Handle Performance Issues with Very Large Tables?
Solution: Add LIMIT Protection
1 2 3 4 5 6 7
defadd_safety_limit(sql: str, max_rows: int = 10000) -> str: """If SQL has no LIMIT, add one automatically""" sql_upper = sql.upper().rstrip(';').rstrip()
Use aiomysql.DictCursor to return dictionary format
Log security-related WARNING messages
Consider self-correction mechanism (retry with feedback)
Add LIMIT protection to prevent large data queries (planned extension, see “Solutions” section)
🚀 Next Steps
In the next article, we will dive into Intelligent Dialogue Engine in Practice — How to manage multi-turn conversation context? How to automatically recommend chart types based on SQL features? How to generate data analysis reports with one click?