LLM Unified Gateway Design — LiteLLM Abstraction, Prompt Engineering, and Response Cleaning

🎯 Core Chapter Questions

In AI application development, how to elegantly call LLM APIs? Hard-coding with requests.post()? Or writing a set of calling logic for each business module?

Specific challenges we face:

❌ Model lock-in risk: Use DeepSeek today, want to switch to GPT-4 tomorrow, need to modify dozens of code locations
❌ Unstable output: LLMs occasionally return code wrapped in ````sql, ` tags, or extra explanatory text
❌ Token limit crashes: Semantic model too large triggers context_length_exceeded error
❌ Difficult debugging: Hard to know what Prompt the LLM received, what it returned, and how long it took

Solution: Build a unified LLM Gateway to converge all LLM interactions into one module.

🏗️ Architecture Overview

LLM Unified Gateway Architecture

LLM Unified Gateway Architecture

Design Principles

LLM Gateway Design Principles

Single Responsibility, Open/Closed Principle, Observability & Defensive Programming

🔌 1. LiteLLM Unified Abstraction Layer

Why not use OpenAI SDK directly?

Comparison Dimension	OpenAI SDK	LiteLLM
Multi-model support	❌ OpenAI only	✅ DeepSeek/OpenAI/Claude/Gemini, 100+
Unified interface	❌ Different parameters per provider	✅ Unified `acompletion()` interface
Error standardization	❌ Different error codes per provider	✅ Unified exception types
Retry mechanism	❌ Need to implement yourself	✅ Built-in exponential backoff retry
Cost tracking	❌ None	✅ Built-in token usage statistics

Core Implementation

app/services/llm_gateway.py

import litellm

class LLMGateway:
    def __init__(self):
        self.provider = settings.LLM_PROVIDER    # "deepseek"
        self.model = settings.LLM_MODEL          # "deepseek-chat"
        self.api_key = settings.LLM_API_KEY
        self.api_base = settings.LLM_API_BASE    # "https://api.deepseek.com"

    def _get_model_name(self, model=None):
        """Format model name to fit LiteLLM"""
        if model:
            return model
        if "deepseek" in self.model.lower():
            return f"deepseek/{self.model}"   # → "deepseek/deepseek-chat"
        return f"openai/{self.model}"         # → "openai/gpt-4o"

💡 Key Design: Model Name Formatting

LiteLLM requires provider/model format:

# DeepSeek format
"deepseek/deepseek-chat"       # Corresponds to DeepSeek-V3
"deepseek/deepseek-coder"      # Corresponds to DeepSeek-Coder

# OpenAI format
"openai/gpt-4o"
"openai/gpt-4-turbo"

# Custom endpoint (private deployments compatible with OpenAI API)
"openai/my-custom-model"       # Used with api_base

Cost of switching model = modifying 1 line in .env config 🎉

📝 2. Prompt Engineering Practices

2.1 System Prompt Template Library

Our system has 4 different LLM calling scenarios, each requiring a carefully designed System Prompt:

Scenario 1: SQL Generation (most complex)

app/services/llm_gateway.py

async def generate_sql(self, natural_language, semantic_model):
    system_prompt = """You are a professional SQL generation expert. Based on the user's natural language query and the database semantic model,
generate valid MySQL SQL query statements.

Rules:
1. Output only the SQL statement, do not output any explanatory text.
2. Use the correct table names and field names from the semantic model.
3. When related queries are needed, use JOIN based on table relationships.
4. Add WHERE, GROUP BY, ORDER BY clauses as needed.
5. If the user does not explicitly specify sorting, default to descending order by primary key or time field."""

    user_prompt = f"""Database Semantic Model:
{json.dumps(semantic_model, ensure_ascii=False, indent=2)}

User's natural language query: {natural_language}

Please generate SQL:"""

Design Points:

✅ Constraint instructions: “Output only SQL”, “No explanation” → reduce noise
✅ Context injection: Pass the complete semantic model as context
✅ Clear rules: Default strategies for sorting, JOIN, etc.

Scenario 2: Metadata Analysis (Structured Output)

app/services/llm_gateway.py

async def analyze_metadata(self, tables_info):
    system_prompt = """You are an experienced database analyst. Please analyze the given database table structures,
and complete the following tasks:
1. Give each table an easy-to-understand Chinese name (business name)
2. Write a Chinese business description for each table
3. Give each field a Chinese business name
...

Return the following JSON structure:
{
  "tables": {
    "table_name": {
      "business_name": "Chinese business name",
      ...
    }
  },
  "suggested_relations": [
    {
      "from_table": "source table name",
      "relation_type": "one_to_one or one_to_many",
      "confidence": "high or medium or low",
      ...
    }
  ]
}

Important: Output only JSON, do not output any other text."""

Design Points:

✅ Forced JSON output: Easy for program parsing
✅ Confidence field: Mark the confidence of relationship inference (high/medium/low)
✅ Example-driven: Guide output format through complete JSON template

2.2 Prompt Injection Techniques

Variable Templating

# ❌ Bad practice: String concatenation error-prone
prompt = "Database has these tables:" + str(tables) + "User asks:" + question

# ✅ Good practice: f-string + json.dumps ensures formatting
user_prompt = f"""Database Semantic Model:
{json.dumps(semantic_model, ensure_ascii=False, indent=2)}  # Indented for readability

User's natural language query: {natural_language}

Please generate SQL:"""

Few-shot Examples (Optional Enhancement)

For complex tasks, append examples to the User Prompt:

user_prompt += """

Example:
User input: "Query the top 10 products by sales amount"
Output:
SELECT p.name, SUM(oi.amount) as total_sales
FROM products p
JOIN order_items oi ON p.id = oi.product_id
GROUP BY p.id
ORDER BY total_sales DESC
LIMIT 10;

Now generate SQL according to the above rules:"""

🔒 3. Length Control Mechanism

Problem Background

When a database has many tables (50+) and many fields (200+), the semantic model JSON may exceed 80K characters, directly causing LLM API errors:

1	`Error: This model's maximum context length is 65536 tokens...`

Solution: Intelligent Truncation

app/services/llm_gateway.py

async def generate(self, prompt, system_prompt=None, model=None):
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})

    # ⚡ Key: Length check and truncation
    total_chars = sum(len(m["content"]) for m in messages)
    if total_chars > 80000:
        logger.warning(f"[LLM Request] prompt too long ({total_chars} chars), truncating")
        for m in messages:
            if len(m["content"]) > 60000:
                m["content"] = m["content"][:60000] + "\n... (content truncated due to length)"

    # Call LLM
    response = await litellm.acompletion(
        model=self._get_model_name(model),
        messages=messages,
        api_key=self.api_key,
        api_base=self.api_base,
    )
    return response.choices[0].message.content

Truncation Strategy Explanation

Parameter	Threshold	Description
Total character limit	80,000	~20K tokens (conservative estimate)
Single message limit	60,000	Prevents one message from occupying all space
Truncation position	60,000 characters	Keep the first part (usually contains most important info)
Truncation hint	Append “… (content truncated due to length)”	Let LLM know data is incomplete

💡 Why truncate instead of compress?

Approach	Pros	Cons
Truncation (current)	Simple to implement; preserves original data format	May lose important trailing info
Summarization (LLM first summarize)	Retains key information	Adds an extra LLM call; doubles latency
Chunk + multiple calls	No data loss	Complex; needs result merging

Our choice: For most scenarios (<30 tables), truncation won’t be triggered; even if triggered, the first 60K chars usually cover the core table structure.

🧹 4. Response Cleaner

Problem Background

Different LLMs have vastly different output styles:

LLM	Typical Output Example
DeepSeek-V3	`\n\n<think>\nUser wants to query...\n</think>\n\n\```sql\nSELECT …\n```\n\nOf course, you can also…`
GPT-4o	`Here's the SQL query:\n\```sql\nSELECT …\n```\n\nThis query will…`
Claude	`Certainly! Here's the query:\n\```\nSELECT …\n```\n\nLet me explain…`

Common feature: All contain noise (thought process, code block markers, explanatory text)

Cleaning Process

app/services/llm_gateway.py (generate_sql method)

result = await self.generate(user_prompt, system_prompt=system_prompt)
result = result.strip()

# Step 1: Remove <think> tags (DeepSeek specific)
think_match = re.search(r'</think\s*>', result, re.DOTALL)
if think_match:
    result = result[think_match.end():].strip()  # Keep content after </think>

# Step 2: Remove ```sql opening marker
if result.startswith("```sql"):
    result = result[6:]

# Step 3: Remove ``` opening marker (generic code block)
if result.startswith("```"):
    result = result[3:]

# Step 4: Remove ``` closing marker
if result.endswith("```"):
    result = result[:-3]

return result.strip()

Additional Cleaning for JSON Output

app/services/llm_gateway.py (analyze_metadata method)

# Extract JSON (may be wrapped in ```json)
json_match = re.search(r'\{[\s\S]*\}', result)
if json_match:
    result = json_match.group(0)

# Strip code block markers
if result.startswith("```json"):
    result = result[7:]
if result.startswith("```"):
    result = result[3:]
if result.endswith("```"):
    result = result[:-3]

# Try parsing and validation
try:
    parsed = json.loads(result.strip())
    logger.info(f"[analyze_metadata] parsed successfully, tables={len(parsed.get('tables', {}))}")
    return parsed
except json.JSONDecodeError as e:
    logger.error(f"[analyze_metadata] JSON parse failed: {e}")
    return {}  # Return empty dict instead of crashing, ensuring system stability

📊 5. Observability: Logging System

Logging Strategy

app/services/llm_gateway.py

logger = logging.getLogger("llm_gateway")

async def generate(self, prompt, system_prompt=None, ...):
    # Log request info
    logger.info(f"[LLM Request] model={kwargs['model']}, "
                f"api_base={self.api_base}, "
                f"total_chars={total_chars}")
    logger.info(f"[LLM Request] system_prompt (first 500):\n{(system_prompt or '')[:500]}")
    logger.info(f"[LLM Request] user_prompt (first 1000 chars):\n{prompt[:1000]}")

    start_time = time.time()
    try:
        response = await litellm.acompletion(**kwargs)
        elapsed = time.time() - start_time
        content = response.choices[0].message.content

        # Log response info
        logger.info(f"[LLM Response] elapsed={elapsed:.2f}s, length={len(content)}")
        logger.info(f"[LLM Response Content] (first 2000 chars):\n{content[:2000]}")
        return content
    except Exception as e:
        elapsed = time.time() - start_time
        logger.error(f"[LLM Error] elapsed={elapsed:.2f}s, "
                     f"error={type(e).__name__}: {e}")
        raise

Sample Log Output

INFO     [LLM Request] model=deepseek/deepseek-chat, api_base=https://api.deepseek.com, total_chars=45230
INFO     [LLM Request] system_prompt (first 500):
You are a professional SQL generation expert. Based on the user's natural language query and the database semantic model,
generate valid MySQL SQL query statements.
Rules:
1. Output only the SQL statement, do not output any explanatory text.
...
INFO     [LLM Request] user_prompt (first 1000 chars):
Database Semantic Model:
{
  "tables": [
    {
      "table_name": "orders",
      "fields": [...]
    }
  ]
}
User's natural language query: Query the total order amount for the last 7 days
...
INFO     [LLM Response] elapsed=2.34s, length=187
INFO     [LLM Response Content] (first 2000 chars):
```sql
SELECT DATE(created_at) as date, SUM(total_amount) as total
FROM orders
WHERE created_at >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY DATE(created_at)
ORDER BY date DESC;


#### 💡 Value of Logs

| Scenario | How to Use Logs |
|------|-------------|
| **Debug Prompt quality** | Check `user_prompt` to confirm semantic model is correctly injected |
| **Performance optimization** | Use `elapsed` to locate slow requests (usually >5s needs optimization) |
| **Cost control** | Count `total_chars` to estimate token usage |
| **Error troubleshooting** | View complete `[LLM Error]` stack trace |

---

## 🔧 6. Unified Interface Design

### Methods Exposed by Gateway

| Method Name | Caller | Input | Output | Purpose |
|--------|--------|------|------|------|
| `generate(prompt, system_prompt)` | Chat Engine | text | text | General conversation |
| `chat_completion(messages)` | Chat Engine | message array | text | Multi-turn dialog |
| `generate_sql(nl, semantic_model)` | SQL Generator | NL + model | SQL string | NL→SQL |
| `analyze_metadata(tables_info)` | Metadata Service | table structure list | JSON | Metadata analysis |
| `generate_dashboard_config(desc, model)` | Dashboard Engine | description + model | JSON array | Dashboard configuration |
| `generate_report_script(data, metrics)` | Report Generator | data + metrics | Markdown | Report generation |

### Call Examples

#### Example 1: SQL Generator Call

```python
# sql_generator.py
from app.services.llm_gateway import LLMGateway

async def generate_query(natural_language, semantic_model):
    gateway = LLMGateway()
    sql = await gateway.generate_sql(natural_language, semantic_model)
    return sql

generate_sql internal call chain

generate_sql → generate → litellm.acompletion → cleaned SQL

Example 2: Chat Engine Call

# chat_engine.py
gateway = LLMGateway()
assistant_content = await gateway.chat_completion([
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": conversation_text},
])

⚠️ 7. Exception Handling Strategy

Exception Classification and Handling

try:
    response = await litellm.acompletion(**kwargs)
except litellm.exceptions.AuthenticationError as e:
    # API Key invalid
    logger.error(f"[LLM] Authentication failed: {e}")
    raise ValueError("LLM API Key is invalid, please check configuration")
except litellm.exceptions.RateLimitError as e:
    # Request rate exceeded
    logger.warning(f"[LLM] Rate limited: {e}")
    raise Exception("LLM requests too frequent, please try again later")
except litellm.exceptions.ContextWindowExceededError as e:
    # Context too long (though we have length control, as a safety measure)
    logger.error(f"[LLM] Context window exceeded: {e}")
    raise Exception("Input content too long, please simplify query conditions")
except Exception as e:
    # Other unknown errors
    logger.error(f"[LLM] Unknown error: {type(e).__name__}: {e}")
    raise Exception(f"LLM service exception: {str(e)}")

💡 Defensive Programming Principles

Never trust LLM output: Always clean and validate
Graceful degradation: Return empty values instead of crashing (e.g., JSON parse failure returns {})
Detailed logging: Record enough information for post-mortem analysis
Timeout control: Although LiteLLM has built-in timeout, recommend setting a timeout at the business layer as well

📈 8. Performance Optimization Tips

8.1 Concurrent Calls

When a dashboard has multiple widgets to parse, LLM calls can be made concurrently:

import asyncio

async def resolve_dashboard_queries(db, dashboard_id):
    tasks = []
    for widget in dashboard.widgets:
        task = asyncio.create_task(
            resolve_nl_to_sql(db, widget.data_source_id, widget.nl_query)
        )
        tasks.append(task)

    results = await asyncio.gather(*tasks, return_exceptions=True)
    # 10 widgets concurrently: 10s → ~3s (assuming each takes 2-3s)

8.2 Cache Semantic Model

The semantic model (table structure + relationships) is relatively stable and can be cached to avoid repeated transmission:

from functools import lru_cache
import hashlib

@lru_cache(maxsize=10)
def get_semantic_model_cache(datasource_id: int, hash_value: str):
    """Cache semantic model (based on datasource_id + content hash)"""
    return _build_semantic_model(datasource_id)

def cache_key(datasource_id, model_dict):
    return (datasource_id, hashlib.md5(json.dumps(model_dict).encode()).hexdigest())

8.3 Streaming Output (Future Optimization)

Currently using synchronous waiting mode, future can switch to streaming output for better UX:

# Pseudocode: streaming response
async def stream_generate(prompt):
    async for chunk in litellm.acompletion_stream(...):
        yield chunk.choices[0].delta.content  # Return token by token

🎯 9. Practical Case Demonstration

Case: From Natural Language to Executable SQL

Input:

1	`User query: "Query the sales ranking of each category last month"`

Step 1 - Build Prompt:

{
  "system": "You are a professional SQL generation expert...",
  "user": "Database Semantic Model: {...45000-character JSON...}\n\nUser query:..."
}

Step 2 - Length Check:

1	`✅ Total characters: 45,230 (< 80,000) → No truncation needed`

Step 3 - Call LLM:

1
2
3

⏱️ Time: 2.34s
📍 Model: deepseek/deepseek-chat
💰 Cost: ~¥0.023 (based on 2K tokens)

Step 4 - Raw Response:

\n\n<think>
The user wants to query "sales ranking of each category last month", need to:
1. Filter time range (last month)
2. Group by category
3. Calculate total sales
4. Sort by total sales descending
Involved tables: orders, order_items, products, categories
...</think>

```sql
SELECT c.name AS category_name,
       SUM(oi.amount) AS total_sales
FROM categories c
JOIN products p ON c.id = p.category_id
JOIN order_items oi ON p.id = oi.product_id
JOIN orders o ON o.id = oi.order_id
WHERE o.created_at >= '2026-04-01'
  AND o.created_at < '2026-05-01'
GROUP BY c.id, c.name
ORDER BY total_sales DESC;


**Step 5 - Cleaned Result**:
```sql
SELECT c.name AS category_name,
       SUM(oi.amount) AS total_sales
FROM categories c
JOIN products p ON c.id = p.category_id
JOIN order_items oi ON p.id = oi.product_id
JOIN orders o ON o.id = oi.order_id
WHERE o.created_at >= '2026-04-01'
  AND o.created_at < '2026-05-01'
GROUP BY c.id, c.name
ORDER BY total_sales DESC;

Perfect! 🎉 Can be passed directly to execute_query().

🔟 Summary and Best Practices

✅ What We Achieved

Zero-cost model switching: One config line to switch from DeepSeek to GPT-4
Stable output quality: Multiple cleaning layers ensure clean SQL/JSON
Complete observability: Logs help quickly locate issues
Defensive programming: Length control + exception handling ensure stability

📚 Best Practices Checklist

Use LiteLLM to unify multi-model calls
System Prompt explicitly constrains output format
Implement Prompt length check and truncation
Write response cleaner (remove tags, code block markers)
Log detailed request/response information
Differentiate exception types for different scenarios
Consider concurrent calls for higher throughput
Cache relatively stable context (e.g., semantic model)

🚀 Next Steps

The next article will dive into the Metadata Management System — how to let the LLM automatically understand database structures? How to intelligently infer table relationships? How to build high-quality semantic models?

Stay tuned! 🚀

Related code file: app/services/llm_gateway.py (llm_gateway.py)
Configuration files: app/config.py (config.py), .env.example (env.example)
Dependency library: LiteLLM documentation