Production Deployment and Performance Optimization — Asynchronous Architecture, Caching Strategy, and Monitoring

🎯 Core Chapter Questions

How to safely and efficiently deploy a development environment application to production?

Challenge	Traditional Pain Points	Our Solution
Concurrency	Synchronous blocking, CPU idle during I/O wait	FastAPI + asyncio fully asynchronous architecture
Response Speed	Querying database/calling LLM every time	Three-level cache (Memory → Redis → MySQL)
Observability	Hard to locate issues after problems occur	Prometheus + Grafana + ELK full-link monitoring
Deployment Complexity	Manual configuration, error-prone	Docker Compose one-click deployment
Cost Control	Uncontrollable LLM API call costs	Cache + Token usage monitoring

📐 Architecture Overview

Deployment Architecture: Current Implementation vs Planned Expansion

Deployment Architecture: Current Implementation vs Planned Expansion

Production Deployment Topology

Production Deployment Topology (Planned)

Production Deployment Topology (Planned)

⚡ 1. Asynchronous Architecture Design

1.1 Why Asynchronous?

Synchronous vs Asynchronous Comparison:

# ❌ Synchronous blocking mode (Flask/Django)
@app.route('/api/chat')
def chat():
    # Thread is blocked here, cannot handle other requests
    data = db.query("SELECT ...")          # Waiting for I/O: ~50ms
    result = requests.post(LLM_API, ...)   # Waiting for I/O: ~3000ms
    return jsonify(result)

# ✅ Asynchronous non-blocking mode (FastAPI)
@router.post('/api/chat')
async def chat():
    # Automatically yields CPU on await, processes other requests
    data = await db.execute(select(...))           # Non-blocking: ~50ms
    async with httpx.AsyncClient() as client:
        result = await client.post(LLM_API, ...)  # Non-blocking: ~3000ms
    return JSONResponse(result)

Performance Comparison (4-core 8GB Server)

Scenario	Synchronous (Gunicorn 4 workers)	Asynchronous (Uvicorn)	Improvement
Pure I/O intensive (calling LLM)	50 QPS	200 QPS	4x
Mixed load (DB+LLM+Redis)	120 QPS	450 QPS	3.75x
CPU intensive (data computation)	400 QPS	420 QPS	1.05x
Memory usage	800MB (4 processes)	250MB (single process)	3.2x savings

1.2 Core Asynchronous Components

app/main.py

from fastapi import FastAPI
from contextlib import asynccontextmanager
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker
import redis.asyncio as aioredis
import httpx

# Global async engine and session factory
engine = create_async_engine(
    "mysql+aiomysql://user:pass@localhost/db",
    pool_size=20,              # Connection pool size
    max_overflow=10,           # Maximum overflow connections
    pool_recycle=3600,         # Connection recycle time (seconds)
    echo=False,                # Disable SQL logging in production
)
AsyncSessionLocal = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)

# Global Redis client
redis_client = aioredis.from_url(
    "redis://localhost:6379/0",
    decode_responses=True,
    max_connections=20,
)


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Application lifecycle management"""
    print("🚀 Starting up...")

    # On startup: warm up cache
    await warmup_cache()

    yield

    print("🛑 Shutting down...")
    # On shutdown: clean up resources
    await engine.dispose()
    await redis_client.close()


app = FastAPI(
    title="NLP Data Analysis Platform",
    version="1.0.0",
    lifespan=lifespan,  # Use the new lifecycle management
)


# Dependency injection: get database session
async def get_db() -> AsyncSession:
    async with AsyncSessionLocal() as session:
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise


# Dependency injection: get Redis client
async def get_redis() -> aioredis.Redis:
    return redis_client

1.3 Concurrent Query Example (Dashboard Data Fetching)

app/api/dashboards.py (detailed in Part 6)

import asyncio

@router.post("/{dashboard_id}/data")
async def get_dashboard_data(dashboard_id: int):
    widgets = await load_widgets(dashboard_id)

    async def fetch_one(widget):
        """Query a single widget's data (with caching logic)"""
        cache_key = f"dash:{dashboard_id}:widget:{widget.id}"
        cached = await redis_client.get(cache_key)

        if cached:
            return widget.id, json.loads(cached)  # L2 hit!

        data = await execute_query(widget.sql_query)
        await redis_client.setex(cache_key, 60, json.dumps(data))  # Backfill L2
        return widget.id, data

    # 🔥 Core: use asyncio.gather for concurrent execution of all queries
    tasks = [fetch_one(w) for w in widgets]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    return dict(results)

💾 2. Multi-layer Caching Strategy

2.1 Three-Level Cache Architecture

Three-Level Cache Architecture (Planned)

Three-Level Cache Architecture (Planned)

2.2 Cache Implementation Code

app/services/cache_service.py

import json
import hashlib
from functools import lru_cache
from typing import Optional, Any
import redis.asyncio as aioredis

redis_client: aioredis.Redis = None


async def init_cache(redis_url: str = "redis://localhost:6379/0"):
    global redis_client
    redis_client = aioredis.from_url(redis_url, decode_responses=True)


def _generate_cache_key(prefix: str, *args, **kwargs) -> str:
    """
    Generate a unique cache key

    Example:
        _generate_cache_key("sql", "SELECT * FROM users", datasource_id=1)
        → "sql:a1b2c3d4e5f6"
    """
    raw = f"{prefix}:{str(args)}:{sorted(kwargs.items())}"
    return f"{prefix}:{hashlib.md5(raw.encode()).hexdigest()[:12]}"


# ===== L1: In-process cache =====

@lru_cache(maxsize=128)
def get_semantic_model_l1(datasource_id: int) -> Optional[dict]:
    """
    L1 cache: semantic model (data that rarely changes)

    Note: lru_cache is synchronous, only used for pure computation or already loaded data.
    For data that requires async retrieval, implement a manual dict cache.
    """
    pass  # Actually managed by the caller


# Global L1 dict cache (supports async)
_l1_cache: dict[str, tuple[Any, float]] = {}  # {key: (value, timestamp)}


def get_from_l1(key: str, ttl: float = 300) -> Optional[Any]:
    """Read from L1 cache"""
    if key in _l1_cache:
        value, ts = _l1_cache[key]
        if time.time() - ts < ttl:
            return value
        else:
            del _l1_cache[key]  # Delete expired entry
    return None


def set_to_l1(key: str, value: Any):
    """Write to L1 cache"""
    _l1_cache[key] = (value, time.time())


# ===== L2: Redis cache =====

async def get_from_l2(key: str) -> Optional[Any]:
    """Read from Redis"""
    if not redis_client:
        return None
    try:
        data = await redis_client.get(key)
        if data:
            return json.loads(data)
    except Exception as e:
        logger.warning(f"Redis GET failed: {e}")
    return None


async def set_to_l2(key: str, value: Any, ttl: int = 60):
    """Write to Redis (with TTL)"""
    if not redis_client:
        return
    try:
        await redis_client.setex(key, ttl, json.dumps(value, default=str))
    except Exception as e:
        logger.warning(f"Redis SET failed: {e}")


async def invalidate_l2(pattern: str):
    """Batch clear matching cache keys (e.g., dash:*)"""
    if not redis_client:
        return
    keys = await redis_client.keys(pattern)
    if keys:
        await redis_client.delete(*keys)


# ===== Unified cache interface =====

async def cached_get(
    prefix: str,
    fetch_func,  # Async function for backfilling
    l1_ttl: float = 300,
    l2_ttl: int = 60,
    *args,
    **kwargs
) -> Any:
    """
    Unified cache reading interface (L1 → L2 → L3 backfill)

    Args:
        prefix: Cache key prefix (e.g., "semantic_model", "query_result")
        fetch_func: Callback function when cache misses (usually a DB query)
        l1_ttl: L1 cache validity period (seconds)
        l2_ttl: L2 cache validity period (seconds)
    """
    cache_key = _generate_cache_key(prefix, *args, **kwargs)

    # Try L1
    result = get_from_l1(cache_key, l1_ttl)
    if result is not None:
        logger.debug(f"L1 cache hit: {cache_key}")
        return result

    # Try L2
    result = await get_from_l2(cache_key)
    if result is not None:
        logger.debug(f"L2 cache hit: {cache_key}")
        set_to_l1(cache_key, result)  # Backfill L1
        return result

    # L3: Backfill (execute actual query)
    logger.info(f"Cache miss, fetching from source: {cache_key}")
    result = await fetch_func()

    # Backfill L1 and L2
    set_to_l1(cache_key, result)
    await set_to_l2(cache_key, result, l2_ttl)

    return result

2.3 Cache Strategy by Scenario

Data Type	L1 TTL	L2 TTL	Invalidation Strategy
Semantic Model	10 minutes	30 minutes	Manual refresh (when structure changes)
Dashboard Configuration	5 minutes	15 minutes	Actively cleared when user edits
Widget Query Results	No cache	60 seconds	Auto-expire via TTL
LLM Dialog Responses	No cache	No cache	Real-time generation (different each time)
User Information	10 minutes	30 minutes	Cleared when user profile updates

📊 3. Monitoring and Alerting System

3.1 Prometheus Metric Collection

app/metrics.py

from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time
from functools import wraps

# ===== Custom metric definitions =====

# Request counter (grouped by method, path, status code)
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

# Request latency histogram (observing P50/P95/P99)
REQUEST_LATENCY = Histogram(
    'request_duration_seconds',
    'HTTP request latency in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0],
)

# LLM Token usage
LLM_TOKENS_USED = Counter(
    'llm_tokens_used_total',
    'Total LLM tokens consumed',
    ['model', 'operation']  # operation: chat/completion/embedding
)

# SQL query latency
SQL_QUERY_LATENCY = Histogram(
    'sql_query_duration_seconds',
    'SQL query execution time',
    ['operation'],  # select/insert/update
    buckets=[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0],
)

# Cache hit rate
CACHE_HITS = Counter('cache_hits_total', 'Cache hits', ['level'])  # level: l1/l2
CACHE_MISSES = Counter('cache_misses_total', 'Cache misses', ['level'])

# Current active connections
ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Currently active database connections'
)


# ===== Middleware: automatic metric collection =====

async def metrics_middleware(request, call_next):
    """FastAPI middleware: record metrics for each request"""
    start_time = time.time()

    response = await call_next(request)

    duration = time.time() - start_time

    # Record request count and latency
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status_code=response.status_code
    ).inc()

    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)

    return response


# ===== Decorator for specific functions =====

def track_llm_usage(operation: str):
    """Decorator to track LLM token usage"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            result = await func(*args, **kwargs)

            # Assume return value contains token information
            if hasattr(result, 'usage'):
                LLM_TOKENS_USED.labels(
                    model=result.model or "unknown",
                    operation=operation
                ).inc(result.usage.total_tokens)

            return result
        return wrapper
    return decorator


def track_sql_query(operation: str):
    """Decorator to track SQL execution time"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = await func(*args, **kwargs)
                SQL_QUERY_LATENCY.labels(operation=operation).observe(time.time() - start)
                return result
            except Exception as e:
                SQL_QUERY_LATENCY.labels(operation=operation).observe(time.time() - start)
                raise
        return wrapper
    return decorator

3.2 Integration into FastAPI

app/main.py (continued)

from fastapi.middleware import Middleware
from starlette.middleware.base import BaseHTTPMiddleware
from prometheus_client import make_asgi_app

app = FastAPI(middleware=[
    Middleware(BaseHTTPMiddleware, dispatch=metrics_middleware),
])

# Add /metrics endpoint (Prometheus scrapes metrics)
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

3.3 Alert Rules Configuration (alertmanager.yml)

groups:
  - name: api_alerts
    rules:
      # P1: High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[2m]))
          /
          sum(rate(http_requests_total[2m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate exceeds 5%"
          description: "Error rate in the last 2 minutes: {{ $value | humanizePercentage }}"

      # P1: P99 latency too high
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "P99 latency exceeds 5 seconds"
          description: "Current P99={{ $value }}s"

      # P2: Average latency elevated
      - alert: HighAvgLatency
        expr: |
          histogram_quantile(0.50, rate(request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P50 latency exceeds 1 second"

      # P2: Low cache hit rate
      - alert: LowCacheHitRate
        expr: |
          sum(rate(cache_hits_total[5m])) /
          (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 70%"
          description: "Current hit rate={{ $value | humanizePercentage }}"

      # P3: Resource usage alerts
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage exceeds 85%"

🐳 4. Docker Containerization Deployment

4.1 Dockerfile

Dockerfile

# Multi-stage build: reduce image size
FROM python:3.11-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Production image
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

WORKDIR /app

# Copy dependencies from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY . .

# Create non-root user for running
RUN adduser --disabled-password --gecos '' appuser && chown -R appuser:appuser /app
USER appuser

EXPOSE 8000

# Run with Uvicorn (4 Worker processes)
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

4.2 docker-compose.yml

docker-compose.yml

version: '3.8'

services:
  # Backend API service
  backend:
    build: ./backend
    container_name: nlp-api
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=mysql+aiomysql://root:${MYSQL_ROOT_PASSWORD}@mysql:3306/nlp_db
      - REDIS_URL=redis://redis:6379/0
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY}
      - LOG_LEVEL=info
    depends_on:
      mysql:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - app-network
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  # MySQL Database
  mysql:
    image: mysql:8.0
    container_name: nlp-mysql
    restart: unless-stopped
    environment:
      MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD:-rootpassword123}
      MYSQL_DATABASE: nlp_db
      MYSQL_CHARACTER_SET_SERVER: utf8mb4
      MYSQL_COLLATION_SERVER: utf8mb4_unicode_ci
    volumes:
      - mysql_data:/var/lib/mysql
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    ports:
      - "3306:3306"
    command:
      - --character-set-server=utf8mb4
      - --collation-server=utf8mb4_unicode_ci
      - --max-connections=200
      - --innodb-buffer-pool-size=256M
    networks:
      - app-network
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost", "-u", "root", "-p${MYSQL_ROOT_PASSWORD}"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Redis Cache
  redis:
    image: redis:7-alpine
    container_name: nlp-redis
    restart: unless-stopped
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    ports:
      - "6379:6379"
    networks:
      - app-network
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Nginx Reverse Proxy
  nginx:
    image: nginx:alpine
    container_name: nlp-nginx
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
      - ./frontend/dist:/usr/share/nginx/html:ro  # Vue build output
    depends_on:
      - backend
    networks:
      - app-network

  # Prometheus Metrics Collection
  prometheus:
    image: prom/prometheus:latest
    container_name: nlp-prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - app-network

  # Grafana Visualization
  grafana:
    image: grafana/grafana:latest
    container_name: nlp-grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on:
      - prometheus
    networks:
      - app-network

volumes:
  mysql_data:
  redis_data:
  prometheus_data:
  grafana_data:

networks:
  app-network:
    driver: bridge

4.3 Nginx Configuration

nginx.conf

events {
    worker_connections 1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    # Log format (includes request ID for traceability)
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$request_id" '
                    'rt=$request_time';

    access_log /var/log/nginx/access.log main;
    error_log  /var/log/nginx/error.log warn;

    # Gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/plain text/css application/json application/javascript text/xml;

    # Rate Limiting
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/m;
    limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=20r/m;

    upstream backend {
        server backend:8000;
    }

    server {
        listen 80;
        server_name your-domain.com;
        return 301 https://$server_name$request_uri;
    }

    server {
        listen 443 ssl http2;
        server_name your-domain.com;

        ssl_certificate     /etc/nginx/ssl/fullchain.pem;
        ssl_certificate_key /etc/nginx/ssl/privkey.pem;
        ssl_protocols       TLSv1.2 TLSv1.3;
        ssl_ciphers         HIGH:!aNULL:!MD5;

        # Frontend static files
        location / {
            root /usr/share/nginx/html;
            try_files $uri $uri/ /index.html;  # Vue Router history mode
            expires 1d;
            add_header Cache-Control "public, immutable";
        }

        # API proxy
        location /api/ {
            proxy_pass http://backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Request-ID $request_id;

            # General rate limiting
            limit_req zone=api_limit burst=20 nodelay;

            # WebSocket support (if needed)
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";

            # Timeout settings
            proxy_connect_timeout 10s;
            proxy_read_timeout 60s;  # LLM endpoints may be slower
            proxy_send_timeout 60s;
        }

        # LLM-related endpoints with stricter rate limiting
        location /api/chat/ {
            proxy_pass http://backend;
            limit_req zone=llm_limit burst=5 nodelay;
            proxy_read_timeout 120s;  # LLM may require more time
        }

        # Prometheus scrape endpoint (internal network only)
        location /metrics {
            allow 172.16.0.0/12;  # Only allow internal network
            deny all;
            proxy_pass http://backend;
        }
    }
}

📝 5. Logging System (ELK Stack)

5.1 Structured Logging Configuration

logging.conf

import logging
import json
import sys
from pythonjsonlogger import jsonlogger

class CustomJsonFormatter(jsonlogger.JsonFormatter):
    """Custom JSON log formatter"""

    def add_fields(self, log_record, record, message_dict):
        super().add_fields(log_record, record, message_dict)

        # Add extra fields
        log_record['level'] = record.levelname
        log_record['module'] = record.module
        log_record['function'] = record.funcName
        log_record['line'] = record.lineno


def setup_logging(level: str = "INFO"):
    """Initialize logging system"""
    formatter = CustomJsonFormatter(
        '%(timestamp)s %(level)s %(module)s %(message)s %(extra)s'
    )

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(formatter)

    root_logger = logging.getLogger()
    root_logger.setLevel(getattr(logging, level.upper()))
    root_logger.addHandler(handler)

    # Reduce log level for third-party libraries
    logging.getLogger("uvicorn").setLevel(logging.WARNING)
    logging.getLogger("sqlalchemy").setLevel(logging.WARNING)

5.2 Log Output Example

{
  "timestamp": "2026-05-22T14:30:22.123Z",
  "level": "INFO",
  "module": "chat_engine",
  "message": "Chat request completed successfully",
  "extra": {
    "session_id": "sess_abc123",
    "user_id": 42,
    "duration_ms": 3250,
    "tokens_used": 156,
    "sql_executed": true,
    "request_id": "req_xyz789"
  }
}

🎯 6. Best Practices Summary and Performance Benchmarks

6.1 Production Environment Checklist

✅ Performance Optimization

Full-stack async/await (FastAPI + SQLAlchemy 2.0 + httpx + aioredis)
Three-level cache architecture (L1 memory → L2 Redis → L3 MySQL)
asyncio.gather() for concurrent queries (Dashboard scenarios)
Nginx Gzip compression + static resource caching
Database connection pool optimization (pool_size=20, max_overflow=10)

✅ Reliability Assurance

Docker Compose one-click deployment + restart: unless-stopped
Health Check endpoint (/health)
Graceful Shutdown (SIGTERM signal handling)
Database connection auto-recycle (pool_recycle=3600)
Redis max memory limit + LRU eviction policy

✅ Security Hardening

SQL injection prevention (validate_sql regex check)
Rate Limiting (Nginx layer: 100 req/min)
JWT authentication + CORS whitelist
HTTPS forced redirect (SSL/TLS 1.2+)
Sensitive information management via environment variables (.env file)

✅ Observability

Prometheus metric collection (request count/latency/Token/cache hit rate)
Grafana monitoring dashboards (3 dashboards)
AlertManager tiered alerts (P1/P2/P3)
ELK structured logging (JSON format + associated Request ID)
Distributed tracing (unique ID per request)

6.2 Performance Benchmark Results

Scenario	P50	P95	P99	QPS (single instance)	Notes
Health Check (`/health`)	2ms	5ms	12ms	5000+	Pure in-memory operation
Dashboard Load (with cache)	45ms	120ms	250ms	2000	L2 hit rate ~85%
Chat Dialog (calling DeepSeek)	2100ms	3800ms	5200	50	Bottleneck is LLM API
NL→SQL Conversion	2500ms	4200ms	6000	45	Includes semantic model loading
Metadata Analysis	3000ms	5500ms	8000	20	Large table scan scenario

Test Environment: 4-core 8GB Docker containers × 2 (total 8 cores, 16GB)

6.3 Monthly Cost Estimate (100 DAU Scenario)

Item	Monthly Cost	Description
Cloud Servers (2C4G × 2)	¥400	Alibaba Cloud/Tencent Cloud lightweight servers
LLM API Tokens	¥200-500	Depends on dialog frequency (DeepSeek pricing)
CDN + Bandwidth	¥100	Static resource distribution + API traffic
Domain + SSL Certificate	¥50	.com domain + Let’s Encrypt free SSL
Total	¥750 - ¥1,050	Affordable for small teams

🏆 7. Series Summary

Congratulations on finishing all 8 technical blog posts! Let’s review the core highlights of the entire project:

📚 Series Recap

Part	Core Topic	Key Technologies
Part 1	Project Overview & Technology Selection	Vue3 + FastAPI + MySQL + Redis + ECharts
Part 2	LLM Unified Gateway Design	LiteLLM abstraction layer, Prompt engineering, response cleaning
Part 3	Intelligent Metadata Management System	LLM automatic table structure analysis, relationship inference algorithm
Part 4	NL→SQL Conversion Engine	Semantic model injection, SQL generation, security validation
Part 5	Intelligent Dialogue Engine in Practice	Multi-turn context management, chart recommendation, report generation
Part 6	Two-Stage Separation for Data Dashboards	Design-time vs runtime decoupling, configuration-driven philosophy
Part 7	Frontend Drag-and-Drop Interaction System	mousedown events, Grid layout, AABB collision detection
Part 8	Production Deployment & Performance Optimization	Asynchronous architecture, three-level cache, Prometheus monitoring

🎯 Core Innovations of the Project

Two-Stage Separation Architecture (Part 6): LLM used only at design time, zero AI cost at runtime
Intelligent Chart Recommendation (Part 5): Automatically selects visualization based on SQL features
Multi-turn Context Management (Part 5): Maintains 10-turn conversation history, understands references
Full-stack Asynchronous Architecture (Part 8): async/await from frontend to backend to database
Three-level Cache System (Part 8): L1→L2→L3, hit rate 90%+
Declarative UI System (Parts 6-7): JSON configuration-driven, supports free drag-and-drop layout

💡 Applicable Scenarios

✅ Suitable for:

Internal enterprise data analysis platforms
Self-service BI tools for non-technical users
SaaS product prototypes requiring NL→SQL capability
Rapid data visualization MVP building

⚠️ Caveats:

LLM-generated SQL may not be perfectly accurate (require manual review for critical queries)
Complex analysis scenarios still require professional analysts
Data-sensitive scenarios need private deployment of LLM

🚀 Next Steps Recommendations

If you’re interested in this project, we suggest following this order:

Local Startup: Backend uvicorn + frontend npm run dev, first get the main flow working
Read Part 4 to understand the core NL→SQL conversion process
Try Custom Prompts (Part 2), adapting to your business scenarios
Extend Widget Types (Parts 6-7), add more chart components
Integrate Other LLMs (Part 2), such as GPT-4, Claude, etc.
Deploy to Production (Part 8), experience the full DevOps workflow

Author: DevCfg.cn

Date: May 2026

License: MIT License

Thanks for reading! Feel free to discuss in the comments if you have any questions 🎉