vLLM Deployment: Dynamic Mounting of LoRA and OpenAI-Compatible API

0. Series Closure

This Article	Upstream	Output	Downstream
Part 10/10	Training + Validation	Production-ready HTTP API	Business App / Mini Program / Voice Gateway

This series adopts Approach B, dynamic LoRA attachment, and does not write a merge_lora.py merging routine. The base model (~8.7 GB) and the adapter (~41 MB) stay separate, so swapping the LoRA only requires replacing a small directory.

1. The Real Problem to Solve

verify_lora.py runs in a single process, serves a single user, and calls generate serially. Going online requires:

Concurrent requests
Stable latency
OpenAI-compatible interface (existing clients/SDKs can be reused)
Efficient GPU memory reuse (PagedAttention)

vLLM’s role in this project: Serve Qwen3.5-4B + elderly LoRA on Linux + NVIDIA.

2. Implementation Location & Architecture

flowchart LR
    BM[models/Qwen3.5-4B]
    AD[output/.../final_lora]
    VV[vllm serve]
    API["POST /v1/chat/completions"]
    BM --> VV
    AD --> VV
    VV --> API

Path	Description
`LoRA_Demo/models/Qwen3.5-4B/`	Base model
`LoRA_Demo/output/lora_elderly_single/final_lora/`	Adapter
`LoRA_Demo/README.md`	curl / Python examples
`adapter_config.json`	r=8, matching `--max-lora-rank`

3. Environment Requirements

`pip install "vllm>=0.19.0"`

Item	Requirement
OS	Linux (macOS not supported)
GPU	V100-32GB / 4090 / A10-24GB etc.
Disk	Base model + vLLM cache, ≥15 GB free space recommended
Prerequisite	Validation from Part 08 passed

4. Startup Command (Dynamic LoRA)

cd LoRA_Demo

vllm serve ./models/Qwen3.5-4B \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-lora \
  --max-lora-rank 8 \
  --max-loras 1 \
  --lora-modules elderly=./output/lora_elderly_single/final_lora \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90
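
To get a feel for what `--gpu-memory-utilization 0.90` leaves available, the following back-of-the-envelope sketch subtracts the weight size from the memory vLLM is allowed to use. It assumes the ~8.7 GB on-disk size of the base model roughly equals its GPU footprint, and the helper name `kv_budget_gb` is made up for illustration. Real numbers depend on dtype, CUDA graphs, and the vLLM version, so treat the result as an estimate only.

```python
def kv_budget_gb(total_gb: float, utilization: float, weights_gb: float) -> float:
    """Rough memory (GB) left for KV cache and activations.

    Assumption: weights_gb approximates the GPU footprint of the weights.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gb * utilization - weights_gb


# Cards listed in the environment table; 8.7 GB is the base model size from this article.
for card, total in [("A10-24GB", 24.0), ("V100-32GB", 32.0)]:
    print(card, round(kv_budget_gb(total, 0.90, 8.7), 1), "GB for KV cache and activations")
```

If the estimate comes out small or negative, lowering `--max-model-len` or the expected concurrency is likely a safer fix than raising the utilization fraction.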

Parameters Aligned with Training Configuration

vLLM Parameter	Training Counterpart	Consequence of Mismatch
`--max-lora-rank 8`	`LORA_R=8`	Startup failure or LoRA not loaded
`--lora-modules elderly=...`	`final_lora/` path	404 / default base model replies
`--max-model-len 2048`	Training `MAX_SEQ_LEN=512`	Inference can be longer; consumes more memory

elderly is the model name used in API requests, not a disk directory name.
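
One way to keep the startup flags aligned with the adapter is to read the rank from `adapter_config.json` before launching. The sketch below does that and builds the argument list. The helper names `read_lora_rank` and `build_serve_args` are hypothetical, and the check assumes the adapter rank must not exceed `--max-lora-rank`.

```python
import json
from pathlib import Path


def read_lora_rank(adapter_dir: str) -> int:
    """Return the LoRA rank `r` stored in the adapter's adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as f:
        return int(json.load(f)["r"])


def build_serve_args(base_model: str, adapter_dir: str, name: str,
                     max_lora_rank: int) -> list[str]:
    """Build a `vllm serve` argument list after checking the rank fits.

    Assumption: the adapter rank must be <= --max-lora-rank.
    """
    rank = read_lora_rank(adapter_dir)
    if rank > max_lora_rank:
        raise ValueError(
            f"adapter rank {rank} exceeds --max-lora-rank {max_lora_rank}"
        )
    return [
        "vllm", "serve", base_model,
        "--enable-lora",
        "--max-lora-rank", str(max_lora_rank),
        "--max-loras", "1",
        "--lora-modules", f"{name}={adapter_dir}",
    ]
```

The returned list can be passed to `subprocess.run(...)` or printed for review. Failing fast here is cheaper than debugging a server that started without the adapter.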

Why Not Merge

merge_lora.py bakes the LoRA into the base model, producing a new ~8 GB directory. Pros: better compatibility. Cons: it occupies extra disk space, and switching adapters requires re-merging.
This tutorial series prioritizes dynamic LoRA; if dynamic LoRA does not work for Qwen3.5 on your vLLM version, consider merging (covered in a separate article).

5. HTTP Call

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elderly",
    "messages": [
      {
        "role": "system",
        "content": "你是一位温柔、耐心、善解人意的老年情感陪伴助手，说话慢一点、软一点，多共情、多倾听、多肯定，不说教、不反驳、不催促。"
      },
      {
        "role": "user",
        "content": "晚上睡不着，心里乱糟糟的。"
      }
    ],
    "max_tokens": 200,
    "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Required Items

model: "elderly" (must match the name registered with --lora-modules elderly=...)
system: must match the training JSONL verbatim (see verify_lora.py lines 33–36)
enable_thinking: false (disables thinking mode, as covered in Part 09)
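
The three required items are easy to get wrong when request bodies are assembled by hand in several places. A minimal sketch of a single payload builder that pins them in one spot is shown below. The function name `build_chat_payload` is hypothetical, and the prompt text is copied from the curl example above.

```python
import json

# System prompt copied verbatim from the curl example; it must match the training JSONL.
SYSTEM_PROMPT = (
    "你是一位温柔、耐心、善解人意的老年情感陪伴助手，"
    "说话慢一点、软一点，多共情、多倾听、多肯定，"
    "不说教、不反驳、不催促。"
)


def build_chat_payload(user_text: str, model: str = "elderly") -> dict:
    """Return a /v1/chat/completions request body with the required items fixed."""
    return {
        "model": model,  # name registered via --lora-modules
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 200,
        "temperature": 0.7,
        "chat_template_kwargs": {"enable_thinking": False},
    }


# ensure_ascii=False keeps the Chinese text readable in logs and curl bodies.
print(json.dumps(build_chat_payload("晚上睡不着，心里乱糟糟的。"), ensure_ascii=False)[:60])
```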

Expected Response Direction

The reply should be in Chinese, empathic, in short sentences, and with a companionship tone, similar to the Mac-validated “can’t sleep at night” example rather than a list of advice.

6. Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="elderly",
    messages=[
        {
            "role": "system",
            "content": (
                "你是一位温柔、耐心、善解人意的老年情感陪伴助手，"
                "说话慢一点、软一点，多共情、多倾听、多肯定，"
                "不说教、不反驳、不催促。"
            ),
        },
        {"role": "user", "content": "孩子们都不来看我"},
    ],
    max_tokens=200,
    temperature=0.7,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)

api_key="EMPTY": by default vLLM does not validate API keys, so any placeholder works locally. For public exposure, put a reverse proxy with authentication in front of the service (see Security in section 8).

7. Parameter Alignment with verify_lora.py

Item	verify_lora.py	vLLM
temperature	0.7	0.7 (adjustable)
top_p	0.9	top_p: 0.9 (standard request field)
max_new_tokens	200	max_tokens: 200
system	SYSTEM_PROMPT	messages[0]
thinking	enable_thinking=False	chat_template_kwargs

In production, lowering temperature slightly (for example to 0.6) may reduce incoherent output; confirm the effect with A/B testing.
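
The mapping in the table can be captured in a small helper so that validation and serving share one set of sampling values. This is only a sketch: the function name `to_vllm_kwargs` and the shape of the input dict are assumptions for illustration, not code from verify_lora.py.

```python
def to_vllm_kwargs(gen_kwargs: dict) -> dict:
    """Translate generate-style settings into OpenAI SDK call arguments.

    Assumption: gen_kwargs uses the key names from the table above
    (max_new_tokens, temperature, top_p, enable_thinking).
    """
    out: dict = {}
    if "max_new_tokens" in gen_kwargs:
        out["max_tokens"] = gen_kwargs["max_new_tokens"]
    for key in ("temperature", "top_p"):
        if key in gen_kwargs:
            out[key] = gen_kwargs[key]
    if "enable_thinking" in gen_kwargs:
        # Non-standard fields go through extra_body in the OpenAI SDK.
        out["extra_body"] = {
            "chat_template_kwargs": {"enable_thinking": gen_kwargs["enable_thinking"]}
        }
    return out


verify_settings = {
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "enable_thinking": False,
}
print(to_vllm_kwargs(verify_settings))
```

The result can be splatted into the client call, for example `client.chat.completions.create(model="elderly", messages=messages, **to_vllm_kwargs(verify_settings))`.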

8. Production Checklist

Functionality
  □ curl returns 200 and the content is in the Chinese companionship style
  □ model=elderly differs clearly from the output without LoRA (to compare, temporarily start the server without --enable-lora and --lora-modules)
  □ Thinking is disabled; no English reasoning chains appear in the output

Resources
  □ nvidia-smi shows stable GPU memory usage
  □ --gpu-memory-utilization between 0.85 and 0.92; setting it higher risks OOM

Security
  □ Public network: open only port 8000 in the firewall, or front the service with Nginx
  □ Add API key or mTLS authentication; do not expose a 0.0.0.0 listener directly to the internet
  □ Do not store raw user text in logs (compliance)

Operations
  □ Updating LoRA: replace final_lora → restart vLLM
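  □ The adapter is tied to the base weights it was trained on: reuse it as-is only while the base model files are unchanged; after a base model upgrade, re-run validation before reusing it, even if r/target are unchanged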
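
Part of the functionality check can be automated by asking the server which model names it serves. The sketch below assumes the OpenAI-style `GET /v1/models` endpoint and its `{"data": [{"id": ...}]}` response shape. The helper names are hypothetical, and the network call sits in its own function so the parsing logic can be tested offline.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def lora_is_registered(models_response: dict, name: str = "elderly") -> bool:
    """True if the LoRA name appears among the served models."""
    return name in served_model_ids(models_response)


def fetch_models(base_url: str = "http://localhost:8000/v1") -> dict:
    """Fetch the model list from a running server (assumed local URL)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return json.load(resp)
```

With the server running, `lora_is_registered(fetch_models())` should be true before any chat request is attempted. If it is false, recheck the `--lora-modules` argument.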

9. Pitfalls

Pitfall 1: Wrong model name
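If the request's model field names the base model instead of elderly (the name registered by --lora-modules), vLLM serves the base model without the LoRA; a name that is not registered at all returns a 404 error.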

Pitfall 2: Omitting or modifying system prompt
If the system prompt is shortened in production, the style reverts to that of a general assistant, which creates the same false impression that “fine-tuning didn’t work.”

Pitfall 3: vLLM version too old
Qwen3.5 requires a relatively new vLLM (README recommends ≥0.19). If startup reports architecture errors, upgrade first.

Pitfall 4: max-model-len too large + high concurrency leads to OOM
A context of 2048 tokens is sufficient for companionship dialogue; if the chat history grows long, the client must truncate messages.
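
Client-side truncation can be as simple as keeping the system prompt plus the most recent turns that fit a budget. The sketch below uses character count as a rough stand-in for tokens, which is an assumption; a real client would count tokens with the model's tokenizer. The function name `truncate_messages` is hypothetical.

```python
def truncate_messages(messages: list[dict], max_chars: int) -> list[dict]:
    """Keep system messages and the newest turns that fit within max_chars.

    Assumption: character count approximates token count. The newest
    non-system message is always kept, even if it alone exceeds the budget.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)

    kept: list[dict] = []
    for i, msg in enumerate(reversed(rest)):  # walk from newest to oldest
        cost = len(msg["content"])
        if i > 0 and cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + kept[::-1]  # restore chronological order
```

Dropping the oldest turns first keeps the system prompt intact, which matters here because the companionship style depends on it.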

Pitfall 5: Deploying vLLM on Mac
Not possible. Training/validation can be done on Mac, but serving requires Linux + GPU.

10. Full Series Pipeline Review

Phase	Script/Command	Actual Reference
Data	`elderly_chat.jsonl`	1000 entries
Training	`python train_lora_single.py`	V100 41 min
Metrics	`all_logs.log`	loss 2.81→0.13
Validation	`python verify_lora.py`	Mac MPS passed
Deployment	`vllm serve ... --enable-lora`	Linux GPU

Not covered in this series: train_lora_multi.py (multi-GPU), merge_lora.py (merging).

11. Summary

Use --lora-modules elderly=final_lora for dynamic attachment; no merging is needed.
Three items must be consistent with training/validation: model name, system prompt, enable_thinking.
--max-lora-rank 8 must match adapter_config.json.
Production can be integrated using the OpenAI SDK.
For public deployment, add authentication and privacy policies.

Appendix: README Deployment Architecture Text

`Base Model Qwen3.5-4B + LoRA Adapter → vLLM Service → OpenAI-Compatible API`

Path: LoRA_Demo/README.md lines 146–148.


End of series. Extended reading (multi-GPU DDP, merged deployment, and quantitative evaluation on a validation set) will be covered in a separate topic after the experiments are completed.
