0. Series Closure

| This Article | Upstream | Output | Downstream |
| --- | --- | --- | --- |
| Part 10/10 | Training + Validation | Production-ready HTTP API | Business App / Mini Program / Voice Gateway |

This series adopts Approach B: Dynamic LoRA Attachment, without writing a merge_lora.py merging routine. The base model (8.7 GB) and adapter (41 MB) are separated; swapping LoRA only requires replacing a small directory.


1. The Real Problem to Solve

verify_lora.py runs a single-process, single-user, serial generate call. Going to production requires:

  • Concurrent requests
  • Stable latency
  • OpenAI-compatible interface (existing clients/SDKs can be reused)
  • Efficient GPU memory reuse (PagedAttention)

vLLM’s role in this project: Serve Qwen3.5-4B + elderly LoRA on Linux + NVIDIA.
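
To make the "concurrent requests" and "stable latency" requirements concrete, here is a minimal sketch of a client-side fan-out that records per-request latency. The names `fan_out` and `fake_send` are hypothetical and not part of vLLM or this series' scripts; `fake_send` stands in for a real HTTP call so the sketch runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def fan_out(
    send: Callable[[str], str],
    prompts: List[str],
    max_workers: int = 8,
) -> List[Tuple[str, float]]:
    """Send all prompts concurrently; return (reply, latency_seconds) in input order."""

    def timed(prompt: str) -> Tuple[str, float]:
        start = time.perf_counter()
        reply = send(prompt)
        return reply, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input prompts.
        return list(pool.map(timed, prompts))


def fake_send(prompt: str) -> str:
    """Hypothetical stand-in for an HTTP request to the serving endpoint."""
    time.sleep(0.05)
    return f"echo: {prompt}"


if __name__ == "__main__":
    for reply, latency in fan_out(fake_send, ["a", "b", "c"]):
        print(f"{latency:.3f}s  {reply}")
```

Replacing `fake_send` with a function that posts to the chat completions endpoint would give a rough view of how latency behaves as `max_workers` grows.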


2. Implementation Location & Architecture

flowchart LR
BM[models/Qwen3.5-4B]
AD[output/.../final_lora]
VV[vllm serve]
API["POST /v1/chat/completions"]
BM --> VV
AD --> VV
VV --> API

| Path | Description |
| --- | --- |
| LoRA_Demo/models/Qwen3.5-4B/ | Base model |
| LoRA_Demo/output/lora_elderly_single/final_lora/ | Adapter |
| LoRA_Demo/README.md | curl / Python examples |
| adapter_config.json | r=8, matching --max-lora-rank |
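
Since the adapter's rank has to line up with --max-lora-rank, it can help to check adapter_config.json before starting the server. The sketch below is one way to do that, assuming the adapter was saved in PEFT format (which stores the rank under the key `r`) and assuming the server rejects adapters whose rank exceeds the configured maximum. `check_lora_rank` is a hypothetical helper, not part of this series' scripts.

```python
import json
from pathlib import Path


def check_lora_rank(adapter_dir: str, max_lora_rank: int) -> int:
    """Return the adapter's rank, raising if it exceeds the serving limit."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    rank = int(config["r"])  # assumption: PEFT-style config with the rank under "r"
    if rank > max_lora_rank:
        raise ValueError(
            f"adapter rank {rank} exceeds --max-lora-rank {max_lora_rank}"
        )
    return rank
```

For this project the call would look like `check_lora_rank("output/lora_elderly_single/final_lora", 8)`, run from the LoRA_Demo directory.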

3. Environment Requirements

pip install "vllm>=0.19.0"

| Item | Requirement |
| --- | --- |
| OS | Linux (macOS not supported) |
| GPU | V100-32GB / 4090 / A10-24GB etc. |
| Disk | Base model + vLLM cache, ≥15 GB free space recommended |
| Prerequisite | Validation from Part 08 passed |

4. Startup Command (Dynamic LoRA)

cd LoRA_Demo

vllm serve ./models/Qwen3.5-4B \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-lora \
  --max-lora-rank 8 \
  --max-loras 1 \
  --lora-modules elderly=./output/lora_elderly_single/final_lora \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90

Parameters Aligned with Training Configuration

| vLLM Parameter | Training Counterpart | Consequence of Mismatch |
| --- | --- | --- |
| --max-lora-rank 8 | LORA_R=8 | Startup failure or LoRA not loaded |
| --lora-modules elderly=... | final_lora/ path | 404 / default base model replies |
| --max-model-len 2048 | Training MAX_SEQ_LEN=512 | Inference can be longer; consumes more memory |

elderly is the model name used in API requests, not a disk directory name.
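
A quick way to confirm that the name `elderly` was actually registered is to list the served models after startup. The sketch below assumes the server exposes the OpenAI-style `GET /v1/models` endpoint and lists LoRA modules alongside the base model; the helper names are hypothetical. The parsing is kept separate from the network call so it can be exercised without a running server.

```python
import json
import urllib.request
from typing import List


def served_model_ids(models_payload: dict) -> List[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def is_registered(models_payload: dict, name: str = "elderly") -> bool:
    """True if `name` appears among the served model ids."""
    return name in served_model_ids(models_payload)


def fetch_models(base_url: str = "http://localhost:8000/v1") -> dict:
    """GET the model list from a running server (assumed address)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `is_registered(fetch_models())` should return `True` if the `--lora-modules` entry was picked up.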

Why Not Merge

merge_lora.py bakes the LoRA into the base model (new ~8GB directory). Pros: better compatibility. Cons: occupies disk space, requires re-merging to switch adapters.
This tutorial series prioritizes dynamic LoRA; if the vLLM version is incompatible with Qwen3.5, consider merging (covered in a separate article).


5. HTTP Call

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elderly",
    "messages": [
      {
        "role": "system",
        "content": "你是一位温柔、耐心、善解人意的老年情感陪伴助手,说话慢一点、软一点,多共情、多倾听、多肯定,不说教、不反驳、不催促。"
      },
      {
        "role": "user",
        "content": "晚上睡不着,心里乱糟糟的。"
      }
    ],
    "max_tokens": 200,
    "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Required Items

  1. model: "elderly" — must match --lora-modules elderly=...
  2. system must be verbatim from JSONL — see verify_lora.py lines 33–36
  3. enable_thinking: false — from Part 09
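
The three required items are easy to get wrong when requests are assembled by hand in several places. One option is to centralize them in a small builder, sketched below. `build_payload` is a hypothetical helper; the system prompt is passed in rather than retyped, on the assumption that the caller loads it from the same source as training.

```python
from typing import Any, Dict


def build_payload(
    system_prompt: str,
    user_text: str,
    model: str = "elderly",
) -> Dict[str, Any]:
    """Assemble a chat completion request with the three required items fixed."""
    if not system_prompt.strip():
        raise ValueError("system prompt must not be empty")
    return {
        "model": model,  # must match the name given to --lora-modules
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 200,
        "temperature": 0.7,
        # Disables thinking mode through the chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    }
```

Serializing the result with `json.dumps(payload, ensure_ascii=False)` gives a body equivalent in shape to the curl example.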

Expected Response Direction

The reply should be in Chinese, empathetic, written in short sentences, and companionable in tone, similar to the Mac-validated “can’t sleep at night” example rather than a list of advice.


6. Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="elderly",
    messages=[
        {
            "role": "system",
            "content": (
                "你是一位温柔、耐心、善解人意的老年情感陪伴助手,"
                "说话慢一点、软一点,多共情、多倾听、多肯定,"
                "不说教、不反驳、不催促。"
            ),
        },
        {"role": "user", "content": "孩子们都不来看我"},
    ],
    max_tokens=200,
    temperature=0.7,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)

api_key="EMPTY": vLLM does not validate API keys by default, so any placeholder string works locally. For public exposure, add a reverse proxy with authentication (see the Security items in Section 8).


7. Parameter Alignment with verify_lora.py

| Item | verify_lora.py | vLLM |
| --- | --- | --- |
| temperature | 0.7 | 0.7 (adjustable) |
| top_p | 0.9 | Can be passed as top_p in the request |
| max_new_tokens | 200 | max_tokens: 200 |
| system | SYSTEM_PROMPT | messages[0] |
| thinking | enable_thinking=False | chat_template_kwargs |

In production, lowering temperature slightly (for example to 0.6) may reduce incoherent output; confirm with A/B testing before adopting it.
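
The only name that differs between the two sides of the table is max_new_tokens versus max_tokens. A small mapping function keeps a single source of truth for the sampling values; this is a sketch with hypothetical names (`VERIFY_PARAMS`, `to_vllm_fields`), using the values from the table.

```python
from typing import Any, Dict

# Values copied from the alignment table; the dict itself is illustrative.
VERIFY_PARAMS: Dict[str, Any] = {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 200,
}


def to_vllm_fields(gen_params: Dict[str, Any]) -> Dict[str, Any]:
    """Rename local generate() arguments to OpenAI-style request fields."""
    renames = {"max_new_tokens": "max_tokens"}
    return {renames.get(key, key): value for key, value in gen_params.items()}
```

The returned dict can be unpacked into the request, so a change to the validation settings carries over to serving without editing two places.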


8. Production Checklist

Functionality
□ curl returns 200 and the content is in the Chinese companionship style
□ model=elderly shows a clear difference from the base model without LoRA (to compare, temporarily restart without the LoRA flags and request the base model)
□ Thinking is disabled; no English reasoning chains appear in replies

Resources
□ nvidia-smi shows stable GPU memory usage
□ Keep --gpu-memory-utilization between 0.85 and 0.92; higher values risk OOM

Security
□ Public network: restrict the firewall to port 8000 only, or front the service with Nginx
□ Add an API key or mTLS; do not expose a service bound to 0.0.0.0 without authentication
□ Logs do not store raw user text (compliance)

Operations
□ Updating LoRA: replace final_lora → restart vLLM
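□ When the base model is upgraded but the adapter's r and target modules remain unchanged, the adapter can be reused; re-run validation first, since it was trained against the previous weights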
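
For the checklist item about not storing raw user text, one possible approach is to log only metadata about each exchange. The sketch below is illustrative, with a hypothetical `redacted_log_record` helper; whether a truncated hash is acceptable under a given compliance regime is an assumption to verify, since short texts can be guessed by brute force.

```python
import hashlib
import json


def redacted_log_record(user_text: str, reply_text: str, status: int) -> str:
    """Build a JSON log line that carries no raw conversation text."""
    digest = hashlib.sha256(user_text.encode("utf-8")).hexdigest()
    record = {
        "status": status,
        "user_chars": len(user_text),
        "reply_chars": len(reply_text),
        # A short prefix is enough to spot repeated inputs in the logs.
        "user_sha256_prefix": digest[:12],
    }
    return json.dumps(record, ensure_ascii=False)
```

Lengths and status codes are usually sufficient for latency and error dashboards, which is the main operational use of these logs.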

9. Pitfalls

Pitfall 1: Wrong model name
If the request sets model to Qwen3.5-4B while --lora-modules registers elderly, the LoRA is not applied: the request either fails with a model-not-found error or is answered by the base model alone.

Pitfall 2: Omitting or modifying system prompt
If the system prompt is shortened in production, the style reverts to that of a general assistant, which creates the same false impression that fine-tuning did not work.

Pitfall 3: vLLM version too old
Qwen3.5 requires a relatively new vLLM (README recommends ≥0.19). If startup reports architecture errors, upgrade first.

Pitfall 4: max-model-len too large + high concurrency leads to OOM
2048 is sufficient for companionship; if chat history is long, the client must truncate messages.

Pitfall 5: Deploying vLLM on Mac
Not possible. Training/validation can be done on Mac, but serving requires Linux + GPU.
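
Pitfall 4 leaves truncation to the client. A minimal sketch of one strategy follows: keep the system message, always keep the newest turn, and add older turns while a budget allows. It uses character counts as a rough proxy for tokens, which is an assumption; a real client would count with the model's tokenizer. `truncate_history` is a hypothetical helper.

```python
from typing import Dict, List

Message = Dict[str, str]


def truncate_history(messages: List[Message], max_chars: int) -> List[Message]:
    """Keep the system message plus the most recent turns within max_chars."""
    system = [m for m in messages if m["role"] == "system"][:1]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)

    kept: List[Message] = []  # newest first
    for message in reversed(turns):
        cost = len(message["content"])
        if kept and cost > budget:
            break
        kept.append(message)  # the newest turn is always kept
        budget -= cost

    # Drop leading non-user turns so the history starts with a user message.
    while len(kept) > 1 and kept[-1]["role"] != "user":
        kept.pop()
    return system + list(reversed(kept))
```

Keeping the system message unconditionally matters here because, as Pitfall 2 notes, the fine-tuned style depends on it.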


10. Full Series Pipeline Review

| Phase | Script/Command | Actual Reference |
| --- | --- | --- |
| Data | elderly_chat.jsonl | 1000 entries |
| Training | python train_lora_single.py | V100 41 min |
| Metrics | all_logs.log | loss 2.81→0.13 |
| Validation | python verify_lora.py | Mac MPS passed |
| Deployment | vllm serve ... --enable-lora | Linux GPU |

Not covered in this series: train_lora_multi.py (multi-GPU), merge_lora.py (merging).


11. Summary

  1. Use --lora-modules elderly=final_lora for dynamic attachment—no merging needed.
  2. Three items must be consistent with training/validation: model name, system prompt, enable_thinking.
  3. --max-lora-rank 8 must match adapter_config.json.
  4. Production can be integrated using the OpenAI SDK.
  5. For public deployment, add authentication and privacy policies.

Appendix: README Deployment Architecture Text

Base Model Qwen3.5-4B  +  LoRA Adapter  →  vLLM Service  →  OpenAI-Compatible API

Path: LoRA_Demo/README.md lines 146–148.



End of series. Extended reading: Multi-GPU DDP, merged deployment, validation set quantitative evaluation—to be covered in a separate topic after experiments are completed.

