0. Series Closure
| This Article | Upstream | Output | Downstream |
|---|---|---|---|
| Part 10/10 | Training + Validation | Production-ready HTTP API | Business App / Mini Program / Voice Gateway |
This series adopts Approach B: Dynamic LoRA Attachment, without writing a merge_lora.py merging routine. The base model (8.7 GB) and adapter (41 MB) are separated; swapping LoRA only requires replacing a small directory.
1. The Real Problem to Solve
verify_lora.py is a single-process, single-user, serial generate. Going online requires:
- Concurrent requests
- Stable latency
- OpenAI-compatible interface (existing clients/SDKs can be reused)
- Efficient GPU memory reuse (PagedAttention)
vLLM’s role in this project: Serve Qwen3.5-4B + elderly LoRA on Linux + NVIDIA.
2. Implementation Location & Architecture
1 | |
| Path | Description |
|---|---|
LoRA_Demo/models/Qwen3.5-4B/ |
Base model |
LoRA_Demo/output/lora_elderly_single/final_lora/ |
Adapter |
LoRA_Demo/README.md |
curl / Python examples |
adapter_config.json |
r=8, matching --max-lora-rank |
3. Environment Requirements
1 | |
| Item | Requirement |
|---|---|
| OS | Linux (macOS not supported) |
| GPU | V100-32GB / 4090 / A10-24GB etc. |
| Disk | Base model + vLLM cache, ≥15 GB free space recommended |
| Prerequisite | Validation from Part 08 passed |
4. Startup Command (Dynamic LoRA)
1 | |
Parameters Aligned with Training Configuration
| vLLM Parameter | Training Counterpart | Consequence of Mismatch |
|---|---|---|
--max-lora-rank 8 |
LORA_R=8 |
Startup failure or LoRA not loaded |
--lora-modules elderly=... |
final_lora/ path |
404 / default base model replies |
--max-model-len 2048 |
Training MAX_SEQ_LEN=512 |
Inference can be longer; consumes more memory |
elderly is the model name used in API requests, not a disk directory name.
Why Not Merge
merge_lora.py bakes the LoRA into the base model (new ~8GB directory). Pros: better compatibility. Cons: occupies disk space, requires re-merging to switch adapters.
This tutorial series prioritizes dynamic LoRA; if the vLLM version is incompatible with Qwen3.5, consider merging (covered in a separate article).
5. HTTP Call
1 | |
Required Items
model:"elderly"— must match--lora-modules elderly=...systemmust be verbatim from JSONL — seeverify_lora.pylines 33–36enable_thinking: false— from Part 09
Expected Response Direction
Chinese, empathic, short sentences, with companionship tone—similar to the Mac-validated “can’t sleep at night” example, not a list of advice.
6. Python Client
1 | |
api_key="EMPTY" — vLLM locally does not validate keys by default. For public exposure, add a reverse proxy with authentication (see security below).
7. Parameter Alignment with verify_lora.py
| Item | verify_lora.py | vLLM |
|---|---|---|
| temperature | 0.7 | 0.7 (adjustable) |
| top_p | 0.9 | Can be passed in extra |
| max_new_tokens | 200 | max_tokens: 200 |
| system | SYSTEM_PROMPT | messages[0] |
| thinking | enable_thinking=False | chat_template_kwargs |
On production, slightly lowering temperature (0.6) can reduce random gibberish; requires A/B testing.
8. Production Checklist
1 | |
9. Pitfalls
Pitfall 1: Wrong model name
If you fill in Qwen3.5-4B but --lora-modules registers elderly, it may fall back to the base model without LoRA.
Pitfall 2: Omitting or modifying system prompt
If the system prompt is shortened in production, the style reverts to a general assistant—same illusion as “fine-tuning didn’t work.”
Pitfall 3: vLLM version too old
Qwen3.5 requires a relatively new vLLM (README recommends ≥0.19). If startup reports architecture errors, upgrade first.
Pitfall 4: max-model-len too large + high concurrency leads to OOM
2048 is sufficient for companionship; if chat history is long, the client must truncate messages.
Pitfall 5: Deploying vLLM on Mac
Not possible. Training/validation can be done on Mac, but serving requires Linux + GPU.
10. Full Series Pipeline Review
| Phase | Script/Command | Actual Reference |
|---|---|---|
| Data | elderly_chat.jsonl |
1000 entries |
| Training | python train_lora_single.py |
V100 41 min |
| Metrics | all_logs.log |
loss 2.81→0.13 |
| Validation | python verify_lora.py |
Mac MPS passed |
| Deployment | vllm serve ... --enable-lora |
Linux GPU |
Not covered in this series: train_lora_multi.py (multi-GPU), merge_lora.py (merging).
11. Summary
- Use
--lora-modules elderly=final_lorafor dynamic attachment—no merging needed. - Three items must be consistent with training/validation:
modelname,systemprompt,enable_thinking. --max-lora-rank 8must matchadapter_config.json.- Production can be integrated using the OpenAI SDK.
- For public deployment, add authentication and privacy policies.
Appendix: README Deployment Architecture Text
1 | |
Path: LoRA_Demo/README.md lines 146–148.
Series Navigation
| Article | Link |
|---|---|
| Previous | 09 · Qwen3.5 Pitfalls |
| Index | README |
End of series. Extended reading: Multi-GPU DDP, merged deployment, validation set quantitative evaluation—to be covered in a separate topic after experiments are completed.