0. Series Closure
| This Article Position | Upstream | This Article Output | Downstream |
|---|---|---|---|
| Article 9/10 | Article 08 Inference Verification | Qwen3.5 Specific Issue Checklist | Article 10 vLLM Request Parameters |
The issues in this article are not unique to LoRA, but they can easily lead to false positives like “verification fails like a failure” or “API outputs English”, mistakenly attributed to “fine-tuning being ineffective.”
1. Actual Problems to Solve
Qwen3.5 introduces Thinking Mode compared to Qwen2: the model can first output its reasoning process through an internal channel, then give the final reply.
In elderly companionship scenarios:
- The user expects short, empathetic Chinese sentences
- The thinking chain is often in English, verbose, and looks like debug output
- If not filtered, the product becomes completely unusable
Additionally: both all_logs.log and Mac verification show a flash-linear-attention not installed warning; we need to know whether it blocks training.
2. Implementation Locations
| Issue | Code / Log Location |
|---|---|
| Disable thinking (local) | verify_lora.py lines 159–164 enable_thinking=False |
| Strip thinking residues | verify_lora.py lines 137–149 extract_final_reply |
| flash-attn warning | all_logs.log line 27; also appears during Mac verification |
| Disable thinking (vLLM) | README.md lines 213–233 chat_template_kwargs |
| trust_remote_code | train_lora_single.py lines 161, 180 |
3. Pitfall 1: Thinking Mode Not Disabled
Symptom
generate or vLLM returns large sections like:
1 | |
Or only English output, with no Chinese companionship sentences.
Root Cause
The Qwen3.5 chat_template defaults to enable_thinking=True (reasoning path).
Fix (Local)
1 | |
Fix (vLLM)
1 | |
Python SDK:
1 | |
Fallback
Even if disabled, occasional residues remain. extract_final_reply truncates based on the redacted_thinking tag. During training, apply_chat_template uses the full messages and does not involve thinking generation.
4. Pitfall 2: flash-linear-attention / causal-conv1d Not Installed
Log Excerpt
1 | |
Impact
| Correctness | ✅ Falls back to PyTorch; training and inference both complete |
| Speed | Slightly slower; this project completed 750 steps in 41 minutes on V100 |
| Mac MPS | Same warning; user verification still succeeds |
Is Installation Mandatory?
Not mandatory. If compilation fails in cloud environments, you can skip it. Install only if you need maximum throughput:
1 | |
5. Pitfall 3: Tokenizer and Model Config Token ID Mismatch
Log
1 | |
Handling
In train_lora_single.py line 164:
1 | |
The Trainer automatically aligns them at startup. No need to manually modify config.json.
6. Pitfall 4: Mac MPS + PEFT device_map
Symptom
Using device_map="auto" to load the base model, then PeftModel.from_pretrained on MPS causes device mismatch errors.
Fix
In verify_lora.py lines 110–134: Do not use device_map; explicitly use .to(device).
The training script still uses device_map={"": gpu_id} on CUDA — the device strategy for verification and training are deliberately different, do not copy blindly.
7. Pitfall 5: trust_remote_code=False
Qwen3.5 model classes are in Python files within the repository. Omitting trust_remote_code=True will cause:
1 | |
Must be set to True consistently in three places:
train_lora_single.pyloading tokenizer/modelverify_lora.pyloading- vLLM serving from a local path when the model directory is complete
8. Pitfall 6: pad_token_id Warning for Open-End Generation
During verification you may see:
1 | |
This is a transformers hint during generate, consistent with the training-side pad setting. Can be ignored.
9. Three-End Checklist (Training / Verification / Deployment)
1 | |
10. Summary
- Thinking is the number one pitfall for Qwen3.5 verification/deployment; must be explicitly disabled + optional post-processing.
- flash-attn warnings only affect speed; training and verification work in practice.
- Mac MPS verification path uses a different device strategy than CUDA training path.
- When seeing “fine-tuning ineffective”, first check thinking, then system prompt.
- The next article implements deployment-side parameters in vLLM.
Appendix: Training vs Inference Template Comparison
| Parameter | train_lora_single.py | verify_lora.py | vLLM |
|---|---|---|---|
| add_generation_prompt | False | True | Handled by API automatically |
| enable_thinking | — | False | chat_template_kwargs |
| messages include system | Yes (JSONL) | Yes | Yes |
Series Navigation
| Article | Link |
|---|---|
| Previous | 08 · Effect Verification |
| Next | 10 · vLLM Deployment |
| Index | README |