0. Series Closure

This Article Position Upstream This Article Output Downstream
Article 9/10 Article 08 Inference Verification Qwen3.5 Specific Issue Checklist Article 10 vLLM Request Parameters

The issues in this article are not unique to LoRA, but they can easily lead to false positives like “verification fails like a failure” or “API outputs English”, mistakenly attributed to “fine-tuning being ineffective.”


1. Actual Problems to Solve

Qwen3.5 introduces Thinking Mode compared to Qwen2: the model can first output its reasoning process through an internal channel, then give the final reply.

In elderly companionship scenarios:

  • The user expects short, empathetic Chinese sentences
  • The thinking chain is often in English, verbose, and looks like debug output
  • If not filtered, the product becomes completely unusable

Additionally: both all_logs.log and Mac verification show a flash-linear-attention not installed warning; we need to know whether it blocks training.


2. Implementation Locations

Issue Code / Log Location
Disable thinking (local) verify_lora.py lines 159–164 enable_thinking=False
Strip thinking residues verify_lora.py lines 137–149 extract_final_reply
flash-attn warning all_logs.log line 27; also appears during Mac verification
Disable thinking (vLLM) README.md lines 213–233 chat_template_kwargs
trust_remote_code train_lora_single.py lines 161, 180

3. Pitfall 1: Thinking Mode Not Disabled

Symptom

generate or vLLM returns large sections like:

1
2
Thinking Process:
The user feels lonely. I should respond empathetically...

Or only English output, with no Chinese companionship sentences.

Root Cause

The Qwen3.5 chat_template defaults to enable_thinking=True (reasoning path).

Fix (Local)

1
2
3
4
5
6
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False,
)

Fix (vLLM)

1
"chat_template_kwargs": {"enable_thinking": false}

Python SDK:

1
extra_body={"chat_template_kwargs": {"enable_thinking": False}}

Fallback

Even if disabled, occasional residues remain. extract_final_reply truncates based on the redacted_thinking tag. During training, apply_chat_template uses the full messages and does not involve thinking generation.


4. Pitfall 2: flash-linear-attention / causal-conv1d Not Installed

Log Excerpt

1
2
3
4
The fast path is not available because one of the required library is not installed.
Falling back to torch implementation.
To install follow https://github.com/fla-org/flash-linear-attention#installation
and https://github.com/Dao-AILab/causal-conv1d

Impact

Correctness ✅ Falls back to PyTorch; training and inference both complete
Speed Slightly slower; this project completed 750 steps in 41 minutes on V100
Mac MPS Same warning; user verification still succeeds

Is Installation Mandatory?

Not mandatory. If compilation fails in cloud environments, you can skip it. Install only if you need maximum throughput:

1
pip install flash-linear-attention causal-conv1d

5. Pitfall 3: Tokenizer and Model Config Token ID Mismatch

Log

1
2
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config...
Updated tokens: {'eos_token_id': 248046, 'pad_token_id': 248046}

Handling

In train_lora_single.py line 164:

1
tokenizer.pad_token = tokenizer.eos_token

The Trainer automatically aligns them at startup. No need to manually modify config.json.


6. Pitfall 4: Mac MPS + PEFT device_map

Symptom

Using device_map="auto" to load the base model, then PeftModel.from_pretrained on MPS causes device mismatch errors.

Fix

In verify_lora.py lines 110–134: Do not use device_map; explicitly use .to(device).

The training script still uses device_map={"": gpu_id} on CUDA — the device strategy for verification and training are deliberately different, do not copy blindly.


7. Pitfall 5: trust_remote_code=False

Qwen3.5 model classes are in Python files within the repository. Omitting trust_remote_code=True will cause:

1
ValueError: ... does not recognize this architecture

Must be set to True consistently in three places:

  • train_lora_single.py loading tokenizer/model
  • verify_lora.py loading
  • vLLM serving from a local path when the model directory is complete

8. Pitfall 6: pad_token_id Warning for Open-End Generation

During verification you may see:

1
Setting `pad_token_id` to `eos_token_id` for open-end generation.

This is a transformers hint during generate, consistent with the training-side pad setting. Can be ignored.


9. Three-End Checklist (Training / Verification / Deployment)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Training (train_lora_single.py)
□ trust_remote_code=True
□ pad_token = eos_token
□ add_generation_prompt=False

Verification (verify_lora.py)
□ enable_thinking=False
□ add_generation_prompt=True
□ SYSTEM_PROMPT matches JSONL
□ MPS: do NOT use device_map="auto" + PEFT

Deployment (vLLM)
□ chat_template_kwargs.enable_thinking=false
□ system prompt matches training
□ model name matches --lora-modules (Article 10)

10. Summary

  1. Thinking is the number one pitfall for Qwen3.5 verification/deployment; must be explicitly disabled + optional post-processing.
  2. flash-attn warnings only affect speed; training and verification work in practice.
  3. Mac MPS verification path uses a different device strategy than CUDA training path.
  4. When seeing “fine-tuning ineffective”, first check thinking, then system prompt.
  5. The next article implements deployment-side parameters in vLLM.

Appendix: Training vs Inference Template Comparison

Parameter train_lora_single.py verify_lora.py vLLM
add_generation_prompt False True Handled by API automatically
enable_thinking False chat_template_kwargs
messages include system Yes (JSONL) Yes Yes

Series Navigation

Article Link
Previous 08 · Effect Verification
Next 10 · vLLM Deployment
Index README

← Back to LoRA Elderly Companionship Topic