Qwen3.5 Pitfalls: Thinking Chain and flash-attn

0. Series Closure

This Article Position	Upstream	This Article Output	Downstream
Article 9/10	Article 08 Inference Verification	Qwen3.5 Specific Issue Checklist	Article 10 vLLM Request Parameters

The issues in this article are not unique to LoRA, but they can easily lead to false positives like “verification fails like a failure” or “API outputs English”, mistakenly attributed to “fine-tuning being ineffective.”

1. Actual Problems to Solve

Qwen3.5 introduces Thinking Mode compared to Qwen2: the model can first output its reasoning process through an internal channel, then give the final reply.

In elderly companionship scenarios:

The user expects short, empathetic Chinese sentences
The thinking chain is often in English, verbose, and looks like debug output
If not filtered, the product becomes completely unusable

Additionally: both all_logs.log and Mac verification show a flash-linear-attention not installed warning; we need to know whether it blocks training.

2. Implementation Locations

Issue	Code / Log Location
Disable thinking (local)	`verify_lora.py` lines 159–164 `enable_thinking=False`
Strip thinking residues	`verify_lora.py` lines 137–149 `extract_final_reply`
flash-attn warning	`all_logs.log` line 27; also appears during Mac verification
Disable thinking (vLLM)	`README.md` lines 213–233 `chat_template_kwargs`
trust_remote_code	`train_lora_single.py` lines 161, 180

3. Pitfall 1: Thinking Mode Not Disabled

Symptom

generate or vLLM returns large sections like:

1 2	`Thinking Process: The user feels lonely. I should respond empathetically...`

Or only English output, with no Chinese companionship sentences.

Root Cause

The Qwen3.5 chat_template defaults to enable_thinking=True (reasoning path).

Fix (Local)

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

Fix (vLLM)

1	`"chat_template_kwargs": {"enable_thinking": false}`

Python SDK:

1	`extra_body={"chat_template_kwargs": {"enable_thinking": False}}`

Fallback

Even if disabled, occasional residues remain. extract_final_reply truncates based on the redacted_thinking tag. During training, apply_chat_template uses the full messages and does not involve thinking generation.

4. Pitfall 2: flash-linear-attention / causal-conv1d Not Installed

Log Excerpt

The fast path is not available because one of the required library is not installed.
Falling back to torch implementation.
To install follow https://github.com/fla-org/flash-linear-attention#installation
and https://github.com/Dao-AILab/causal-conv1d

Impact


Correctness	✅ Falls back to PyTorch; training and inference both complete
Speed	Slightly slower; this project completed 750 steps in 41 minutes on V100
Mac MPS	Same warning; user verification still succeeds

Is Installation Mandatory?

Not mandatory. If compilation fails in cloud environments, you can skip it. Install only if you need maximum throughput:

1	`pip install flash-linear-attention causal-conv1d`

5. Pitfall 3: Tokenizer and Model Config Token ID Mismatch

Log

1 2	`The tokenizer has new PAD/BOS/EOS tokens that differ from the model config... Updated tokens: {'eos_token_id': 248046, 'pad_token_id': 248046}`

Handling

In train_lora_single.py line 164:

1	`tokenizer.pad_token = tokenizer.eos_token`

The Trainer automatically aligns them at startup. No need to manually modify config.json.

6. Pitfall 4: Mac MPS + PEFT device_map

Symptom

Using device_map="auto" to load the base model, then PeftModel.from_pretrained on MPS causes device mismatch errors.

Fix

In verify_lora.py lines 110–134: Do not use device_map; explicitly use .to(device).

The training script still uses device_map={"": gpu_id} on CUDA — the device strategy for verification and training are deliberately different, do not copy blindly.

7. Pitfall 5: trust_remote_code=False

Qwen3.5 model classes are in Python files within the repository. Omitting trust_remote_code=True will cause:

1	`ValueError: ... does not recognize this architecture`

Must be set to True consistently in three places:

train_lora_single.py loading tokenizer/model
verify_lora.py loading
vLLM serving from a local path when the model directory is complete

8. Pitfall 6: pad_token_id Warning for Open-End Generation

During verification you may see:

1	Setting `pad_token_id` to `eos_token_id` for open-end generation.

This is a transformers hint during generate, consistent with the training-side pad setting. Can be ignored.

9. Three-End Checklist (Training / Verification / Deployment)

Training (train_lora_single.py)
  □ trust_remote_code=True
  □ pad_token = eos_token
  □ add_generation_prompt=False

Verification (verify_lora.py)
  □ enable_thinking=False
  □ add_generation_prompt=True
  □ SYSTEM_PROMPT matches JSONL
  □ MPS: do NOT use device_map="auto" + PEFT

Deployment (vLLM)
  □ chat_template_kwargs.enable_thinking=false
  □ system prompt matches training
  □ model name matches --lora-modules (Article 10)

10. Summary

Thinking is the number one pitfall for Qwen3.5 verification/deployment; must be explicitly disabled + optional post-processing.
flash-attn warnings only affect speed; training and verification work in practice.
Mac MPS verification path uses a different device strategy than CUDA training path.
When seeing “fine-tuning ineffective”, first check thinking, then system prompt.
The next article implements deployment-side parameters in vLLM.

Appendix: Training vs Inference Template Comparison

Parameter	train_lora_single.py	verify_lora.py	vLLM
add_generation_prompt	False	True	Handled by API automatically
enable_thinking	—	False	chat_template_kwargs
messages include system	Yes (JSONL)	Yes	Yes

Article	Link
Previous	08 · Effect Verification
Next	10 · vLLM Deployment
Index	README

← Back to LoRA Elderly Companionship Topic

0. Series Closure

1. Actual Problems to Solve

2. Implementation Locations

3. Pitfall 1: Thinking Mode Not Disabled

Symptom

Root Cause

Fix (Local)

Fix (vLLM)

Fallback

4. Pitfall 2: flash-linear-attention / causal-conv1d Not Installed

Log Excerpt

Impact

Is Installation Mandatory?

5. Pitfall 3: Tokenizer and Model Config Token ID Mismatch

Log

Handling

6. Pitfall 4: Mac MPS + PEFT device_map

Symptom

Fix

7. Pitfall 5: trust_remote_code=False

8. Pitfall 6: pad_token_id Warning for Open-End Generation

9. Three-End Checklist (Training / Verification / Deployment)

10. Summary

Appendix: Training vs Inference Template Comparison

Series Navigation