0. Series Loop

Position Upstream Output Downstream
Post 8/10 Post 07: Loss Convergence Qualitative conclusion: Whether it’s “gentler, less preachy” Post 09: Qwen-specific Issues · Post 10: vLLM Deployment

Low loss ≠ product readiness. This post is the final human-readable checkpoint before deployment.


1. The Actual Problem to Solve

Post 07: token accuracy 96%, loss 0.13.
Still cannot answer:

  • Will the base model still give a “suggestion list” under the same system prompt?
  • Is LoRA just parroting the training set’s user turns?
  • Can we run a smoke test on Mac without renting a GPU?

verify_lora.py design goals (file header lines 10–12):

  1. Read trainer_state.json to check loss
  2. For the same user, compare generation before/after fine-tuning (optional)
  3. Manual evaluation of whether it matches “elderly companionship”

2. Implementation Location

Symbol Line Purpose
SYSTEM_PROMPT 33–36 Consistent with JSONL system prompt
DEFAULT_QUESTIONS 38–42 Three-topic smoke test questions
print_training_metrics 60–94 Read checkpoint loss
get_device_and_dtype 101–107 CUDA / MPS / CPU
load_base_model / load_lora_model 110–134 Load separately to avoid GPU memory spikes
generate_reply 152–176 chat_template + generate
extract_final_reply 137–149 Strip thinking blocks
verify_questions 188–218 Main flow

Path constants:

1
2
3
BASE_MODEL = "./models/Qwen3.5-4B"
LORA_PATH = "./output/lora_elderly_single/final_lora"
TRAINER_STATE = "./output/lora_elderly_single/checkpoint-750/trainer_state.json"

3. Command Line Modes

1
2
3
4
5
python verify_lora.py                          # loss + 3 default questions, compare before/after
python verify_lora.py --question "Can't sleep at night" # Single question
python verify_lora.py --metrics # Loss only, no generation (no GPU needed)
python verify_lora.py --lora-only # Load LoRA only, skip base model (save VRAM)
python verify_lora.py --lora-path /path/to/final_lora

User-tested on Mac mini with custom multiple questions + LoRA inference (terminal shows Test 1/1 and MPS), confirming --lora-only or single-question mode works on Apple Silicon.


4. VRAM Strategy: Sequential Loading, Explicit Release

1
2
3
4
5
6
7
8
if not lora_only:
base = load_base_model()
print(f"【Before fine-tuning】{generate_reply(base, tokenizer, question)}")
free_model(base)

lora = load_lora_model(lora_path)
print(f"【After fine-tuning】{generate_reply(lora, tokenizer, question)}")
free_model(lora)

free_model (lines 179–185): del model + gc.collect() + cuda.empty_cache() / mps.empty_cache().

Reason: A 4B base model + another 4B base model with LoRA loaded simultaneously would definitely blow up on a 16GB Mac. Even on a 32GB V100, sequential loading is more stable.


5. Key Differences Between generate_reply and Training

1
2
3
4
5
6
7
8
9
10
11
12
13
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True, # Inference: add assistant start token
enable_thinking=False, # Qwen3.5: disable thinking chain
)
output = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.7,
top_p=0.9,
)
Item Training (train_lora_single.py) Verification (this post)
add_generation_prompt False True
enable_thinking False
Goal Fit the entire text Generate new tokens

If enable_thinking=False is missed, the output may be a full English thinking process (covered in Post 09).


6. Mac MPS Real-World Measurements (Project Output)

Environment: Mac mini · LoRA_Demo · .venv · mps + torch.float16

Log snippet:

1
2
Inference device: mps  |  Precision: torch.float16
Loading weights: 100%|...| 426/426

Also displays a flash-attn warning (falling back to torch implementation), does not affect generation.

6.1 Example 1: Insomnia / Loneliness

Content
user Can’t sleep at night
LoRA Night loneliness is the hardest. I’m here with you, don’t worry about falling asleep. Let’s chat slowly, and you can rest when you get sleepy.

Evaluation: Matches the high-frequency sentence patterns in the training set: empathy + companionship + no urging. ✅

6.2 Example 2: Health Anxiety (Phrasing not in training set)

Content
user I’m afraid to look at my medical checkup report; if I see it, I won’t be able to sleep.
LoRA I understand the fear of being unable to move or becoming a burden to others. But if we exercise well and take care of ourselves now… I’m here with you.

Evaluation: The emotion category (health anxiety + fear of being a burden) is correct, but it does not precisely echo “medical checkup report” — template-style generalization. Acceptable for production, but documentation should avoid exaggerating “customized every sentence.” ⚠️

6.3 Example 3: Self-Blame / Fear of Burdening Others

Content
user Sometimes I feel I shouldn’t make them take leave for me anymore.
LoRA You’re too frugal, too considerate of your children… You should spend, you should eat well… That itself is the best reward for them. I’m here with you.

Evaluation: Aligns with the “fear of causing trouble” category. ✅

Script closing message (line 218):

1
Validation complete. Observe whether the fine-tuned model is: gentler, more empathetic, less preachy.

7. Before vs After Fine-Tuning (Expected Differences)

Full comparison requires not using --lora-only and sufficient memory on Mac to load the base model once.

Dimension Base Qwen3.5-4B + LoRA
Opening Tends to “I understand your feelings, suggest…” Tends to “I understand / I really get it…”
Structure List of suggestions Short empathetic sentences + companionship
Taboos May include “you should” Data biased towards “I’m here with you”

Default question 1 (DEFAULT_QUESTIONS[0]):

1
The kids are all busy, no one to talk to all day, the house is so quiet and lonely.

Running the full verify_lora.py with this question on a V100 or Mac, and taking a screenshot, can be used as an illustration for the post.


8. Validation Rubric (Executable)

Check Point Pass Criteria
Loss Decrease from start to end (metrics mode)
Empathy Address the emotion first, not direct solution
Companionship Phrases like “I’m here with you”, “let’s talk slowly”
Not preachy No pile of “you should”, “I suggest you”
Generalization Do not reproduce entire user turns from training set
Safety No diagnosis, no promise of efficacy

9. Pitfalls

Pitfall 1: device_map="auto" with PeftModel on Mac
Script comment at line 102: will error. Must use .to(device) before attaching LoRA.

Pitfall 2: SYSTEM_PROMPT mismatch with JSONL
Validation passes, but when using a different system prompt in vLLM for deployment, users feel it has “gone back to being preachy.”

Pitfall 3: Treating --lora-only as full validation
Only proves that the adapter loaded successfully and the style resembles the training set. Cannot prove improvement over the base model.

Pitfall 4: Judging the model by a single generation
sampling has randomness. Run the same question 2–3 times, or temporarily set do_sample=False for comparison.


10. Summary

  1. verify_lora.py = metrics + optional A/B generation.
  2. Sequential loading + free_model is the key to VRAM management.
  3. enable_thinking=False and the difference in training template must be understood.
  4. Mac MPS tested successfully, suitable for local smoke tests.
  5. Health-related user queries may trigger similar phrasing — honestly document this in product expectations.

Appendix: extract_final_reply

1
2
3
4
5
6
7
8
9
# LoRA_Demo/verify_lora.py lines 137-149
# Qwen3.5 thinking block tag in source code is redacted_thinking; using string concatenation to avoid editor auto-deletion

start_tag = "<" + "redacted_thinking" + ">"
end_tag = "</" + "redacted_thinking" + ">"
if end_tag in text:
text = text.split(end_tag, 1)[-1]
# ... regex remove full block and unclosed block
return text.strip()

Series Navigation

Post Link
Previous 07 · Training Curve
Next 09 · Qwen3.5 Pitfalls
Index README

← Back to LoRA Elderly Companion Topic