LoRA Effect Validation: The Principle of verify_lora.py and Practical Tests on Mac

0. Series Loop

Position	Upstream	Output	Downstream
Post 8/10	Post 07: Loss Convergence	Qualitative conclusion: Whether it’s “gentler, less preachy”	Post 09: Qwen-specific Issues · Post 10: vLLM Deployment

Low loss ≠ product readiness. This post is the final human-readable checkpoint before deployment.

1. The Actual Problem to Solve

Post 07: token accuracy 96%, loss 0.13.
Still cannot answer:

Will the base model still give a “suggestion list” under the same system prompt?
Is LoRA just parroting the training set’s user turns?
Can we run a smoke test on Mac without renting a GPU?

verify_lora.py design goals (file header lines 10–12):

Read trainer_state.json to check loss
For the same user, compare generation before/after fine-tuning (optional)
Manual evaluation of whether it matches “elderly companionship”

2. Implementation Location

Symbol	Line	Purpose
`SYSTEM_PROMPT`	33–36	Consistent with JSONL system prompt
`DEFAULT_QUESTIONS`	38–42	Three-topic smoke test questions
`print_training_metrics`	60–94	Read checkpoint loss
`get_device_and_dtype`	101–107	CUDA / MPS / CPU
`load_base_model` / `load_lora_model`	110–134	Load separately to avoid GPU memory spikes
`generate_reply`	152–176	chat_template + generate
`extract_final_reply`	137–149	Strip thinking blocks
`verify_questions`	188–218	Main flow

Path constants:

1
2
3

BASE_MODEL = "./models/Qwen3.5-4B"
LORA_PATH = "./output/lora_elderly_single/final_lora"
TRAINER_STATE = "./output/lora_elderly_single/checkpoint-750/trainer_state.json"

3. Command Line Modes

python verify_lora.py                          # loss + 3 default questions, compare before/after
python verify_lora.py --question "Can't sleep at night"   # Single question
python verify_lora.py --metrics                # Loss only, no generation (no GPU needed)
python verify_lora.py --lora-only              # Load LoRA only, skip base model (save VRAM)
python verify_lora.py --lora-path /path/to/final_lora

User-tested on Mac mini with custom multiple questions + LoRA inference (terminal shows Test 1/1 and MPS), confirming --lora-only or single-question mode works on Apple Silicon.

4. VRAM Strategy: Sequential Loading, Explicit Release

if not lora_only:
    base = load_base_model()
    print(f"【Before fine-tuning】{generate_reply(base, tokenizer, question)}")
    free_model(base)

lora = load_lora_model(lora_path)
print(f"【After fine-tuning】{generate_reply(lora, tokenizer, question)}")
free_model(lora)

free_model (lines 179–185): del model + gc.collect() + cuda.empty_cache() / mps.empty_cache().

Reason: A 4B base model + another 4B base model with LoRA loaded simultaneously would definitely blow up on a 16GB Mac. Even on a 32GB V100, sequential loading is more stable.

5. Key Differences Between `generate_reply` and Training

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,   # Inference: add assistant start token
    enable_thinking=False,        # Qwen3.5: disable thinking chain
)
output = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

Item	Training (`train_lora_single.py`)	Verification (this post)
add_generation_prompt	False	True
enable_thinking	—	False
Goal	Fit the entire text	Generate new tokens

If enable_thinking=False is missed, the output may be a full English thinking process (covered in Post 09).

6. Mac MPS Real-World Measurements (Project Output)

Environment: Mac mini · LoRA_Demo · .venv · mps + torch.float16

Log snippet:

1 2	`Inference device: mps \| Precision: torch.float16 Loading weights: 100%\|...\| 426/426`

Also displays a flash-attn warning (falling back to torch implementation), does not affect generation.

6.1 Example 1: Insomnia / Loneliness

	Content
user	Can’t sleep at night
LoRA	Night loneliness is the hardest. I’m here with you, don’t worry about falling asleep. Let’s chat slowly, and you can rest when you get sleepy.

Evaluation: Matches the high-frequency sentence patterns in the training set: empathy + companionship + no urging. ✅

6.2 Example 2: Health Anxiety (Phrasing not in training set)

	Content
user	I’m afraid to look at my medical checkup report; if I see it, I won’t be able to sleep.
LoRA	I understand the fear of being unable to move or becoming a burden to others. But if we exercise well and take care of ourselves now… I’m here with you.

Evaluation: The emotion category (health anxiety + fear of being a burden) is correct, but it does not precisely echo “medical checkup report” — template-style generalization. Acceptable for production, but documentation should avoid exaggerating “customized every sentence.” ⚠️

6.3 Example 3: Self-Blame / Fear of Burdening Others

	Content
user	Sometimes I feel I shouldn’t make them take leave for me anymore.
LoRA	You’re too frugal, too considerate of your children… You should spend, you should eat well… That itself is the best reward for them. I’m here with you.

Evaluation: Aligns with the “fear of causing trouble” category. ✅

Script closing message (line 218):

1	`Validation complete. Observe whether the fine-tuned model is: gentler, more empathetic, less preachy.`

7. Before vs After Fine-Tuning (Expected Differences)

Full comparison requires not using --lora-only and sufficient memory on Mac to load the base model once.

Dimension	Base Qwen3.5-4B	+ LoRA
Opening	Tends to “I understand your feelings, suggest…”	Tends to “I understand / I really get it…”
Structure	List of suggestions	Short empathetic sentences + companionship
Taboos	May include “you should”	Data biased towards “I’m here with you”

Default question 1 (DEFAULT_QUESTIONS[0]):

1	`The kids are all busy, no one to talk to all day, the house is so quiet and lonely.`

Running the full verify_lora.py with this question on a V100 or Mac, and taking a screenshot, can be used as an illustration for the post.

8. Validation Rubric (Executable)

Check Point	Pass Criteria
Loss	Decrease from start to end (metrics mode)
Empathy	Address the emotion first, not direct solution
Companionship	Phrases like “I’m here with you”, “let’s talk slowly”
Not preachy	No pile of “you should”, “I suggest you”
Generalization	Do not reproduce entire user turns from training set
Safety	No diagnosis, no promise of efficacy

9. Pitfalls

Pitfall 1: device_map="auto" with PeftModel on Mac
Script comment at line 102: will error. Must use .to(device) before attaching LoRA.

Pitfall 2: SYSTEM_PROMPT mismatch with JSONL
Validation passes, but when using a different system prompt in vLLM for deployment, users feel it has “gone back to being preachy.”

Pitfall 3: Treating --lora-only as full validation
Only proves that the adapter loaded successfully and the style resembles the training set. Cannot prove improvement over the base model.

Pitfall 4: Judging the model by a single generation
sampling has randomness. Run the same question 2–3 times, or temporarily set do_sample=False for comparison.

10. Summary

verify_lora.py = metrics + optional A/B generation.
Sequential loading + free_model is the key to VRAM management.
enable_thinking=False and the difference in training template must be understood.
Mac MPS tested successfully, suitable for local smoke tests.
Health-related user queries may trigger similar phrasing — honestly document this in product expectations.

Appendix: extract_final_reply

# LoRA_Demo/verify_lora.py lines 137-149
# Qwen3.5 thinking block tag in source code is redacted_thinking; using string concatenation to avoid editor auto-deletion

start_tag = "<" + "redacted_thinking" + ">"
end_tag = "</" + "redacted_thinking" + ">"
if end_tag in text:
    text = text.split(end_tag, 1)[-1]
# ... regex remove full block and unclosed block
return text.strip()

Post	Link
Previous	07 · Training Curve
Next	09 · Qwen3.5 Pitfalls
Index	README

← Back to LoRA Elderly Companion Topic