0. Series Context

| Position in Series | Upstream | This Article's Output | Downstream |
|---|---|---|---|
| Article 7/10 | Article 06: Training Complete | Metric interpretation, determining convergence | Article 08: Qualitative validation · Decide whether to add epochs |

This article interprets only the existing logs; no curves are fabricated. Every number can be verified by grepping all_logs.log.


1. The Actual Problem

Common questions after training finishes:

  • Loss went from 2.8 to 0.13; is that good?
  • Why is the average train_loss (0.26) so much larger than the last step’s 0.13?
  • Does the loss rebound in the final steps (0.094 → 0.129) mean training is broken?
  • Without eval_loss, how can overfitting be judged?

2. Implementation Location

| File | Content |
|---|---|
| LoRA_Demo/all_logs.log | 1548 lines; per-step Callback output plus the final summary |
| output/lora_elderly_single/checkpoint-*/trainer_state.json | Structured log_history |
| LoRA_Demo/verify_lora.py, print_training_metrics() | Quick print of first/last loss |

python verify_lora.py --metrics

3. Four Milestones (Log Excerpts)

| Step | Epoch | loss | mean_token_accuracy | Source |
|---|---|---|---|---|
| 1 | 0.00 | 2.8099 | 0.436 | log line 42 |
| 250 | 1.00 | 0.2402 | 0.957 | log line 540 |
| 500 | 2.00 | 0.1443 | 0.967 | log line 1040 |
| 750 | 3.00 | 0.1286 | 0.963 | log line 1540 |
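
The milestone rows above can also be pulled from the structured log instead of being read off by eye. The following is a minimal sketch, assuming trainer_state.json holds a "log_history" list whose per-step entries carry "step", "epoch", "loss" and (optionally) "mean_token_accuracy" keys, as the appendix excerpt suggests. The function names and the placeholder path are illustrative and not part of the project.

```python
import json
from pathlib import Path


def load_log_history(state_path):
    """Read the log_history list from a trainer_state.json file."""
    with Path(state_path).open(encoding="utf-8") as f:
        return json.load(f)["log_history"]


def milestone_rows(log_history, steps):
    """Return the per-step log entries whose step is in `steps`, in step order."""
    wanted = set(steps)
    # Entries without a "loss" key (such as the final summary) are skipped.
    rows = [e for e in log_history if "loss" in e and e.get("step") in wanted]
    return sorted(rows, key=lambda e: e["step"])


def format_row(entry):
    """Format one entry like a row of the milestone table."""
    acc = entry.get("mean_token_accuracy")
    acc_text = f"{acc:.3f}" if acc is not None else "n/a"
    return f"{entry['step']:>4}  {entry['epoch']:.2f}  {entry['loss']:.4f}  {acc_text}"


# Usage (placeholder path; substitute a real checkpoint directory):
# history = load_log_history("<checkpoint-dir>/trainer_state.json")
# for row in milestone_rows(history, [1, 250, 500, 750]):
#     print(format_row(row))
```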

Summary line (log line 1542):

{"train_runtime": "2484", "train_samples_per_second": "1.208",
"train_steps_per_second": "0.302", "train_loss": "0.2587", "epoch": "3"}

3.1 Three-Phase Interpretation

Phase A (Steps 1–250): Rapid Decline
Loss drops from 2.81 to 0.24. The model learns:

  • Token arrangement under the Qwen chat template
  • High-frequency patterns in the assistant’s empathetic phrasing

Phase B (Steps 250–500): Slowing Down
Loss drops from 0.24 to 0.14 as the model starts fitting finer word choices.

Phase C (Steps 500–750): Diminishing Returns
Loss drops from 0.14 to 0.13. Further training still helps, but only marginally; a 4th epoch would raise the risk of overfitting on only 1000 samples.

3.2 train_loss 0.2587 vs Final 0.1286

train_loss is the arithmetic mean of the loss over all 750 steps, so the early steps (loss ≈ 2.8) pull the average up.
Do not read 0.26 as “the final result is poor”; look instead at the loss plateau near Epoch 3, where the per-step loss oscillates between 0.09 and 0.13.
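
To see how much the first epoch inflates the full-run average, the per-step losses can be grouped by epoch. This is a minimal sketch, assuming one log entry per optimizer step and 250 steps per epoch (750 steps over 3 epochs in this run); `per_epoch_mean_loss` is an illustrative helper, not a function from the project.

```python
from statistics import mean


def per_epoch_mean_loss(log_history, steps_per_epoch):
    """Group per-step losses by epoch and return (per_epoch_means, overall_mean)."""
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    buckets = {}
    for entry in log_history:
        if "loss" not in entry:
            continue  # skip entries such as the final summary
        # Steps are 1-based: steps 1..250 fall in epoch index 0, and so on.
        epoch_index = (entry["step"] - 1) // steps_per_epoch
        buckets.setdefault(epoch_index, []).append(entry["loss"])
    if not buckets:
        raise ValueError("no per-step loss entries found")
    per_epoch = [mean(buckets[k]) for k in sorted(buckets)]
    overall = mean(loss for losses in buckets.values() for loss in losses)
    return per_epoch, overall


# Usage, with `history` being the parsed log_history list:
# per_epoch, overall = per_epoch_mean_loss(history, steps_per_epoch=250)
```

If the first epoch's mean is several times the last epoch's, the gap between the overall average and the final-step loss is explained by the early steps alone.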

3.3 Is Per-Step Fluctuation Normal?

From the end of the log:

Step 748: loss=0.0940
Step 749: loss=0.1099
Step 750: loss=0.1286

The loss at a single step depends on the two samples that happen to be in the current batch. A slight rise at step 750 does not mean training collapsed; judge by the average over steps 700–750 instead.
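
Averaging over a trailing window is a one-liner. The sketch below applies it to the three losses quoted above; for the recommended 700–750 window, the same helper would be called with the full list of per-step losses and `window=51`. `trailing_mean` is an illustrative name.

```python
def trailing_mean(values, window):
    """Mean of the last `window` values (or of all values if fewer exist)."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = values[-window:]
    if not recent:
        raise ValueError("no values to average")
    return sum(recent) / len(recent)


# The three per-step losses quoted above (steps 748-750).
last_three = [0.0940, 0.1099, 0.1286]
print(f"{trailing_mean(last_three, 3):.4f}")  # prints 0.1108
```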


4. Token Accuracy

The mean_token_accuracy log field (reported by TRL's SFTTrainer rather than by Transformers itself):

| Step | Accuracy |
|---|---|
| 1 | 43.6% |
| 250 | 95.7% |
| 750 | 96.3% |

In SFT, this is roughly the fraction of label tokens for which the model's top next-token prediction matches the label. A value above 96% indicates the model can largely reproduce the assistant wording in the training set. Article 08 is still needed to check generalization, because token accuracy does not measure “whether the response is preachy.”
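
As an illustration of what such a metric counts, here is a minimal pure-Python sketch. It is not the library's implementation; it assumes the predicted token ids are already aligned with the labels and that masked positions (prompt tokens, padding) carry the conventional ignore value -100.

```python
IGNORE_INDEX = -100  # label value conventionally used to mask positions out of the loss


def token_accuracy(predicted_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Fraction of non-masked positions where the predicted token equals the label."""
    if len(predicted_ids) != len(label_ids):
        raise ValueError("predictions and labels must have the same length")
    scored = [(p, y) for p, y in zip(predicted_ids, label_ids) if y != ignore_index]
    if not scored:
        raise ValueError("all positions are masked")
    return sum(p == y for p, y in scored) / len(scored)


# Toy example: two prompt positions are masked, three of four scored tokens match.
preds = [11, 12, 5, 6, 7, 9]
labels = [-100, -100, 5, 6, 7, 8]
print(token_accuracy(preds, labels))  # prints 0.75
```

Because every position is scored against the ground-truth prefix, a high value says nothing about how free-running generation behaves on unseen prompts.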


5. Learning Rate Schedule

| Step | lr |
|---|---|
| 1 | 2.0e-4 |
| 250 | 1.34e-4 |
| 500 | 6.69e-5 |
| 750 | 2.67e-7 |

This is the default linear decay. LoRA runs often use a constant or cosine schedule instead, but this project has already converged with linear decay, so there is no need to change the schedule and retrain.
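
The table values are consistent with a linear decay to zero without warmup. The sketch below reproduces them under two assumptions inferred from the numbers rather than read from the training config: the base rate is 2.0e-4, and the value logged at step s reflects s − 1 completed scheduler updates.

```python
def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate logged at 1-based `step` under linear decay to zero, no warmup."""
    if not 1 <= step <= total_steps:
        raise ValueError("step must be in 1..total_steps")
    return base_lr * (total_steps - (step - 1)) / total_steps


for step in (1, 250, 500, 750):
    print(step, f"{linear_decay_lr(step, 750, 2.0e-4):.2e}")
```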


6. Throughput and Cost

| Metric | Value |
|---|---|
| train_runtime | 2484 s ≈ 41 min 24 s |
| Seconds per step (tqdm) | ~3.31 s/it |
| Samples per second | 1.208 |

Estimated cloud cost: V100 hourly price × 0.69 h. Acceptable for personal experiments.
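
The throughput figures follow directly from the summary line, as this short sketch shows. The hourly price is a placeholder to be replaced with the actual rate; `run_cost` is an illustrative helper.

```python
def run_cost(train_runtime_s, total_steps, total_samples, hourly_price):
    """Derive throughput figures and an estimated cost from the run summary."""
    hours = train_runtime_s / 3600
    return {
        "hours": hours,
        "seconds_per_step": train_runtime_s / total_steps,
        "samples_per_second": total_samples / train_runtime_s,
        "cost": hours * hourly_price,
    }


# 1000 samples x 3 epochs = 3000 samples, processed in 750 steps over 2484 s.
stats = run_cost(2484, 750, 3000, hourly_price=1.0)  # 1.0 is a placeholder price
print(
    f"{stats['hours']:.2f} h, "
    f"{stats['seconds_per_step']:.2f} s/it, "
    f"{stats['samples_per_second']:.3f} samples/s"
)
# prints 0.69 h, 3.31 s/it, 1.208 samples/s
```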


7. How to Judge Without eval_loss

In this project, train_dataset contains all 1000 samples; there is no hold-out set.

| Method | How |
|---|---|
| Qualitative | verify_lora.py runs 3 default questions plus custom user input |
| Quantitative (not done) | Split off 100 validation samples, set eval_strategy="steps" |
| Overfitting signal | Training loss keeps decreasing, but validation replies repeat training sentences |

The loss plateau, together with the validation examples (three custom users tested on Mac MPS), shows that the style meets expectations. No strict held-out metric was computed, however; a validation set should be added for papers or product reports.
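
The quantitative option that was not done could start from a deterministic split like the one below. This is a sketch under assumptions: the samples are held in a plain Python list, and the seed and the helper name `split_holdout` are arbitrary choices, not taken from the project.

```python
import random


def split_holdout(samples, holdout_size=100, seed=42):
    """Shuffle indices deterministically and split into (train, validation) lists."""
    if not 0 < holdout_size < len(samples):
        raise ValueError("holdout_size must be between 1 and len(samples) - 1")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    holdout = set(indices[:holdout_size])
    train = [s for i, s in enumerate(samples) if i not in holdout]
    validation = [s for i, s in enumerate(samples) if i in holdout]
    return train, validation


# Usage: train, validation = split_holdout(all_samples, holdout_size=100)
```

The validation list would then be supplied to the trainer as its evaluation dataset, with eval_strategy="steps" so that eval_loss is logged alongside the training loss.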


8. Checkpoint Selection

| Directory | Use Case |
|---|---|
| final_lora | Default for deployment and verification |
| checkpoint-700 | A/B test if generation quality at step 750 looks abnormal |
| checkpoint-500 | Loss is already low; use it to reduce overfitting risk |

In this project, the step-750 model behaves normally, so there is no need to roll back.


9. Pitfalls

Pitfall 1: Only looking at the tqdm loss and ignoring the Callback
The two should agree; if the Callback prints nothing, check whether TrainingProgressCallback was removed.

Pitfall 2: Treating a single-step grad_norm spike as failure
At step 750, grad_norm=1.14 is slightly above the usual 0.6–0.9 range; a single high step is acceptable.

Pitfall 3: Expecting loss → 0
SFT loss is cross-entropy, and the labels are inherently variable (synonyms, alternative sentences for the same intent), so a plateau around 0.1 is normal.
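
One way to build intuition for the plateau is to convert the loss back into a probability. Assuming the logged loss is a mean per-token cross-entropy in nats, exp(-loss) is the geometric-mean probability the model assigns to the label token; the helper name is illustrative.

```python
import math


def loss_to_token_probability(mean_cross_entropy):
    """Geometric-mean probability assigned to the label token: exp(-loss)."""
    return math.exp(-mean_cross_entropy)


# First and last per-step losses from the milestone table.
for loss in (2.8099, 0.1286):
    print(f"loss {loss:.4f} -> p = {loss_to_token_probability(loss):.3f}")
```

Under that assumption, the step-1 loss corresponds to roughly 0.06 per token and the step-750 loss to roughly 0.88; reaching loss 0 would require assigning probability 1 to every label token, which variable wording rules out.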

Pitfall 4: Comparing logs from old runs with training under new parameters
Changing the batch size or the number of epochs alters the total step count, so the curves cannot be compared step for step.


10. Summary

  1. Loss falls from 2.81 to 0.13, with most of the gain in the 1st epoch.
  2. train_loss 0.2587 is the full-run average, not a measure of final quality.
  3. Token accuracy of 96% only indicates fit to the labels; semantic verification via inference is still needed.
  4. About 41 minutes for 750 steps is the baseline on a V100.
  5. Without an eval set, verify_lora.py is a necessary supplement.

Appendix: print_training_metrics Logic

# LoRA_Demo/verify_lora.py lines 60-94 (excerpt)
# `state` is the dict parsed from trainer_state.json.

# Keep only per-step entries; the final summary entry has "train_loss" but no "loss" key.
logs = [x for x in state["log_history"] if "loss" in x]
first, last = logs[0], logs[-1]
print(f" Starting loss: {first['loss']:.4f} (step {first['step']})")
print(f" Final loss: {last['loss']:.4f} (step {last['step']})")
if "mean_token_accuracy" in last:
    print(f" Final token accuracy: {last['mean_token_accuracy']:.1%}")
