0. Series Closure
| Position in Series | Upstream | This Article Output | Downstream |
|---|---|---|---|
| Article 7/10 | Article 06: Training Complete | Metric interpretation, determining convergence | Article 08: Qualitative validation · Decide whether to add epochs |
This article interprets only the existing logs; no curves are fabricated. Every number can be verified by grepping all_logs.log.
1. The Actual Problem
Common questions after training finishes:
- Loss 2.8 → 0.13, is that good?
- Why is the average train_loss 0.26 much larger than the last step’s 0.13?
- Does the loss rebound in the final steps (0.094 → 0.129) mean training is broken?
- Without eval_loss, how can overfitting be judged?
2. Implementation Location
| File | Content |
|---|---|
| LoRA_Demo/all_logs.log | 1548 lines, includes per-step Callback + summary |
| output/lora_elderly_single/checkpoint-*/trainer_state.json | Structured log_history |
| LoRA_Demo/verify_lora.py → print_training_metrics() | Quick print of first/last loss |
3. Four Milestones (Log Excerpts)
| Step | Epoch | loss | mean_token_accuracy | Source |
|---|---|---|---|---|
| 1 | 0.00 | 2.8099 | 0.436 | log line 42 |
| 250 | 1.00 | 0.2402 | 0.957 | log line 540 |
| 500 | 2.00 | 0.1443 | 0.967 | log line 1040 |
| 750 | 3.00 | 0.1286 | 0.963 | log line 1540 |
Summary line (log line 1542): train_loss 0.2587, train_runtime 2484 s, 1.208 samples/s (the values discussed in sections 3.2 and 6).
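The same milestone rows can be pulled from the structured log_history instead of by log line number. The following is a minimal sketch, assuming each per-step entry is a dict with the keys "step", "epoch", "loss", and "mean_token_accuracy"; those key names were not checked against this project's trainer_state.json, and `pick_milestones` is a hypothetical helper. The inline sample reuses the numbers from the milestone table so the snippet runs without the checkpoint directory.

```python
def pick_milestones(log_history, steps):
    """Return the per-step log entries whose "step" is in `steps`.

    Entries without a "loss" key (for example a final summary entry)
    are skipped.
    """
    wanted = set(steps)
    return [
        entry
        for entry in log_history
        if "loss" in entry and entry.get("step") in wanted
    ]


# Inline sample built from the milestone table above.
sample_history = [
    {"step": 1, "epoch": 0.00, "loss": 2.8099, "mean_token_accuracy": 0.436},
    {"step": 250, "epoch": 1.00, "loss": 0.2402, "mean_token_accuracy": 0.957},
    {"step": 500, "epoch": 2.00, "loss": 0.1443, "mean_token_accuracy": 0.967},
    {"step": 750, "epoch": 3.00, "loss": 0.1286, "mean_token_accuracy": 0.963},
    {"step": 750, "train_loss": 0.2587},  # summary-style entry, no "loss" key
]

for row in pick_milestones(sample_history, [1, 250, 500, 750]):
    print(row["step"], row["epoch"], row["loss"], row["mean_token_accuracy"])
```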
3.1 Three-Phase Interpretation
Phase A (Step 1–250): Rapid Decline
loss 2.81 → 0.24. The model learns:
- Token arrangement under Qwen chat template
- High-frequency patterns in the assistant’s empathetic phrasing
Phase B (250–500): Slowing Down
0.24 → 0.14. Starts fitting finer word choices.
Phase C (500–750): Diminishing Returns
0.14 → 0.13. Further training still helps, but only slightly; a 4th epoch would raise the risk of overfitting on just 1000 samples.
3.2 train_loss 0.2587 vs Final 0.1286
train_loss is the arithmetic mean over 750 steps. The early steps (loss≈2.8) pull the average up.
Do not use 0.26 to conclude that the final result is poor; look at the loss plateau near epoch 3, where the loss oscillates between 0.09 and 0.13.
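The pull of the early steps on the average can be checked with plain arithmetic. This sketch uses only the four milestone losses from the table in section 3, not the full 750-step series, so the mean it prints is not the 0.2587 from the log; it only illustrates how one value near 2.8 dominates an average of otherwise small numbers.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


# The four milestone losses from the table in section 3 (steps 1, 250, 500, 750).
milestone_losses = [2.8099, 0.2402, 0.1443, 0.1286]

mean_all = mean(milestone_losses)                # includes the step-1 loss
mean_without_first = mean(milestone_losses[1:])  # drops the step-1 loss
final_loss = milestone_losses[-1]

print(f"mean of all four:      {mean_all:.4f}")
print(f"mean without step 1:   {mean_without_first:.4f}")
print(f"final (step 750) loss: {final_loss:.4f}")
```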
3.3 Is Per-Step Fluctuation Normal?
At the end of the log, the loss moves from 0.094 to 0.129 over the final steps.
Single-step loss depends on the two samples in the current batch. A slight rise at step 750 does not mean training collapsed; judge by the average over steps 700–750 instead.
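One way to make "look at the average over the last steps" concrete is to compare the final logged loss with the mean and spread of a trailing window. This is a minimal sketch: `tail_summary` is a hypothetical helper, the two-standard-deviation cutoff is an arbitrary assumption, and the sample values are made up inside the 0.09–0.13 band described above rather than copied from all_logs.log.

```python
from statistics import mean, pstdev


def tail_summary(losses, window):
    """Summarize the last `window` losses.

    Returns (window_mean, window_std, last_is_outlier), where the last
    value counts as an outlier if it lies more than two standard
    deviations above the window mean. The factor 2 is an arbitrary choice.
    """
    if window < 2 or len(losses) < window:
        raise ValueError("need window >= 2 and at least `window` loss values")
    tail = list(losses)[-window:]
    window_mean = mean(tail)
    window_std = pstdev(tail)
    last_is_outlier = tail[-1] > window_mean + 2 * window_std
    return window_mean, window_std, last_is_outlier


# Made-up values inside the 0.09-0.13 band; NOT real log data.
toy_tail = [0.10, 0.12, 0.09, 0.11, 0.13]
avg, std, outlier = tail_summary(toy_tail, window=5)
print(f"window mean={avg:.3f} std={std:.3f} last step is outlier: {outlier}")
```

With real data, the `losses` argument would be the per-step loss values for steps 700–750 read from log_history.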
4. Token Accuracy
The mean_token_accuracy field in the training log (emitted by TRL's SFTTrainer rather than by Transformers itself):
| Step | Accuracy |
|---|---|
| 1 | 43.6% |
| 250 | 95.7% |
| 750 | 96.3% |
In SFT, this is roughly the fraction of positions where the predicted next token matches the label. 96%+ indicates that the model can already reproduce the assistant wording of the training set. Article 08 is still needed to check generalization, because token accuracy says nothing about whether a response is preachy.
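To make the metric concrete, here is a minimal sketch of a token-level accuracy over masked labels. It assumes predictions and labels are already aligned (the next-token shift has been applied) and that masked positions carry the label -100; it illustrates the idea and is not a copy of the trainer's own implementation.

```python
IGNORE_INDEX = -100  # label value conventionally used to mask non-target tokens


def token_accuracy(predicted_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Fraction of non-masked positions where the prediction equals the label.

    Both sequences are assumed to be already aligned, i.e. the usual
    next-token shift has been applied before calling this function.
    """
    if len(predicted_ids) != len(label_ids):
        raise ValueError("predictions and labels must have the same length")
    scored = [
        (pred, label)
        for pred, label in zip(predicted_ids, label_ids)
        if label != ignore_index
    ]
    if not scored:
        return 0.0
    correct = sum(1 for pred, label in scored if pred == label)
    return correct / len(scored)


# Toy token ids: the first two positions are masked prompt tokens.
labels = [-100, -100, 11, 12, 13, 14]
preds = [7, 8, 11, 12, 99, 14]
print(token_accuracy(preds, labels))  # 3 of 4 scored positions match
```

A high value here only says that the argmax token equals the training label at most positions, which is why it cannot replace reading generated replies.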
5. Learning Rate Schedule
| Step | lr |
|---|---|
| 1 | 2.0e-4 |
| 250 | 1.34e-4 |
| 500 | 6.69e-5 |
| 750 | 2.67e-7 |
This is the default linear decay. LoRA runs often use a constant or cosine schedule instead, but this project has already converged with linear decay, so there is no need to change the schedule and retrain.
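The table values are consistent with a straight line from 2.0e-4 down to zero over 750 steps. A minimal sketch, assuming no warmup and assuming the value logged at step s reflects s - 1 completed scheduler updates; both assumptions are inferred from the table, not read from the training config, and `linear_decay_lr` is a hypothetical helper.

```python
def linear_decay_lr(step, base_lr=2.0e-4, total_steps=750):
    """Learning rate logged at `step` under linear decay to zero, no warmup.

    Assumes the value logged at step s reflects s - 1 completed scheduler
    updates; this offset is inferred from the table, not read from the config.
    """
    if not 1 <= step <= total_steps:
        raise ValueError("step must be in [1, total_steps]")
    remaining = total_steps - (step - 1)
    return base_lr * remaining / total_steps


for step in (1, 250, 500, 750):
    print(step, f"{linear_decay_lr(step):.3e}")
```

The printed values round to the four entries in the table, which supports reading the schedule as plain linear decay.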
6. Throughput and Cost
| Metric | Value |
|---|---|
| train_runtime | 2484 s ≈ 41 min 23 s |
| s/it | ~3.31 s (tqdm) |
| samples/s | 1.208 |
Estimated cloud cost: V100 hourly price × 0.69 h (2484 s / 3600). This is acceptable for personal experiments.
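The arithmetic behind these figures is short enough to write out. In this sketch the runtime comes from the table above and the sample count from 3 epochs over 1000 samples; `HYPOTHETICAL_PRICE` is a placeholder to be replaced with a real quote, not an actual V100 rate.

```python
def training_cost(runtime_seconds, hourly_price):
    """Cost of a run billed by the hour (fractional hours allowed)."""
    if runtime_seconds < 0 or hourly_price < 0:
        raise ValueError("runtime and price must be non-negative")
    return runtime_seconds / 3600 * hourly_price


TRAIN_RUNTIME_S = 2484    # train_runtime from the table above
TOTAL_SAMPLES = 3 * 1000  # 3 epochs over 1000 samples

hours = TRAIN_RUNTIME_S / 3600
samples_per_second = TOTAL_SAMPLES / TRAIN_RUNTIME_S
print(f"{hours:.2f} h, {samples_per_second:.3f} samples/s")

# HYPOTHETICAL_PRICE is a placeholder, not a quoted V100 rate.
HYPOTHETICAL_PRICE = 1.0
cost = training_cost(TRAIN_RUNTIME_S, HYPOTHETICAL_PRICE)
print(f"cost at {HYPOTHETICAL_PRICE}/h: {cost:.2f}")
```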
7. How to Judge Without eval_loss
In this project, train_dataset = all 1000 samples, no hold-out set.
| Method | How |
|---|---|
| Qualitative | verify_lora.py runs 3 default questions plus custom user prompts |
| Quantitative (not done) | Split 100 validation samples, eval_strategy="steps" |
| Overfitting signal | Training loss keeps decreasing but validation replies repeat training sentences |
The current loss plateau plus the validation examples (three custom users tested on Mac MPS) show that the style meets expectations, but no strict held-out metric was computed. A validation set should be added for papers or product reports.
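The quantitative option and the overfitting signal from the table can be sketched in a few lines. This is a minimal, library-free illustration: `split_holdout` and `verbatim_overlap` are hypothetical helpers, the seed and the exact-match criterion are assumptions, and wiring the validation split into the trainer with eval_strategy="steps" is not shown.

```python
import random


def split_holdout(samples, holdout_size=100, seed=42):
    """Shuffle a copy of `samples` and split off a validation set.

    Returns (train, validation). The seed makes the split reproducible.
    """
    if not 0 < holdout_size < len(samples):
        raise ValueError("holdout_size must be between 1 and len(samples) - 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]


def verbatim_overlap(reply, training_replies):
    """True if `reply` reproduces any training reply word for word."""
    normalized = reply.strip()
    return any(normalized == t.strip() for t in training_replies)


# Stand-in records; the real project would pass its 1000 chat samples.
records = [f"sample-{i}" for i in range(1000)]
train, validation = split_holdout(records)
print(len(train), len(validation))
```

Exact-match overlap is a deliberately strict check; a looser similarity measure would catch near-copies, at the cost of more false alarms.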
8. Checkpoint Selection
| Directory | Use Case |
|---|---|
| final_lora | Default deployment/verification |
| checkpoint-700 | A/B test if step 750 generation quality is abnormal |
| checkpoint-500 | Loss already low, try to reduce overfitting risk |
In this project, step 750 behaves normally, so there is no need to roll back.
9. Pitfalls
Pitfall 1: Only looking at tqdm loss, ignoring Callback
The two should agree. If the Callback produces no output, check whether TrainingProgressCallback was removed.
Pitfall 2: Treating a single-step grad_norm spike as failure
At step 750, grad_norm=1.14 is slightly above the usual 0.6–0.9 range; a single elevated step is acceptable.
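The difference between a single spike and a sustained one can be expressed as a run-length check. In this sketch, `sustained_spike` is a hypothetical helper and the threshold and run length are arbitrary assumptions; in the toy series only the 1.14 value comes from the log, and the other numbers are placeholders inside the 0.6–0.9 range.

```python
def sustained_spike(grad_norms, threshold, min_consecutive=3):
    """True if `grad_norms` contains a run of at least `min_consecutive`
    consecutive values above `threshold`."""
    run = 0
    for value in grad_norms:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Toy series: placeholder values in the 0.6-0.9 band plus the step-750 value.
# Only 1.14 comes from the log; the other numbers are NOT real data.
toy_norms = [0.7, 0.8, 0.6, 0.9, 1.14]
print(sustained_spike(toy_norms, threshold=1.0))
```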
Pitfall 3: Expecting loss → 0
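SFT loss is cross-entropy, and the labels are inherently ambiguous: the same meaning can be expressed with synonyms or different sentences, so the model cannot assign probability 1 to every label token. A plateau around 0.1 is normal.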
Pitfall 4: Comparing logs from old versions with new parameter training
Changing the batch size or epoch count alters the total number of steps, so the curves cannot be compared step for step.
10. Summary
- 2.81 → 0.13, main gain in the 1st epoch.
- train_loss 0.2587 is the full-run average, not final quality.
- Token accuracy 96% only indicates fit to labels; semantic verification via inference is needed.
- 41 minutes for 750 steps is the V100 baseline for this run.
- Without an eval set, verify_lora.py is a necessary supplement.
Appendix: print_training_metrics Logic
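The original listing is not reproduced here. As a stand-in, the following sketch shows one way a "print first/last loss" helper could read trainer_state.json. It is a hypothetical reconstruction named `print_training_metrics_sketch`, not the project's print_training_metrics(), and it assumes per-step entries in log_history carry "step" and "loss" keys.

```python
import json
import tempfile
from pathlib import Path


def print_training_metrics_sketch(state_path):
    """Print and return the first and last per-step loss in a trainer_state.json.

    Hypothetical stand-in, not the project's print_training_metrics().
    """
    with Path(state_path).open(encoding="utf-8") as f:
        history = json.load(f).get("log_history", [])
    # Keep only per-step entries; a summary entry has no "loss" key.
    step_logs = [entry for entry in history if "loss" in entry]
    if not step_logs:
        print("no per-step loss entries found")
        return None
    first, last = step_logs[0], step_logs[-1]
    print(f"first: step {first['step']} loss {first['loss']:.4f}")
    print(f"last:  step {last['step']} loss {last['loss']:.4f}")
    return first["loss"], last["loss"]


# Self-contained demo: write a tiny state file with the step-1 and step-750
# values quoted in this article, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "trainer_state.json"
    demo_state = {
        "log_history": [
            {"step": 1, "loss": 2.8099},
            {"step": 750, "loss": 0.1286},
            {"step": 750, "train_loss": 0.2587},
        ]
    }
    demo_path.write_text(json.dumps(demo_state), encoding="utf-8")
    demo_result = print_training_metrics_sketch(demo_path)
```

Against the real run, `state_path` would point at a trainer_state.json under one of the checkpoint-* directories listed in section 2.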
Series Navigation
| Article | Link |
|---|---|
| Previous | 06 · SFT Part 2 |
| Next | 08 · Verification |
| Index | README |