Understanding the LoRA Training Curve: A 750-Step Hands-On Review

0. Series Context

Position in Series	Upstream	This Article Output	Downstream
Article 7/10	Article 06: Training Complete	Metric interpretation, determining convergence	Article 08: Qualitative validation · Decide whether to add epochs

This article interprets only existing logs; no curves are fabricated. Every number can be verified by grepping `all_logs.log`.

1. The Actual Problem

Common questions after training finishes:

Loss 2.8 → 0.13, is that good?
Why is the average train_loss 0.26 much larger than the last step’s 0.13?
Does the loss rebound in the final steps (0.094 → 0.129) mean training is broken?
Without eval_loss, how do you judge overfitting?

2. Implementation Location

File	Content
`LoRA_Demo/all_logs.log`	1548 lines, includes per-step Callback + summary
`output/lora_elderly_single/checkpoint-*/trainer_state.json`	Structured `log_history`
`LoRA_Demo/verify_lora.py` → `print_training_metrics()`	Quick print of first/last loss

`python verify_lora.py --metrics`

3. Four Milestones (Log Excerpts)

Step	Epoch	loss	mean_token_accuracy	Source
1	0.00	2.8099	0.436	log line 42
250	1.00	0.2402	0.957	log line 540
500	2.00	0.1443	0.967	log line 1040
750	3.00	0.1286	0.963	log line 1540

Summary line (log line 1542):

`{"train_runtime": "2484", "train_samples_per_second": "1.208", "train_steps_per_second": "0.302", "train_loss": "0.2587", "epoch": "3"}`
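
The summary values are logged as strings, so they need converting before any arithmetic. A minimal sketch of a consistency check, assuming only the line quoted above (the variable names are illustrative and not taken from the project):

```python
import json

# Summary line copied from the log excerpt above; every value is a string.
summary_line = (
    '{"train_runtime": "2484", "train_samples_per_second": "1.208", '
    '"train_steps_per_second": "0.302", "train_loss": "0.2587", "epoch": "3"}'
)
summary = {key: float(value) for key, value in json.loads(summary_line).items()}

# Runtime x steps/s should land close to the 750 optimizer steps in the table.
approx_steps = summary["train_runtime"] * summary["train_steps_per_second"]

# samples/s divided by steps/s gives the samples consumed per optimizer step.
samples_per_step = (
    summary["train_samples_per_second"] / summary["train_steps_per_second"]
)

print(f"approx steps: {approx_steps:.0f}")
print(f"samples per optimizer step: {samples_per_step:.1f}")
```

The ratio of about 4 samples per optimizer step agrees with 1000 samples at 250 steps per epoch. How that splits into per-device batch size and gradient accumulation is not visible in this line.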

3.1 Three-Phase Interpretation

Phase A (Step 1–250): Rapid Decline
loss 2.81 → 0.24. The model learns:

The token layout of the Qwen chat template
High-frequency patterns in the assistant’s empathetic phrasing

Phase B (250–500): Slowing Down
0.24 → 0.14. Starts fitting finer word choices.

Phase C (500–750): Diminishing Returns
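0.14 → 0.13. At the milestone steps, the third epoch lowers loss by only about 0.016 (0.1443 → 0.1286), versus about 0.096 in the second. Further training still helps, but only slightly; if you add a 4th epoch, watch for overfitting on just 1000 samples.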
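
The milestone rows are single steps and therefore noisy. A steadier way to compare the three phases is to average the loss within each epoch. The sketch below assumes `log_history` entries shaped like those in `trainer_state.json` (dicts with `step` and `loss`). The helper name `epoch_mean_losses` is hypothetical, and the demo values are synthetic rather than taken from `all_logs.log`.

```python
from collections import defaultdict
from statistics import mean


def epoch_mean_losses(log_history, steps_per_epoch=250):
    """Average the per-step loss within each epoch.

    log_history: list of dicts with "step" and "loss" keys. Entries without
    "loss" (for example the final summary entry) are skipped.
    steps_per_epoch=250 matches this run's milestone table.
    """
    buckets = defaultdict(list)
    for entry in log_history:
        if "loss" not in entry:
            continue
        # Steps 1..250 belong to epoch 1, 251..500 to epoch 2, and so on.
        epoch = (entry["step"] - 1) // steps_per_epoch + 1
        buckets[epoch].append(entry["loss"])
    return {epoch: mean(losses) for epoch, losses in sorted(buckets.items())}


# Tiny synthetic history (NOT values from all_logs.log) to show the call shape.
demo_history = [
    {"step": 1, "loss": 2.0},
    {"step": 2, "loss": 1.0},
    {"step": 3, "loss": 0.5},
    {"step": 4, "loss": 0.25},
    {"train_loss": 0.9375},
]
demo_means = epoch_mean_losses(demo_history, steps_per_epoch=2)
print(demo_means)  # {1: 1.5, 2: 0.375}
```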

3.2 train_loss 0.2587 vs Final 0.1286

train_loss is the arithmetic mean of the per-step loss over all 750 steps, so the early steps (loss ≈ 2.8) pull it up.
Do not read 0.26 as “the final result is poor”; look at the plateau near the end of Epoch 3, where loss oscillates between roughly 0.09 and 0.13.
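
To see why the two numbers differ, compare a mean over the whole run with a mean over only the last few steps. This is a minimal sketch with a hypothetical helper and a synthetic curve (not the project's log). With the real data, the first number would correspond to the reported train_loss and the second to the plateau.

```python
from statistics import mean


def full_and_tail_mean(losses, tail=50):
    """Return (mean over all steps, mean over the last `tail` steps)."""
    if not losses:
        raise ValueError("losses must not be empty")
    return mean(losses), mean(losses[-tail:])


# Synthetic curve for illustration only (NOT the project's log): a few large
# early losses followed by a long flat stretch.
demo_losses = [2.0, 1.0, 0.5] + [0.125] * 7
full_mean, tail_mean = full_and_tail_mean(demo_losses, tail=5)
print(f"full-run mean: {full_mean:.4f}, tail mean: {tail_mean:.4f}")
# full-run mean: 0.4375, tail mean: 0.1250
```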

3.3 Is Per-Step Fluctuation Normal?

From the end of the log:

Step 748: loss=0.0940
Step 749: loss=0.1099
Step 750: loss=0.1286

Single-step loss is driven by the two samples in the current batch, so it is noisy. A slight rise at step 750 does not mean training collapsed; judge by the average over steps 700–750 instead.
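
One way to make that judgment concrete is to ask how far the last step sits from the mean of the preceding window, measured in standard deviations. This is a sketch under assumptions: the helper name is hypothetical, the demo window is synthetic, and the "within about 2 standard deviations" reading is a rule of thumb rather than a project setting.

```python
from statistics import mean, pstdev


def last_step_z_score(losses):
    """Standard deviations between the last loss and the mean of the earlier ones.

    `losses` is a window of consecutive per-step losses, e.g. steps 700-750.
    """
    if len(losses) < 3:
        raise ValueError("need at least 3 losses to estimate spread")
    window, last = losses[:-1], losses[-1]
    spread = pstdev(window)
    if spread == 0:
        return 0.0 if last == window[0] else float("inf")
    return (last - mean(window)) / spread


# Synthetic window (NOT from all_logs.log) that wobbles between 0.09 and 0.13.
demo_window = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.12]
demo_z = last_step_z_score(demo_window)
print(f"z-score of last step: {demo_z:.2f}")
```

A value well inside ±2 suggests ordinary batch noise; a large positive value that persists over several steps would deserve a closer look.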

4. Token Accuracy

The `mean_token_accuracy` field in the log (reported by TRL's `SFTTrainer` rather than by core Transformers):

Step	Accuracy
1	43.6%
250	95.7%
750	96.3%

In SFT, this is roughly the fraction of label positions where the model's top-1 next-token prediction matches the label. 96%+ means the model can already reproduce the assistant wording in the training set. Article 08 is still needed to check generalization, because accuracy does not measure whether a response sounds preachy.
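
The metric can be sketched in plain Python as follows. This is an illustration of the idea, not the trainer's actual code. It assumes the common convention that masked positions (prompt and padding) carry the label -100, and that predictions are already aligned with labels.

```python
IGNORE_INDEX = -100  # assumed masking value for prompt/padding positions


def token_accuracy(predicted_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Fraction of unmasked positions where the predicted token equals the label.

    Both sequences must already be aligned, i.e. predicted_ids[i] is the
    model's top-1 guess for the token stored in label_ids[i].
    """
    if len(predicted_ids) != len(label_ids):
        raise ValueError("sequences must have the same length")
    scored = [(p, y) for p, y in zip(predicted_ids, label_ids) if y != ignore_index]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)


# Toy token ids: the first two positions are masked, 3 of 4 scored ones match.
demo_accuracy = token_accuracy([5, 9, 11, 12, 13, 99], [-100, -100, 11, 12, 13, 14])
print(f"{demo_accuracy:.1%}")  # 75.0%
```

Because only exact token matches count, a perfectly acceptable synonym is scored as a miss, and a fluent but preachy reply can still score high.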

5. Learning Rate Schedule

Step	lr
1	2.0e-4
250	1.34e-4
500	6.69e-5
750	2.67e-7

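This is the default linear decay: starting from 2.0e-4, the rate drops by 2.0e-4 / 750 at every step. LoRA runs often use a constant or cosine schedule instead, but this project already converged with linear decay, so there is no need to change the schedule and retrain.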
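
The four table values can be reproduced with a one-line formula. The sketch assumes no warmup and that the logged value is the rate applied during that step (so step 1 still sees the full 2.0e-4). Both assumptions are inferred from the table, not read from the training config.

```python
def linear_lr(step, base_lr=2.0e-4, total_steps=750):
    """Learning rate logged at `step` (1-based) under linear decay, no warmup.

    Assumption: the logged value is the rate applied during that step, so
    step 1 sees base_lr and the last step sees base_lr / total_steps.
    """
    if not 1 <= step <= total_steps:
        raise ValueError("step out of range")
    return base_lr * (total_steps - (step - 1)) / total_steps


for step in (1, 250, 500, 750):
    print(f"step {step:>3}: lr = {linear_lr(step):.3g}")
```

The formula gives 1.336e-4 at step 250, 6.693e-5 at step 500, and 2.667e-7 at step 750, which round to the logged values.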

6. Throughput and Cost

Metric	Value
train_runtime	2484 s ≈ 41 min 24 s
s/it	~3.31 s (tqdm)
samples/s	1.208

Estimated cloud cost: V100 hourly price × 0.69 h. Acceptable for personal experiments.

7. How to Judge Without eval_loss

In this project, train_dataset is all 1000 samples; there is no hold-out set.

Method	How
Qualitative	`verify_lora.py` defaults to 3 questions + custom user
Quantitative (not done)	Split 100 validation samples, `eval_strategy="steps"`
Overfitting signal	Training loss keeps decreasing but validation replies repeat training sentences

The loss plateau plus the qualitative checks (three custom user prompts tested on Mac MPS) show that the style meets expectations, but no held-out metric was computed. Add a validation set before using these results in a paper or product report.
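
Both the quantitative split and the overfitting signal in the table can be approximated with a few lines of standard-library Python. The helpers below are hypothetical and not part of `verify_lora.py`. The held-out list would be what you pass to the trainer as its evaluation set, and the n-gram ratio is a crude proxy for "replies repeat training sentences".

```python
import random


def split_holdout(samples, n_val=100, seed=42):
    """Shuffle a copy of `samples` and return (train, validation) lists."""
    if not 0 < n_val < len(samples):
        raise ValueError("n_val must be between 1 and len(samples) - 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]


def copied_ngram_ratio(reply, training_replies, n=5):
    """Share of the reply's word n-grams that appear verbatim in training replies.

    Values near 1.0 suggest the model is reciting training sentences.
    Whitespace tokenisation is a simplification; unsegmented text such as
    Chinese would need character n-grams or a proper tokenizer.
    """
    def ngrams(text):
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    reply_grams = ngrams(reply)
    if not reply_grams:
        return 0.0
    seen = set()
    for text in training_replies:
        seen |= ngrams(text)
    return len(reply_grams & seen) / len(reply_grams)


train_part, val_part = split_holdout(list(range(1000)), n_val=100)
print(len(train_part), len(val_part))  # 900 100

# Made-up sentences, only to show the two extremes of the ratio.
demo_training = ["please remember to drink warm water every morning"]
print(copied_ngram_ratio("please remember to drink warm water every morning", demo_training))  # 1.0
print(copied_ngram_ratio("the weather is nice for a short walk today", demo_training))  # 0.0
```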

8. Checkpoint Selection

Directory	Use Case
`final_lora`	Default deployment/verification
`checkpoint-700`	A/B test if step 750 generation quality is abnormal
`checkpoint-500`	Loss already low, try to reduce overfitting risk

In this project, step 750 behaves normally, so there is no need to roll back.
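
Before running an A/B generation test, it can help to list the last logged loss of each saved checkpoint. A minimal sketch, assuming the `output/lora_elderly_single/checkpoint-*/trainer_state.json` layout from section 2; the function name is hypothetical.

```python
import json
from pathlib import Path


def last_loss_per_checkpoint(output_dir):
    """Map checkpoint step -> last logged training loss in its trainer_state.json."""
    root = Path(output_dir)
    if not root.is_dir():
        return {}
    results = {}
    for state_path in root.glob("checkpoint-*/trainer_state.json"):
        # Directory names look like "checkpoint-500"; the suffix is the step.
        step = int(state_path.parent.name.split("-")[-1])
        with state_path.open(encoding="utf-8") as handle:
            state = json.load(handle)
        losses = [e["loss"] for e in state.get("log_history", []) if "loss" in e]
        if losses:
            results[step] = losses[-1]
    return dict(sorted(results.items()))


# Path taken from the file table in section 2; adjust it to your own run.
for step, loss in last_loss_per_checkpoint("output/lora_elderly_single").items():
    print(f"checkpoint-{step}: last loss {loss:.4f}")
```

Each value is a single noisy step, so treat the listing as a quick sanity check. The actual choice between checkpoints should still rest on generated replies.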

9. Pitfalls

Pitfall 1: Only looking at tqdm loss, ignoring Callback
The two should agree; if the Callback prints nothing, check whether `TrainingProgressCallback` was removed.

Pitfall 2: Treating a single-step grad_norm spike as failure
At step 750, grad_norm=1.14 is slightly above the usual 0.6–0.9 range. One such step is acceptable; it is worth investigating only if the value stays elevated over many consecutive steps.

Pitfall 3: Expecting loss → 0
SFT loss is cross-entropy. The labels are inherently variable (synonyms, alternative phrasings of the same reply), so the model cannot predict every token with certainty; a plateau around 0.1 is normal.

Pitfall 4: Comparing logs from old versions with new parameter training
Changing the batch size or the number of epochs changes the total step count, so the curves cannot be compared step for step.
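
For Pitfall 2, the difference between an isolated spike and a sustained one can be checked mechanically. In the sketch below, the `threshold` and `patience` defaults are illustrative assumptions, not values from this project, and the demo series is synthetic.

```python
def longest_run_above(values, threshold):
    """Length of the longest run of consecutive values above `threshold`."""
    longest = current = 0
    for value in values:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def sustained_spike(grad_norms, threshold=1.0, patience=5):
    """True only if grad_norm stays above `threshold` for `patience` steps in a row.

    threshold and patience are illustrative defaults, not project settings.
    """
    return longest_run_above(grad_norms, threshold) >= patience


# Synthetic series: one isolated spike (1.14) among typical 0.6-0.9 values.
print(sustained_spike([0.7, 0.8, 0.6, 0.9, 1.14]))  # False
```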

10. Summary

2.81 → 0.13, main gain in the 1st epoch.
train_loss 0.2587 is the full-run average, not final quality.
Token accuracy 96% only indicates fit to labels; semantic verification via inference is needed.
41 minutes for 750 steps is the baseline on a V100.
Without an eval set, verify_lora.py is a necessary supplement.

Appendix: print_training_metrics Logic

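# LoRA_Demo/verify_lora.py lines 60-94 (excerpt)
# `state` is the dict parsed from a checkpoint's trainer_state.json.

# Keep only per-step training entries; the closing summary entry
# stores "train_loss" rather than "loss" and is filtered out here.
logs = [x for x in state["log_history"] if "loss" in x]
first, last = logs[0], logs[-1]
print(f"  Starting loss: {first['loss']:.4f}  (step {first['step']})")
print(f"  Final loss: {last['loss']:.4f}  (step {last['step']})")
# mean_token_accuracy is only present when the trainer logs it.
if "mean_token_accuracy" in last:
    print(f"  Final token accuracy: {last['mean_token_accuracy']:.1%}")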

Article	Link
Previous	06 · SFT Part 2
Next	08 · Verification
Index	README
