0. Series Closed Loop
| Position in Series | Upstream | This Post’s Output | Downstream |
|---|---|---|---|
| Post 3/10 | Post 02: Data into Model | Understand r/alpha/target_modules | Posts 05–06: Trainer Configuration · Post 10: vLLM --max-lora-rank |
After reading this post, open train_lora_single.py and LoraConfig should no longer be a “hyperparameter black box”.
1. The Actual Problem to Solve
Full fine-tuning of Qwen3.5-4B (approx. 4,216,368,128 parameters) on a single V100 GPU:
- Optimizer states consume a large amount of memory; even with bf16 it is tight
- Each experiment saves checkpoints of 8 GB or more, so iteration is slow
- With only 1000 conversation examples against 4.2B parameters, updating all weights is extremely prone to overfitting
LoRA’s core promise: only learn the “task delta” ΔW, and because ΔW is low-rank and decomposable, the trainable parameters shrink to roughly ten million.
Measured in this project (line 34 of all_logs.log): only 0.2518% of the model’s parameters are trainable.
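To make the “low-rank and decomposable” claim concrete, the sketch below counts parameters for a single linear layer with and without LoRA. The 4096 × 4096 shape is a hypothetical illustration, not a dimension read from Qwen3.5-4B, and both helper names are introduced here only for this example.

```python
def full_param_count(d: int, k: int) -> int:
    """Parameters in a dense weight W of shape (d, k)."""
    return d * k


def lora_param_count(d: int, k: int, r: int) -> int:
    """Parameters in the LoRA pair: B is (d, r) and A is (r, k)."""
    return d * r + r * k


# Hypothetical layer shape, chosen only for illustration.
d, k, r = 4096, 4096, 8
full = full_param_count(d, k)
lora = lora_param_count(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

The LoRA count grows linearly in r, while the dense count depends only on the layer shape, which is why the trainable share stays small even when many layers are adapted.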
2. Implementation Location
| File | Content |
|---|---|
| LoRA_Demo/train_lora_single.py | LORA_R, LORA_ALPHA, LoraConfig(...) |
| LoRA_Demo/output/.../final_lora/adapter_config.json | r=8, alpha=16, target_modules list after training |
| LoRA_Demo/output/.../final_lora/adapter_model.safetensors | Trainable weights (~41 MB) |
Note: LoRA is injected only when SFTTrainer(..., peft_config=lora_config) is created, not when LoraConfig(...) is defined.
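As a rough consistency check between the 0.2518% trainable share reported in this post and the ~41 MB adapter file, the arithmetic can be sketched as follows. It assumes the adapter weights are stored as 32-bit floats (4 bytes per parameter); that storage dtype is an assumption, not something read from the file.

```python
TOTAL_PARAMS = 4_216_368_128   # parameter count quoted in this post
TRAINABLE_FRACTION = 0.002518  # 0.2518%, as reported in this post
BYTES_PER_PARAM = 4            # assumption: adapter saved as float32

trainable = round(TOTAL_PARAMS * TRAINABLE_FRACTION)
size_mib = trainable * BYTES_PER_PARAM / 2**20

print(f"trainable = about {trainable:,} parameters")
print(f"adapter   = about {size_mib:.1f} MiB at {BYTES_PER_PARAM} bytes each")
```

Under that assumption the estimate lands at roughly 40 MiB, which is in line with the file size listed in the table.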
3. Mathematical Form (Corresponding to Code)
For a linear layer, the original forward pass is:
\[
y = W x
\]
With LoRA (the default in peft):
\[
y = W x + \frac{\alpha}{r} B A x
\]
- \(W\): frozen, loaded by AutoModelForCausalLM.from_pretrained
- \(A \in \mathbb{R}^{r \times k}\), \(B \in \mathbb{R}^{d \times r}\): trained
- In code, LORA_R = 8 → \(r = 8\) and LORA_ALPHA = 16 → scaling \(\alpha/r = 2\)
Why low rank is sufficient: Style SFT modifies the conditional distribution of “how to say things,” which is a low-dimensional shift relative to the original model; there is no need to modify the full-rank \(W\).
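The forward formula in this section can be traced by hand on tiny matrices. The following is a minimal pure-Python sketch, not peft’s implementation; the shapes (d = 2, k = 3, r = 1) and all values are made up for illustration. A zero \(B\) is included because peft initializes \(B\) to zero by default, so training starts from the unmodified base output.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)              # frozen path through W
    delta = matvec(B, matvec(A, x))  # low-rank path, rank at most r
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


# Tiny illustrative shapes: d = 2, k = 3, r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]         # shape (r, k)
B_zero = [[0.0], [0.0]]       # shape (d, r); zero B leaves the base output unchanged
B_trained = [[0.5], [-1.0]]   # made-up values standing in for a trained B
x = [1.0, 2.0, 3.0]

y_start = lora_forward(W, A, B_zero, x, alpha=2, r=1)
y_after = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(y_start, y_after)
```

Note that \(W\) is never modified: the whole task delta lives in the product \(BA\), scaled by \(\alpha/r\).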
4. Three Hyperparameters and Their Values in This Project
4.1 LORA_R = 8
| r | Parameter Count | Use Case |
|---|---|---|
| 4 | Fewer | Very narrow task, prevent overfitting |
| 8 | This project | Balance point for 1000 style SFT examples |
| 16+ | More | More complex behavior / multi-domain, use with caution on small data |
When deploying with vLLM, --max-lora-rank must be at least the training r; this project uses --max-lora-rank 8 to match r=8 (see Post 10).
4.2 LORA_ALPHA = 16
Controls the effective step size of the LoRA branch on the output. Bigger is not always better: too large can cause oscillation, too small may not learn. 16/8=2 is a common empirical starting point.
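Because the multiplier on the LoRA branch is \(\alpha/r\) rather than \(\alpha\) alone, changing r with alpha fixed also changes the effective scale. A small sketch (the helper name is introduced here for illustration):

```python
def lora_scaling(alpha: float, r: int) -> float:
    """Multiplier applied to the LoRA branch: alpha / r."""
    return alpha / r


# This project's setting first, then two variations for comparison.
for alpha, r in [(16, 8), (16, 16), (32, 16)]:
    print(f"alpha={alpha:>2}, r={r:>2} -> scale={lora_scaling(alpha, r)}")
```

Doubling r while keeping alpha at 16 halves the scale, so when experimenting with a larger rank it is worth deciding deliberately whether alpha should move with it.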
4.3 LORA_DROPOUT = 0.05
Applied only to the LoRA branch. With 1000 examples containing repeated sentence patterns, slight dropout reduces rote memorization.
5. target_modules: Why Inject into Both Attention and FFN
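Lines 189–192 of train_lora_single.py set target_modules to seven projection layers: the four attention projections (Q/K/V/O) and the three feed-forward projections (gate/up/down).
Each Transformer layer in Qwen3.5 contains both groups: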
| Module Group | Impact |
|---|---|
| Q/K/V/O | Attention pattern: focus on user’s emotional words vs factual words |
| gate/up/down | Feed-forward non-linearity: word usage habits, sentence rhythm |
Injecting only q_proj and v_proj also trains, but style transfer is usually weaker than when all attention and FFN modules are injected. The cost is more trainable parameters, from a few million to roughly ten million, which is still only about 0.25% for this project.
Verification from the saved file: adapter_config.json should record r=8, alpha=16, and the full target_modules list.
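The check against adapter_config.json can be automated with the standard library. The sketch below assumes the file uses the keys r, lora_alpha, and target_modules, which is how peft normally writes it; the function name is hypothetical, and the path is left to the caller because the output directory is elided in this post.

```python
import json
from pathlib import Path


def check_adapter_config(path, expected_r, expected_alpha, expected_modules):
    """Return a list of mismatches between adapter_config.json and expectations."""
    cfg = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = []
    if cfg.get("r") != expected_r:
        problems.append(f"r is {cfg.get('r')}, expected {expected_r}")
    if cfg.get("lora_alpha") != expected_alpha:
        problems.append(
            f"lora_alpha is {cfg.get('lora_alpha')}, expected {expected_alpha}"
        )
    # target_modules is stored as a list; compare as sets to ignore order.
    saved = set(cfg.get("target_modules") or [])
    missing = set(expected_modules) - saved
    if missing:
        problems.append(f"missing target_modules: {sorted(missing)}")
    return problems
```

An empty list means the saved adapter agrees with the values set in the script; any entry names the field that drifted.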
6. Integration with SFTTrainer (Common Pitfalls)
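Correct Way
Pass the config to the trainer and let it inject LoRA: SFTTrainer(..., peft_config=lora_config).
Incorrect Way
Call get_peft_model(model, lora_config) yourself and then also pass peft_config=lora_config to SFTTrainer, which injects LoRA a second time.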
Duplicate injection can cause unexpected behavior or errors. The script comments (lines 184, 224) explicitly state “do not call get_peft_model again.”
Before training, trainer.model.print_trainable_parameters() must show ~0.25%. If it shows 0% or 100%, stop immediately and check the configuration.
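The “stop if it shows 0% or 100%” rule can be turned into a hard guard. This is a minimal sketch with a hypothetical helper name and a tolerance chosen for illustration; in a PyTorch script the two counts would typically come from summing p.numel() over model.parameters(), filtered by p.requires_grad for the trainable count.

```python
def check_trainable_fraction(trainable: int, total: int,
                             expected: float = 0.0025,
                             rel_tol: float = 0.2) -> float:
    """Raise if the trainable share is far from the expected ~0.25%."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = trainable / total
    if abs(fraction - expected) > rel_tol * expected:
        raise RuntimeError(
            f"trainable share is {fraction:.4%}, expected about {expected:.2%}; "
            "check peft_config and target_modules before training"
        )
    return fraction


# Counts consistent with the figures quoted in this post (approximate).
share = check_trainable_fraction(10_616_815, 4_216_368_128)
print(f"{share:.4%}")
```

Both failure modes from the paragraph above are caught: 0 trainable parameters (LoRA never injected) and all parameters trainable (base model not frozen).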
7. Effective Batch Size: Not Related to LoRA but Determines Step Count
With 1000 examples, an effective batch size of 4, and 3 epochs, the total number of steps is:
\[
\text{steps} = \lceil 1000 / 4 \rceil \times 3 = 750
\]
This has no direct relation to LoRA, but determines how many times each sample is seen and the learning rate schedule length. Changing batch size does not change LoRA, but it changes training dynamics.
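The step count can be reproduced in a few lines. The split of the effective batch size into a per-device batch of 1 and 4 gradient-accumulation steps is a hypothetical example; only the product (4), the dataset size (1000), and the epoch count (3) come from the formula in this section. A particular trainer may round slightly differently.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum,
                          num_devices, epochs):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Hypothetical split: 1 sample per device x 4 accumulation steps x 1 GPU.
effective, steps = total_optimizer_steps(
    num_examples=1000, per_device_batch=1, grad_accum=4,
    num_devices=1, epochs=3,
)
print(effective, steps)
```

Any other split with the same product gives the same step count, which is why the schedule length depends on the effective batch size rather than on the individual settings.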
8. Pitfalls
Pitfall 1: Changing r without retraining the adapter
The old adapter’s adapter_config.json has r=8. If you manually change r=16 in the script and then load the old weights, shapes will not match.
Pitfall 2: Typo in target_modules
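For example, q_projj. The misspelled name matches no module, so PEFT silently skips the layers you meant to adapt; the trainable percentage drops, but an error may not be raised as long as the other names still match.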
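A typo like this can be caught before training by checking each target name against the model’s module names. The sketch below assumes matching by exact name or dotted suffix, which mirrors how peft resolves a list of target_modules as far as I know; the function name and the sample module names are hypothetical. In a real script the names would come from the keys of model.named_modules().

```python
def unmatched_targets(target_modules, module_names):
    """Return targets that match no module by exact name or dotted suffix."""
    unmatched = []
    for target in target_modules:
        hit = any(
            name == target or name.endswith("." + target)
            for name in module_names
        )
        if not hit:
            unmatched.append(target)
    return unmatched


# Hypothetical module names in the style of model.named_modules() keys.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.down_proj",
]
print(unmatched_targets(["q_projj", "v_proj"], names))
```

Failing fast when the returned list is non-empty turns a silent drop in trainable percentage into an explicit error.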
Pitfall 3: Assuming LoRA saves GPU memory so you can arbitrarily increase seq_len
LoRA mainly saves trainable parameters and optimizer states; the forward pass still runs the full 4B base model, and activation memory for 512 tokens still exists. If OOM, reduce BATCH_SIZE or MAX_SEQ_LEN first (see Post 06).
Pitfall 4: vLLM --max-lora-rank less than training r
Startup fails or silently degrades; deployment parameters must be aligned with adapter_config.json.
9. Summary
- LoRA freezes \(W\) and trains low-rank \(BA\); in this project only 0.2518% of parameters are trainable.
- r=8, alpha=16, dropout=0.05 are set at the top of the script and saved in adapter_config.json.
- 7 target_modules cover attention + FFN, suitable for style SFT.
- Only inject via SFTTrainer(peft_config=...); do not duplicate with get_peft_model.
- During deployment, vLLM’s max-lora-rank must be at least r; setting it equal to r is the simplest choice.
Appendix: LoraConfig Field Reference
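| Field | Value in This Project | Meaning |
|---|---|---|
| r | 8 | Rank of the LoRA matrices A and B |
| lora_alpha | 16 | Scaling numerator; effective scale is alpha/r = 2 |
| lora_dropout | 0.05 | Dropout applied only to the LoRA branch |
| target_modules | 7 modules (attention + FFN) | Linear layers that receive LoRA |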
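For quick reference, the fields discussed in this post can be collected in one place. This is a sketch, not a copy of the script: the numeric values come from this post, while the module names other than q_proj and v_proj are assumed from the Q/K/V/O and gate/up/down groups described in section 5. With peft installed, the dictionary could be passed as LoraConfig(**lora_kwargs).

```python
# Values from this post; module names beyond q_proj and v_proj are assumptions
# inferred from the attention (Q/K/V/O) and FFN (gate/up/down) groups.
lora_kwargs = {
    "r": 8,                # LORA_R
    "lora_alpha": 16,      # LORA_ALPHA
    "lora_dropout": 0.05,  # LORA_DROPOUT
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # feed-forward projections
    ],
}

# Effective multiplier on the LoRA branch.
scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"scale = {scale}, modules = {len(lora_kwargs['target_modules'])}")
```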
Series Navigation
| Post | Link |
|---|---|
| Previous | 02 · Dataset Design |
| Next | 04 · Environment Setup |
| Index | README |