0. Series Closed Loop

| Position in Series | Upstream | This Post’s Output | Downstream |
| --- | --- | --- | --- |
| Post 3/10 | Post 02: Data into Model | Understand r/alpha/target_modules | Posts 05–06: Trainer Configuration · Post 10: vLLM --max-lora-rank |

After reading this post, open train_lora_single.py and LoraConfig should no longer be a “hyperparameter black box”.


1. The Actual Problem to Solve

Full fine-tuning of Qwen3.5-4B (approx. 4,216,368,128 parameters) on a single V100 GPU:

  • Optimizer states consume a large amount of memory; even with bf16 it is a tight fit
  • Each experiment saves checkpoints of 8 GB or more, so iteration is slow
  • With only 1000 conversation examples against 4.2B parameters, updating all weights is extremely prone to overfitting

LoRA’s core promise: only learn the “task delta” ΔW, and ΔW is low-rank and decomposable, reducing parameters to the tens of millions.

Measured in this project (line 34 of all_logs.log):

trainable params: 10,616,832 || all params: 4,216,368,128 || trainable%: 0.2518
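
The percentage in that log line can be reproduced from its two counts. The sketch below does so and also estimates the adapter size; the 4 bytes per parameter (fp32 storage) is an assumption, not something the log states.

```python
# Counts copied from the log line above.
trainable_params = 10_616_832
all_params = 4_216_368_128

trainable_pct = 100 * trainable_params / all_params
print(f"trainable%: {trainable_pct:.4f}")   # trainable%: 0.2518

# Assumption: each trainable weight is stored as fp32 (4 bytes).
adapter_bytes = trainable_params * 4
print(f"adapter size: {adapter_bytes / 2**20:.1f} MiB")   # adapter size: 40.5 MiB
```

Under that assumption the estimate is close to the adapter file size listed in the next section.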

2. Implementation Location

| File | Content |
| --- | --- |
| LoRA_Demo/train_lora_single.py | LORA_R, LORA_ALPHA, LoraConfig(...) |
| LoRA_Demo/output/.../final_lora/adapter_config.json | r=8, alpha=16, target_modules list after training |
| LoRA_Demo/output/.../final_lora/adapter_model.safetensors | Trainable weights (~41 MB) |

Note: LoRA is injected only when SFTTrainer(..., peft_config=lora_config) is created, not when LoraConfig(...) is defined.


3. Mathematical Form (Corresponding to Code)

For a linear layer, original forward:

\[
y = W x
\]

LoRA (default in peft):

\[
y = W x + \frac{\alpha}{r} B A x
\]

  • \(W\): frozen, from AutoModelForCausalLM.from_pretrained
  • \(A \in \mathbb{R}^{r \times k}\), \(B \in \mathbb{R}^{d \times r}\): trained
  • In code, LORA_R = 8 → \(r = 8\)
  • LORA_ALPHA = 16 → scaling \(\alpha / r = 2\)
flowchart LR
x[Input x] --> W[Frozen W]
x --> A[Trainable A]
A --> B[Trainable B]
W --> add((+))
B --> scale["× α/r"]
scale --> add
add --> y[Output y]
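
The forward pass in the formula and diagram above can be traced on toy matrices. This is a minimal pure-Python sketch, not the peft implementation: the sizes are made up, and `matvec` and `lora_forward` are hypothetical helper names. It starts B at zero, which is how peft initializes it by default, so the LoRA branch contributes nothing before training.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

random.seed(0)
d, k, r, alpha = 4, 3, 2, 4   # toy sizes, not the real model's
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]   # d x k, frozen
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # r x k
B = [[0.0] * r for _ in range(d)]                                # d x r, zeros
x = [1.0, -2.0, 0.5]

y_init = lora_forward(W, A, B, x, alpha, r)
assert y_init == matvec(W, x)   # zero B: output equals the frozen model's

B[0][0] = 1.0                   # pretend a training step changed one entry of B
y_trained = lora_forward(W, A, B, x, alpha, r)
```

Only the first output component changes after the update, because only the first row of B is non-zero.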

Why low rank is sufficient: Style SFT modifies the conditional distribution of “how to say things,” which is a low-dimensional shift relative to the original model; there is no need to modify the full-rank \(W\).


4. Three Hyperparameters and Their Values in This Project

4.1 LORA_R = 8

| r | Parameter Count | Use Case |
| --- | --- | --- |
| 4 | Fewer | Very narrow task, prevent overfitting |
| 8 | This project | Balance point for 1000 style SFT examples |
| 16+ | More | More complex behavior / multi-domain, use with caution on small data |

When deploying with vLLM, --max-lora-rank must be at least the training r, which is 8 here (see Post 10).
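
The “Fewer” and “More” entries in the table follow from the shapes of A and B: each adapted linear layer gains r × (d_in + d_out) parameters, so the count grows linearly with r. A small sketch, using a hypothetical layer shape that is not taken from Qwen3.5-4B:

```python
def lora_param_count(r, d_in, d_out):
    """Parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

# Hypothetical layer shape, chosen only for illustration.
d_in, d_out = 2048, 2048
full = d_in * d_out
for r in (4, 8, 16):
    added = lora_param_count(r, d_in, d_out)
    print(f"r={r:>2}: {added:>6,} LoRA params vs {full:,} in the frozen weight")
```

Doubling r doubles the adapter size, which is why a larger r carries more overfitting risk on a small dataset.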

4.2 LORA_ALPHA = 16

Controls the effective step size of the LoRA branch on the output. Bigger is not always better: too large can cause oscillation, too small may not learn. 16/8=2 is a common empirical starting point.
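
Because the multiplier is α/r rather than α alone, changing r without touching alpha also changes the strength of the LoRA branch. A short sketch of that coupling; `lora_scale` is a hypothetical helper, and it assumes peft’s default scaling without rank-stabilized LoRA:

```python
def lora_scale(lora_alpha, r):
    """Multiplier applied to the B A x branch (assumes the default alpha / r rule)."""
    return lora_alpha / r

print(lora_scale(16, 8))    # 2.0  (this project)
print(lora_scale(16, 16))   # 1.0  (doubling r with alpha fixed halves the scale)
print(lora_scale(32, 16))   # 2.0  (doubling alpha as well restores it)
```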

4.3 LORA_DROPOUT = 0.05

Applied only to the LoRA branch. With 1000 examples containing repeated sentence patterns, slight dropout reduces rote memorization.


5. target_modules: Why Inject into Both Attention and FFN

Lines 189–192 of train_lora_single.py:

target_modules=[
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
],

Each Transformer layer in Qwen3.5:

Self-Attention: q_proj, k_proj, v_proj, o_proj
FFN (SwiGLU): gate_proj, up_proj, down_proj

| Module Group | Impact |
| --- | --- |
| Q/K/V/O | Attention pattern: focus on user’s emotional words vs factual words |
| gate/up/down | Feed-forward non-linearity: word usage habits, sentence rhythm |

Injecting only q_proj and v_proj also works, but style transfer is usually weaker than when all attention and FFN modules are injected. The cost is an increase in parameters from millions to tens of millions, which is still only 0.25% for this project.

Verification from saved file (adapter_config.json):

"target_modules": ["v_proj", "k_proj", "up_proj", "down_proj", "q_proj", "gate_proj", "o_proj"],
"r": 8,
"lora_alpha": 16,
"lora_dropout": 0.05,
"bias": "none"

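
The saved list is in a different order from the one in the script, so a verification should compare sets rather than lists. The sketch below parses the fragment shown above (wrapped in braces so that it is valid JSON) and checks it against the script’s values; in practice the same check would read adapter_config.json from the output directory.

```python
import json

# Fragment copied from adapter_config.json above, wrapped in braces to parse.
saved = json.loads("""{
  "target_modules": ["v_proj", "k_proj", "up_proj", "down_proj", "q_proj", "gate_proj", "o_proj"],
  "r": 8,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "bias": "none"
}""")

# Values as written in the training script.
expected_modules = {"q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"}

# Compare as sets: the saved order differs from the script's order.
assert set(saved["target_modules"]) == expected_modules
assert (saved["r"], saved["lora_alpha"], saved["lora_dropout"]) == (8, 16, 0.05)
print("adapter_config.json matches the script")
```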
6. Integration with SFTTrainer (Common Pitfalls)

Correct Way

trainer = SFTTrainer(
    model=model,
    peft_config=lora_config,  # Only injected here
    ...
)

Incorrect Way

model = get_peft_model(model, lora_config)  # Manual injection
trainer = SFTTrainer(model=model, peft_config=lora_config, ...) # Duplicate

Duplicate injection can cause unexpected behavior or errors. The script comments (lines 184, 224) explicitly state “do not call get_peft_model again.”

Before training, trainer.model.print_trainable_parameters() must show ~0.25%. If it shows 0% or 100%, stop immediately and check the configuration.
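
That check can be turned into a guard that aborts the run automatically. A minimal sketch follows; `check_trainable_ratio` and its thresholds are hypothetical, chosen only so that both 0% and 100% fall outside the accepted range.

```python
def check_trainable_ratio(trainable, total, low_pct=0.01, high_pct=5.0):
    """Return the trainable percentage, or raise if it looks misconfigured.

    low_pct and high_pct are illustrative bounds, not values from the script.
    """
    pct = 100 * trainable / total
    if not (low_pct <= pct <= high_pct):
        raise RuntimeError(
            f"trainable% = {pct:.4f}, expected between {low_pct} and {high_pct}"
        )
    return pct

# Counts from this project's log line.
pct = check_trainable_ratio(10_616_832, 4_216_368_128)
print(f"{pct:.4f}")   # 0.2518
```

With a real model, the two counts can be computed as `sum(p.numel() for p in model.parameters() if p.requires_grad)` and `sum(p.numel() for p in model.parameters())`.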


7. Batch Size and Total Steps

BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 2
# Effective batch = 4

Total steps:

\[
\text{steps} = \lceil 1000 / 4 \rceil \times 3 = 750
\]

This has no direct relation to LoRA, but determines how many times each sample is seen and the learning rate schedule length. Changing batch size does not change LoRA, but it changes training dynamics.
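
The same arithmetic as a short script. The epoch count of 3 is an assumption read off the “× 3” factor in the formula, and the calculation assumes a single GPU.

```python
import math

num_examples = 1000
batch_size = 2
grad_accum_steps = 2
num_epochs = 3   # assumption: the "x 3" in the formula is the epoch count

effective_batch = batch_size * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(effective_batch, steps_per_epoch, total_steps)   # 4 250 750
```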


8. Pitfalls

Pitfall 1: Changing r without retraining the adapter
The old adapter’s adapter_config.json records r=8. If you change r to 16 in the script and then load the old weights, the A and B shapes will not match.

Pitfall 2: Typo in target_modules
For example, q_projj. PEFT can silently skip the misspelled name: trainable% drops, but no error is necessarily raised.

Pitfall 3: Assuming LoRA saves GPU memory so you can arbitrarily increase seq_len
LoRA mainly saves trainable parameters and optimizer states; the forward pass still runs the full 4B base model, and activation memory for 512 tokens still exists. If OOM, reduce BATCH_SIZE or MAX_SEQ_LEN first (see Post 06).

Pitfall 4: vLLM --max-lora-rank less than training r
Startup fails or silently degrades; deployment parameters must be aligned with adapter_config.json.
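
Pitfall 2 can be caught before training by checking every entry of target_modules against the model’s module names. The sketch below uses a hypothetical helper, `unmatched_targets`, with a hand-written list of module names. It assumes names are matched exactly or as a dotted suffix, which mirrors how peft treats a list of target names but is not its actual code.

```python
def unmatched_targets(target_modules, module_names):
    """Return targets that match no module name, exactly or as a dotted suffix."""
    return [
        t for t in target_modules
        if not any(name == t or name.endswith("." + t) for name in module_names)
    ]

# Hypothetical module names; with a real model, use
# [name for name, _ in model.named_modules()].
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.up_proj",
]

print(unmatched_targets(["q_projj", "v_proj"], module_names))   # ['q_projj']
```

A non-empty result before training is a reason to stop and fix the list.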


9. Summary

  1. LoRA freezes \(W\) and trains low-rank \(BA\); in this project only 0.2518% of parameters are trainable.
  2. r=8, alpha=16, dropout=0.05 are set at the top of the script and saved in adapter_config.json.
  3. 7 target_modules cover attention + FFN, suitable for style SFT.
  4. Only inject via SFTTrainer(peft_config=...), do not duplicate with get_peft_model.
  5. During deployment, vLLM’s --max-lora-rank must be at least r.

Appendix: LoraConfig Field Reference

# LoRA_Demo/train_lora_single.py lines 186-196

lora_config = LoraConfig(
    r=LORA_R,                   # Rank r, determines A/B shapes
    lora_alpha=LORA_ALPHA,      # α, effective scaling α/r
    target_modules=[...],       # Module names must exactly match model layer names
    lora_dropout=LORA_DROPOUT,
    bias="none",                # Don't train bias, save more parameters
    task_type="CAUSAL_LM",      # Causal LM, consistent with SFT
)

Series Navigation

| Post | Link |
| --- | --- |
| Previous | 02 · Dataset Design |
| Next | 04 · Environment Setup |
| Index | README |
