LoRA Principle: Training Only 0.25% of Parameters

0. Where This Post Fits in the Series

Position in Series	Upstream	This Post’s Output	Downstream
Post 3/10	Post 02: Data into Model	Understand r/alpha/target_modules	Posts 05–06: Trainer Configuration · Post 10: vLLM `--max-lora-rank`

After reading this post, you should be able to open train_lora_single.py and read LoraConfig without treating it as a “hyperparameter black box”.

1. The Actual Problem to Solve

Full fine-tuning of Qwen3.5-4B (approx. 4,216,368,128 parameters) on a single V100 GPU:

Optimizer states consume a huge amount of memory; even with bf16 weights the budget is tight
Each experiment saves a checkpoint of 8 GB or more, so iteration is slow
With only 1000 conversation examples against 4.2B parameters, updating every weight is highly prone to overfitting
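
A rough back-of-the-envelope estimate makes the first two points concrete. The sketch below assumes 2 bytes per weight (bf16) and an Adam-style optimizer that keeps two fp32 moment estimates per parameter (8 bytes). Both byte counts are assumptions for illustration, not values read from the training script, and gradients and activations are left out.

```python
# Rough memory estimate for full fine-tuning (illustrative assumptions only).
TOTAL_PARAMS = 4_216_368_128   # parameter count quoted in this post

BYTES_PER_WEIGHT_BF16 = 2      # assumption: weights stored in bf16
BYTES_ADAM_STATE_FP32 = 8      # assumption: two fp32 moments per parameter


def gigabytes(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


weights_gb = gigabytes(TOTAL_PARAMS * BYTES_PER_WEIGHT_BF16)
optimizer_gb = gigabytes(TOTAL_PARAMS * BYTES_ADAM_STATE_FP32)

print(f"bf16 weights / checkpoint: {weights_gb:.1f} GB")
print(f"Adam optimizer states:     {optimizer_gb:.1f} GB")
```

Under these assumptions the optimizer states alone exceed the memory of a single V100, which ships in 16 GB and 32 GB variants.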

LoRA’s core promise: learn only the “task delta” ΔW, and assume ΔW is low-rank so that it factors into two small matrices. This cuts the trainable parameters to roughly ten million.

Measured in this project (line 34 of all_logs.log):

trainable params: 10,616,832 || all params: 4,216,368,128 || trainable%: 0.2518
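
The reported percentage can be recomputed from the two counts on that line. A minimal sketch, assuming the summary line has exactly the format shown above; `parse_trainable_line` is a hypothetical helper, not part of PEFT:

```python
import re

LOG_LINE = "trainable params: 10,616,832 || all params: 4,216,368,128 || trainable%: 0.2518"


def parse_trainable_line(line: str) -> tuple[int, int, float]:
    """Extract (trainable, total, reported_percent) from the summary line."""
    match = re.search(
        r"trainable params: ([\d,]+) \|\| all params: ([\d,]+) \|\| trainable%: ([\d.]+)",
        line,
    )
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    trainable = int(match.group(1).replace(",", ""))
    total = int(match.group(2).replace(",", ""))
    return trainable, total, float(match.group(3))


trainable, total, reported = parse_trainable_line(LOG_LINE)
recomputed = 100 * trainable / total
print(f"recomputed trainable%: {recomputed:.4f} (log reports {reported})")
```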

2. Implementation Location

File	Content
`LoRA_Demo/train_lora_single.py`	`LORA_R`, `LORA_ALPHA`, `LoraConfig(...)`
`LoRA_Demo/output/.../final_lora/adapter_config.json`	r=8, alpha=16, target_modules list after training
`LoRA_Demo/output/.../final_lora/adapter_model.safetensors`	Trainable weights (~41 MB)

Note: LoRA is injected only when SFTTrainer(..., peft_config=lora_config) is created, not when LoraConfig(...) is defined.

3. Mathematical Form (Corresponding to Code)

For a linear layer, original forward:

\[
y = W x
\]

LoRA (default in peft):

\[
y = W x + \frac{\alpha}{r} B A x
\]

\(W\): frozen, from AutoModelForCausalLM.from_pretrained
\(A \in \mathbb{R}^{r \times k}\), \(B \in \mathbb{R}^{d \times r}\): trained
In code, LORA_R = 8 → \(r = 8\)
LORA_ALPHA = 16 → scaling \(\alpha / r = 2\)

flowchart LR
    x[Input x] --> W[Frozen W]
    x --> A[Trainable A]
    A --> B[Trainable B]
    W --> add((+))
    B --> scale["× α/r"]
    scale --> add
    add --> y[Output y]
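
The formula and diagram can be traced with a toy forward pass. This is a minimal pure-Python sketch, not PEFT's implementation: the sizes are toy values, and B starts at zero, as in PEFT's default initialization, so the adapter is a no-op before training.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen path
    delta = matvec(B, matvec(A, x))   # low-rank path: k -> r -> d
    scale = alpha / r
    return [b + scale * dlt for b, dlt in zip(base, delta)]


d, k, r, alpha = 6, 5, 2, 4           # toy sizes; alpha / r = 2 as in this project
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]   # frozen, d x k
A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # trainable, r x k
B = [[0.0] * r for _ in range(d)]                             # trainable, d x r, zeros at start
x = [rng.gauss(0, 1) for _ in range(k)]

y_start = lora_forward(W, A, B, x, alpha, r)
print("adapter is a no-op at start:", y_start == matvec(W, x))

B[0][0] = 1.0                         # pretend one training update changed B
y_after = lora_forward(W, A, B, x, alpha, r)
print("output changed after update:", y_after != y_start)
print("trainable:", r * (d + k), "vs full layer:", d * k)
```

Only A and B would receive gradients; W is read but never updated.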

Why low rank is sufficient: style SFT mainly changes the conditional distribution of “how to say things”, which is a low-dimensional shift relative to the original model, so a full-rank update to \(W\) is unnecessary.

4. Three Hyperparameters and Their Values in This Project

4.1 `LORA_R = 8`

r	Parameter Count	Use Case
4	Fewer	Very narrow task, prevent overfitting
8	This project	Balance point for 1000 style SFT examples
16+	More	More complex behavior / multi-domain, use with caution on small data
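
For a single adapted linear layer with output dimension d and input dimension k, LoRA adds r(d + k) parameters, so the count grows linearly with r. A small sketch with a hypothetical 4096 × 4096 layer (chosen for illustration, not the actual Qwen3.5-4B shapes):

```python
def lora_params(d_out: int, k_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x k_in linear layer."""
    return r * (d_out + k_in)


D_OUT, K_IN = 4096, 4096   # hypothetical layer shape, for illustration only

for r in (4, 8, 16):
    added = lora_params(D_OUT, K_IN, r)
    full = D_OUT * K_IN
    print(f"r={r:>2}: {added:>7,} LoRA params ({100 * added / full:.2f}% of the full layer)")
```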

When deploying with vLLM, --max-lora-rank must be at least the training r, so 8 is sufficient here (see Post 10).

4.2 `LORA_ALPHA = 16`

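LORA_ALPHA sets the scaling α/r applied to the LoRA branch, which acts like an effective step size for its contribution to the output. Bigger is not always better: too large a value can make training oscillate, and too small a value may leave the adapter unable to learn. The ratio 16/8 = 2 is a common empirical starting point.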
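
Because the scale is α/r, changing r while leaving alpha fixed also changes the effective scale. A small illustration, assuming the default α/r rule; every (r, alpha) pair other than this project's 8/16 is hypothetical:

```python
def lora_scaling(alpha: float, r: int) -> float:
    """Scale applied to the LoRA branch under the default alpha / r rule."""
    return alpha / r


for r, alpha in [(8, 16), (16, 16), (16, 32), (4, 16)]:
    print(f"r={r:<2} alpha={alpha:<2} -> scaling {lora_scaling(alpha, r):g}")
```

Doubling r to 16 with alpha left at 16 halves the scale to 1, which is why alpha is usually adjusted together with r.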

4.3 `LORA_DROPOUT = 0.05`

Applied only to the LoRA branch. With 1000 examples containing repeated sentence patterns, slight dropout reduces rote memorization.
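
The phrase "only to the LoRA branch" can be sketched as follows: the frozen path sees the unmodified input, while the low-rank path sees a dropped-out copy. This is a toy pure-Python illustration with made-up matrices, not PEFT's code.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dropout(vector, p, rng, training):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(vector)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def lora_layer(W, A, B, x, alpha, r, p, rng, training):
    base = matvec(W, x)                      # frozen path: no dropout
    lora_in = dropout(x, p, rng, training)   # dropout on the LoRA branch input only
    delta = matvec(B, matvec(A, lora_in))
    return [b + (alpha / r) * dlt for b, dlt in zip(base, delta)]


# Toy matrices (hypothetical values): d = 2, k = 3, r = 1, alpha = 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B = [[1.0], [2.0]]
x = [1.0, 2.0, 3.0]
rng = random.Random(0)

eval_out = lora_layer(W, A, B, x, alpha=2, r=1, p=0.05, rng=rng, training=False)
train_out = lora_layer(W, A, B, x, alpha=2, r=1, p=0.05, rng=rng, training=True)
print("eval: ", eval_out)
print("train:", train_out)
```

At inference time the dropout is inactive, so the adapter output is deterministic.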

5. target_modules: Why Inject into Both Attention and FFN

Lines 189–192 of train_lora_single.py:

target_modules=[
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
],

Each Transformer layer in Qwen3.5:

Self-Attention: q_proj, k_proj, v_proj, o_proj
FFN (SwiGLU): gate_proj, up_proj, down_proj

Module Group	Impact
Q/K/V/O	Attention pattern: focus on user’s emotional words vs factual words
gate/up/down	Feed-forward non-linearity: word usage habits, sentence rhythm

Injecting only q_proj and v_proj also trains, but style transfer is usually weaker than with all attention and FFN modules injected. The cost is a larger adapter, growing from a few million parameters to about ten million, which is still only 0.25% of the model in this project.

Verification from saved file (adapter_config.json):

"target_modules": ["v_proj", "k_proj", "up_proj", "down_proj", "q_proj", "gate_proj", "o_proj"],
"r": 8,
"lora_alpha": 16,
"lora_dropout": 0.05,
"bias": "none"

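The same verification can be scripted. The saved list is in a different order from the script, so the sketch below compares target_modules as sets. `check_adapter_config` is a hypothetical helper, and the JSON string stands in for the contents of adapter_config.json, restricted to the fields shown above.

```python
import json

EXPECTED_TARGETS = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
}


def check_adapter_config(cfg: dict, r: int = 8, alpha: int = 16,
                         dropout: float = 0.05) -> list[str]:
    """Return human-readable mismatches; an empty list means the config matches."""
    problems = []
    saved_targets = set(cfg.get("target_modules", []))
    if saved_targets != EXPECTED_TARGETS:
        problems.append(f"target_modules differ: {sorted(saved_targets ^ EXPECTED_TARGETS)}")
    for key, expected in (("r", r), ("lora_alpha", alpha),
                          ("lora_dropout", dropout), ("bias", "none")):
        if cfg.get(key) != expected:
            problems.append(f"{key}: saved {cfg.get(key)!r}, expected {expected!r}")
    return problems


# Stand-in for json.load() on the real file.
saved = json.loads("""
{
  "target_modules": ["v_proj", "k_proj", "up_proj", "down_proj",
                     "q_proj", "gate_proj", "o_proj"],
  "r": 8,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "bias": "none"
}
""")

print(check_adapter_config(saved) or "adapter_config.json matches the script")
```
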
6. Integration with SFTTrainer (Common Pitfalls)

Correct Way

trainer = SFTTrainer(
    model=model,
    peft_config=lora_config,  # Only injected here
    ...
)

Incorrect Way

model = get_peft_model(model, lora_config)  # Manual injection
trainer = SFTTrainer(model=model, peft_config=lora_config, ...)  # Duplicate

Duplicate injection can cause unexpected behavior or errors. The script comments (lines 184, 224) explicitly state “do not call get_peft_model again.”

Before training, trainer.model.print_trainable_parameters() must show ~0.25%. If it shows 0% or 100%, stop immediately and check the configuration.
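
This check can be turned into a guard that fails fast. The sketch below relies only on `parameters()`, `numel()` and `requires_grad`, which a torch model provides. The thresholds and the `FakeParam`/`FakeModel` stand-ins are hypothetical and exist only so the example runs without loading a model.

```python
def trainable_percent(model) -> float:
    """Percentage of parameters with requires_grad=True."""
    trainable = total = 0
    for param in model.parameters():
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return 100 * trainable / total


def assert_lora_injected(model, low: float = 0.05, high: float = 5.0) -> float:
    """Raise if trainable% is outside a plausible LoRA range (thresholds are assumptions)."""
    pct = trainable_percent(model)
    if not low <= pct <= high:
        raise RuntimeError(f"trainable% = {pct:.4f}; check peft_config before training")
    return pct


class FakeParam:
    """Hypothetical stand-in exposing numel() and requires_grad."""
    def __init__(self, count: int, requires_grad: bool):
        self._count = count
        self.requires_grad = requires_grad

    def numel(self) -> int:
        return self._count


class FakeModel:
    """Hypothetical stand-in exposing parameters()."""
    def __init__(self, params):
        self._params = params

    def parameters(self):
        return iter(self._params)


demo = FakeModel([
    FakeParam(4_216_368_128 - 10_616_832, requires_grad=False),  # frozen base weights
    FakeParam(10_616_832, requires_grad=True),                   # LoRA A/B matrices
])
print(f"trainable%: {assert_lora_injected(demo):.4f}")
```

With a real run, the call would be `assert_lora_injected(trainer.model)` right after the trainer is constructed.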

7. Effective Batch Size and Total Steps

BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 2
# Effective batch = 4

Total steps:

\[
\text{steps} = \lceil 1000 / 4 \rceil \times 3 = 750
\]

This is not directly related to LoRA, but it determines how many times each sample is seen and the length of the learning rate schedule. Changing the batch size leaves the LoRA configuration untouched, yet it changes the training dynamics.
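
The step count above can be reproduced in a few lines. The sketch assumes a single GPU and reads the factor of 3 in the formula as the number of epochs; `NUM_EPOCHS` and `total_optimizer_steps` are hypothetical names, not taken from the script.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int,
                          grad_accum: int, epochs: int) -> int:
    """Optimizer steps for a single-device run with gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


NUM_EXAMPLES = 1000
BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 2
NUM_EPOCHS = 3   # assumption: the "x 3" factor in the formula above

steps = total_optimizer_steps(
    NUM_EXAMPLES, BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, NUM_EPOCHS
)
print("effective batch:", BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)
print("total steps:", steps)
```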

8. Pitfalls

Pitfall 1: Changing r without retraining the adapter
The old adapter’s adapter_config.json records r=8. If you change r to 16 in the script and then load the old weights, the A and B shapes will not match.

Pitfall 2: Typo in target_modules
For example, q_projj. As long as the other names still match, PEFT skips the misspelled name without raising an error, so the only visible symptom is a lower trainable%.

Pitfall 3: Assuming LoRA saves GPU memory so you can arbitrarily increase seq_len
LoRA mainly saves trainable parameters and optimizer states; the forward pass still runs the full 4B base model, and the activation memory for 512 tokens is still required. If you hit OOM, reduce BATCH_SIZE or MAX_SEQ_LEN first (see Post 06).

Pitfall 4: vLLM --max-lora-rank less than training r
vLLM rejects an adapter whose rank exceeds --max-lora-rank, so keep the deployment flag consistent with the r recorded in adapter_config.json.
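
Pitfall 2 can be caught before training by checking that every target name matches at least one module. The sketch below assumes suffix matching on module names, which approximates how a list of target_modules is resolved. `unmatched_targets` and the module names are hypothetical; a real list would come from `[name for name, _ in model.named_modules()]`.

```python
def unmatched_targets(module_names, target_modules):
    """Return the target names that match no module (exact or dotted-suffix match)."""
    missing = []
    for target in target_modules:
        hit = any(
            name == target or name.endswith("." + target)
            for name in module_names
        )
        if not hit:
            missing.append(target)
    return missing


# Hypothetical module names in the style of one decoder layer.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

targets = ["q_projj", "k_proj", "v_proj"]   # first entry contains a typo
print("unmatched:", unmatched_targets(module_names, targets))
```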

9. Summary

LoRA freezes \(W\) and trains the low-rank product \(BA\); in this project only 0.2518% of parameters are trainable.
r=8, alpha=16, dropout=0.05 are set at the top of the script and saved in adapter_config.json.
7 target_modules cover attention + FFN, suitable for style SFT.
Only inject via SFTTrainer(peft_config=...), do not duplicate with get_peft_model.
During deployment, vLLM’s --max-lora-rank must be at least r.

Appendix: `LoraConfig` Field Reference

# LoRA_Demo/train_lora_single.py lines 186-196

lora_config = LoraConfig(
    r=LORA_R,                 # Rank r, determines A/B shapes
    lora_alpha=LORA_ALPHA,    # α, effective scaling α/r
    target_modules=[...],     # Module names must exactly match model layer names
    lora_dropout=LORA_DROPOUT,
    bias="none",              # Don't train bias, save more parameters
    task_type="CAUSAL_LM",    # Causal LM, consistent with SFT
)

Post	Link
Previous	02 · Dataset Design
Next	04 · Environment Setup
Index	README
