0. Series Completion
| Position in Series | Upstream | This Output | Downstream |
|---|---|---|---|
| Article 5/10 | Article 04: Environment Ready | Tokenizer, Base Model, LoraConfig, Dataset | Article 06: SFTTrainer.train() |
By the end of this article, the model is on the GPU, the LoRA configuration is defined, and the data is in `{"text": ...}` format, but no gradient updates have occurred yet.
1. Practical Issues to Address
Beginners often get stuck when reading training scripts:
- What does `device_map={"": 0}` mean, and why can't it be written that way for multi-GPU?
- Why do you need to write `LoraConfig` and then pass it to `SFTTrainer`?
- What happens if you swap the two booleans in `apply_chat_template`?
This article walks through the first four steps of main() in their execution order.
2. Implementation Locations
| Symbol | Line (approx.) | Purpose |
|---|---|---|
| `parse_args()` | 72–79 | `--gpu_id` |
| `print_device_info()` | 130–144 | Print effective batch |
| `load_jsonl_data()` | 102–127 | JSONL → Dataset |
| `main()` Step 1–4 | 156–200 | Scope of this article |
Launch: run `python train_lora_single.py` from the project root, passing `--gpu_id` to select the GPU.
3. Configuration Zone: Impact of Changes
| Change | Impact |
|---|---|
| `BATCH_SIZE` / `GRADIENT_ACCUMULATION_STEPS` | GPU memory, effective batch size, total steps |
| `MAX_SEQ_LEN` | Scales GPU memory roughly linearly; truncates long conversations |
| `EPOCHS` | Risk of overfitting, training time |
| `LORA_*` | Adapter size and expressiveness |
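The first row of the table can be made concrete with a little arithmetic. The sketch below is only an approximation of how batch size, gradient accumulation, and epochs combine. The function name is hypothetical. The dataset size of 1000 and the batch size of 2 come from this article, while the accumulation and epoch values are placeholders, not the script's real settings.

```python
import math

def training_schedule(num_samples, batch_size, grad_accum_steps, epochs, num_gpus=1):
    """Return (effective_batch, steps_per_epoch, total_steps) for a simple setup."""
    effective_batch = batch_size * grad_accum_steps * num_gpus
    # One optimizer step consumes one effective batch. A partial final batch is
    # counted as a full step here; the Trainer's own rounding may differ by one.
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# 1000 samples and batch 2 appear in this article; accumulation 4 and
# 3 epochs are placeholder assumptions.
print(training_schedule(num_samples=1000, batch_size=2, grad_accum_steps=4, epochs=3))
# -> (8, 125, 375)
```

Doubling `grad_accum_steps` halves the number of optimizer steps without raising per-step GPU memory, which is why the two settings share a row in the table.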
4. Step 1: Tokenizer
4.1 Why padding_side="right"
In causal language model training, each position predicts the next token from left to right. Right padding keeps the valid content aligned at the start of every sequence, so position indices and the attention mask line up without extra handling. With left padding, pad tokens precede the real content; unless they are masked and positions are adjusted accordingly, they act as preceding context and contaminate the loss.
4.2 pad_token = eos_token
The Qwen series often lacks a dedicated pad token. The Trainer needs pad_token_id when batching; reusing eos is standard practice, consistent with the log’s pad_token_id: 248046.
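As a rough illustration of 4.1 and 4.2 together, the following pure-Python sketch pads a toy batch on the right using the eos id as the pad id. It is not the tokenizer's or the Trainer's actual collation code. The function name and token ids are made up for the example, except `248046`, which is the `pad_token_id` quoted from the log.

```python
PAD_ID = 248046       # pad_token_id from the log; the same id as eos in this setup
IGNORE_INDEX = -100   # label value ignored by PyTorch's cross-entropy loss

def pad_right(sequences, pad_id=PAD_ID):
    """Right-pad token id lists to a common length and build mask and labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Mask by position, not by token id, so a genuine trailing eos
        # still contributes to the loss even though pad == eos.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

batch = pad_right([[11, 12, 13, PAD_ID], [21, 22]])
print(batch["input_ids"][1])       # [21, 22, 248046, 248046]
print(batch["attention_mask"][1])  # [1, 1, 0, 0]
print(batch["labels"][1])          # [21, 22, -100, -100]
```

Because pad and eos share one id, masking labels by token id would also hide the real end-of-sequence token; masking by position avoids that.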
4.3 Relationship with verify
verify_lora.py Lines 97–98 also call from_pretrained(BASE_MODEL, trust_remote_code=True). Use the same tokenizer path during verification.
5. Step 2: Base Model and device_map
5.1 Semantics of device_map={"": 0}
HuggingFace Accelerate syntax: The key "" means “all modules not otherwise specified”, and the value 0 is the GPU ID. That is, the entire model goes on a single GPU.
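One way to picture the lookup is as a walk up the dotted module path until a key matches, with `""` as the root that matches everything. The sketch below is a simplified model of that idea, not Accelerate's actual implementation. The function name and the module names are hypothetical.

```python
def resolve_device(module_name, device_map):
    """Walk up the dotted module path until a key in device_map matches."""
    name = module_name
    while name and name not in device_map:
        name = ".".join(name.split(".")[:-1])  # drop the last path component
    if name not in device_map:
        raise KeyError(f"{module_name!r} is not covered by device_map")
    return device_map[name]

# Single-GPU form used in this article: the root key covers every module.
single_gpu = {"": 0}
print(resolve_device("model.layers.7.self_attn.q_proj", single_gpu))  # 0
print(resolve_device("lm_head", single_gpu))                          # 0

# A hand-written split across two GPUs has to name every part explicitly.
split = {"model.embed_tokens": 0, "model.layers.0": 0, "model.layers.1": 1, "lm_head": 1}
print(resolve_device("model.layers.1.mlp.up_proj", split))            # 1
try:
    resolve_device("model.norm", split)
except KeyError as err:
    print("uncovered:", err)
```

This also hints at why the single-key form does not carry over to multi-GPU: `{"": 0}` pins everything to one device by construction, whereas a split needs one entry per module group.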
For comparison, see all_logs.log Lines 15–16.
5.2 dtype=bfloat16
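Weights are loaded in bf16, matching the SFTConfig(bf16=True) autocast training in Step 6. bf16 ran on the V100 used in this series, although native bf16 hardware support only begins with the Ampere generation, so on older GPUs confirm with torch.cuda.is_bf16_supported() before relying on it. If the GPU does not support bf16, you must switch to fp16 and verify numerical stability experimentally.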
5.3 LoRA is Not Injected Yet
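At the end of Step 2 the model is still the plain base model and no adapter weights exist. print_trainable_parameters() is a method of PEFT-wrapped models, so if it is already callable on the model before Step 6 and reports adapter parameters, you have mistakenly called get_peft_model on the model.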
6. Step 3: LoraConfig (Definition Only)
After Step 3, the log prints `Configuring LoRA...` and then the data is loaded. PEFT injection happens later, in Step 6, when the Trainer is constructed.
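Wrong example: calling `get_peft_model` on the model manually at this step instead of leaving the injection to the Trainer.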
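Step 3 only records numbers such as the rank, so it may help to see roughly what those numbers cost once the adapter is injected in Step 6. The sketch below uses the standard LoRA parameter count of `rank * (d_in + d_out)` per adapted linear layer. The function names, layer count, projection shapes, and ranks are all hypothetical and are not the script's `LORA_*` values.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def adapter_size(target_shapes, rank):
    """Total adapter parameters over all adapted linear layers."""
    return sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in target_shapes)

# Hypothetical model: 32 layers, two adapted 2048 -> 2048 projections per layer.
shapes = [(2048, 2048)] * 2 * 32

for rank in (8, 16):
    print(rank, f"{adapter_size(shapes, rank):,}")
# 8 2,097,152
# 16 4,194,304
```

Adapter size grows linearly with the rank and with the number of target modules, which is the "adapter size and expressiveness" trade-off behind the `LORA_*` settings.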
7. Step 4: load_jsonl_data
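Full logic is covered in Article 02. Key recap: each non-empty JSONL line's messages are rendered through the chat template into a single string, and the results are collected into a Dataset with a `text` column.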
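The recap can be sketched without any libraries. This is not the script's `load_jsonl_data`, only a minimal stand-in under stated assumptions: each line is assumed to hold a `messages` key, the function names are hypothetical, and `toy_render` replaces the real `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)` call so the example runs on its own.

```python
import json

def load_jsonl_texts(path, render):
    """Read a JSONL file and return a list of {"text": ...} rows."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # empty lines are skipped, so len(rows) can be < line count
                continue
            record = json.loads(line)
            rows.append({"text": render(record["messages"])})
    return rows

def toy_render(messages):
    """Stand-in for the chat template: one tagged line per message."""
    return "".join(f"<{m['role']}>{m['content']}\n" for m in messages)

# In the real pipeline the rows would then be wrapped, for example with
# datasets.Dataset.from_list(rows), before being handed to the Trainer.
```

On the two booleans: `tokenize=False` makes the template return a string rather than token ids, and `add_generation_prompt=True` appends an open assistant turn, which suits inference rather than training text. The training-style values shown in the lead-in are an assumption about the script.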
The loading log appears at all_logs.log Line 31.
After that, TRL logs two preprocessing passes:
- `Adding EOS to train dataset`
- `Tokenizing train dataset` with `max_length=512`
TrainingProgressCallback
Lines 82–99 define the callback, which is passed into the Trainer in Step 6. With logging_steps=1 it prints the loss at every step; the [Progress xx%] Step ... lines in all_logs.log come from this callback's on_log.
8. Pitfalls
Pitfall 1: Training Qwen3.5-4B + LoRA on MPS
Although the script includes an MPS branch, training a 4B model on a Mac is extremely slow and prone to OOM. It is recommended to only run verify on Mac and use cloud GPUs for training.
Pitfall 2: Relative DATA_PATH
You must run python train_lora_single.py from the project root; otherwise ./data/... cannot be found.
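One possible way to make the script independent of the working directory is to resolve the data path against the script's own location. This is a suggestion, not what train_lora_single.py does. The helper name and the file name `train.jsonl` are hypothetical placeholders.

```python
from pathlib import Path

def resolve_data_path(script_file, relative_path):
    """Resolve relative_path against the directory that contains script_file."""
    return (Path(script_file).resolve().parent / relative_path).resolve()

# Inside the training script this would be called with __file__, e.g.
#   DATA_PATH = resolve_data_path(__file__, "data/train.jsonl")
# where "data/train.jsonl" stands in for the real relative path.
print(resolve_data_path("/project/train_lora_single.py", "data/train.jsonl"))
```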
Pitfall 3: Empty Lines in JSONL
load_jsonl_data skips empty lines; if there are many empty lines, len(dataset) will be less than the file line count, differing from the expected 1000.
Pitfall 4: Changing MAX_SEQ_LEN Without Re-estimating GPU Memory
Going from 512 → 1024 roughly doubles activation memory; on a V100 with batch=2 it may OOM.
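A common way to absorb a longer sequence length is to shrink the micro-batch and raise gradient accumulation so the effective batch stays the same. The sketch below assumes activation memory scales with `batch_size * seq_len`, which is only a first-order estimate since attention can grow faster than linearly. The function name and the accumulation value of 4 are hypothetical.

```python
import math

def rebalance_for_seq_len(batch_size, grad_accum_steps, old_len, new_len):
    """Shrink the micro-batch by the sequence-length growth factor and raise
    accumulation so that batch_size * grad_accum_steps stays (at least) the same."""
    factor = new_len / old_len
    new_batch = max(1, int(batch_size / factor))
    effective = batch_size * grad_accum_steps
    new_accum = math.ceil(effective / new_batch)
    return new_batch, new_accum

# batch=2 and 512 -> 1024 come from this pitfall; accumulation 4 is a placeholder.
new_batch, new_accum = rebalance_for_seq_len(2, 4, old_len=512, new_len=1024)
print(new_batch, new_accum)             # 1 8
print(2 * 512, new_batch * 1024)        # tokens per micro-batch: 1024 1024
```

The number of tokens per micro-batch is unchanged, so peak activation memory should stay in the same range, at the cost of more forward and backward passes per optimizer step.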
9. Summary
- Step 1: Tokenizer, right padding, pad=eos.
- Step 2: Entire model on a single GPU with `device_map={"": gpu_id}`, loaded in bf16.
- Step 3: Only define `LoraConfig`; do not manually call `get_peft_model`.
- Step 4: messages → chat_template → `Dataset(text=...)`.
- Next step: SFTTrainer injects LoRA and starts training (Article 06).
Appendix: Call Chain for First Four Steps in main()
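main(): Step 1 tokenizer → Step 2 base model (device_map) → Step 3 LoraConfig → Step 4 load_jsonl_data() → Step 6 SFTTrainer (Article 06)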