0. Where This Article Fits in the Series
| This Article’s Position | Upstream | This Article’s Deliverable | Downstream |
|---|---|---|---|
| Article 4/10 | Article 03: Principles | Runnable GPU environment + base model path | Article 05: python train_lora_single.py |
A misconfigured environment rarely fails with a clean error and exit. More often bf16 is unsupported and training silently slows down, or the model path symlink breaks and the run fails midway through training. This article sticks to the exact combinations that succeeded in the real logs.
1. The Actual Problem to Solve
LoRA training depends on:
- CUDA + bf16 enabled PyTorch (V100 tested with 2.5.1+cu124)
- Complete Qwen3.5-4B weights at MODEL_PATH
- TRL / PEFT / bitsandbytes versions compatible with Transformers 5.x
- JSONL at DATA_PATH
Verification and training can be on separate machines: training on AutoDL V100, verification on Mac mini MPS (user tested verify_lora.py successfully). vLLM is Linux+CUDA only (Article 10).
2. Implementation Locations
| Path | Description |
|---|---|
| LoRA_Demo/README.md lines 31–54 | pip dependencies and ModelScope download |
| LoRA_Demo/train_lora_single.py lines 49–51 | MODEL_PATH / DATA_PATH / OUTPUT_DIR |
| LoRA_Demo/models/Qwen3.5-4B/ | Base model directory (~8.7 GB) |
| LoRA_Demo/.venv/ | Local Python 3.11 virtual environment (optional) |
3. Python Dependencies & Version Pinning
The pip install command for these dependencies is in LoRA_Demo/README.md (lines 31–54).
Versions embedded in checkpoint-750 (written to disk during training):
| Package | Version |
|---|---|
| PyTorch | 2.5.1+cu124 |
| transformers | 5.9.0 |
| trl | 1.5.1 |
| peft | 0.19.1 |
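One way to compare an environment against the table above is to read the installed versions with the standard library. The sketch below is illustrative only: the helper names `installed_version` and `compare_with_pins` are made up for this example, and the version reported by package metadata may omit a local tag such as `+cu124`, in which case torch would show up as a mismatch even when the build is fine.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions recorded in checkpoint-750 (copied from the table above).
PINNED = {
    "torch": "2.5.1+cu124",
    "transformers": "5.9.0",
    "trl": "1.5.1",
    "peft": "0.19.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_with_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatching package to a (pinned, installed) pair."""
    mismatches = {}
    for package, pinned in pins.items():
        found = installed_version(package)
        if found != pinned:
            mismatches[package] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in compare_with_pins(PINNED).items():
        print(f"{name}: pinned {want}, installed {have or 'not installed'}")
```

An empty result means every pinned package matches exactly; anything else is worth a look before starting a long run.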
Do not use old TRL 0.7 tutorials (TrainingArguments plus a hand-written formatting_func) as a reference for this project. The script uses TRL 1.x's SFTConfig, where the parameter is named max_length instead of the old max_seq_length. The Trainer may still print max_seq_length=512 internally; the script's MAX_SEQ_LEN is authoritative.
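To make the naming difference concrete, here is a minimal sketch of building an SFTConfig with max_length. It is not taken from train_lora_single.py: the helper names are hypothetical, and the value 512 for MAX_SEQ_LEN is an assumption based on the printed log value, so check it against the script.

```python
MAX_SEQ_LEN = 512  # assumed value; the script's own MAX_SEQ_LEN is authoritative


def sft_config_kwargs(output_dir: str, max_seq_len: int = MAX_SEQ_LEN) -> dict:
    """Keyword arguments for SFTConfig, using max_length rather than max_seq_length."""
    return {
        "output_dir": output_dir,
        "max_length": max_seq_len,  # current name; old tutorials use max_seq_length
        "bf16": True,
    }


def build_sft_config(output_dir: str):
    """Create the SFTConfig; TRL is imported lazily so the helper above works without it."""
    from trl import SFTConfig

    return SFTConfig(**sft_config_kwargs(output_dir))
```

Keeping the keyword arguments in a plain dict makes it easy to print or diff them before the Trainer is created.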
4. Obtaining the Base Model
4.1 From ModelScope (China)
The ModelScope download command is in LoRA_Demo/README.md (lines 31–54).
4.2 Verification
train_lora_single.py passes trust_remote_code=True. Qwen3.5 requires custom modeling code from its repository, so do not omit this argument.
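Before loading the model, it can help to confirm that the directory is complete. The following is a sketch under assumptions: it checks only for config.json and at least one *.safetensors file, which is typical of a Hugging Face style layout but is not a guarantee that every shard is present. The function names are illustrative.

```python
from pathlib import Path
from typing import List


def check_model_dir(model_path: str) -> List[str]:
    """Return a list of problems found in a Hugging Face style model directory."""
    root = Path(model_path)
    # is_dir() follows symlinks, so a dangling link is reported here as well.
    if not root.is_dir():
        return [f"{root} is not a directory (or is a dangling symlink)"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    if not sorted(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


def total_size_gb(model_path: str) -> float:
    """Sum the size of all files under model_path, in GB (1e9 bytes)."""
    root = Path(model_path)
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file()) / 1e9
```

For the base model described here, a total well below the ~8.7 GB mentioned earlier would suggest an interrupted download.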
5. Hardware & Precision Strategy
5.1 Training (from real logs)
Source: all_logs.log lines 12–23.
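The logged V100 run trained with bf16, so train_lora_single.py lines 178 and 210 (dtype=torch.bfloat16 and bf16=True) need no modification. Note that V100 (Volta) has no native bf16 tensor cores, so bf16 may run slower there than on Ampere or newer cards.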
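PyTorch being able to run bf16 on a GPU is not the same as that GPU having native bf16 units. A minimal sketch for telling the two apart, assuming that native bf16 starts at compute capability 8.0 (Ampere); the helper names are made up for this example:

```python
from typing import Tuple


def has_native_bf16(capability: Tuple[int, int]) -> bool:
    """True for compute capability 8.0 (Ampere) and newer."""
    return tuple(capability) >= (8, 0)


def report_bf16() -> str:
    """Describe bf16 support on GPU 0; torch is imported lazily."""
    import torch

    if not torch.cuda.is_available():
        return "no CUDA device visible"
    capability = torch.cuda.get_device_capability(0)
    usable = torch.cuda.is_bf16_supported()
    native = has_native_bf16(capability)
    return f"capability={capability}, bf16 usable={usable}, bf16 native={native}"
```

A V100 reports capability (7, 0), so `has_native_bf16` returns False for it even when PyTorch accepts bf16 tensors.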
5.2 Verification (Mac MPS tested)
The relevant code is in verify_lora.py, lines 101–107.
The user's screenshot shows 推理设备: mps | 精度: torch.float16 (inference device: mps | precision: torch.float16) and 426 LoRA weights loaded successfully, so verification does not require going back to the V100.
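A device and precision choice that matches the behaviour described above could look like the sketch below. It is not the content of verify_lora.py; the function names and the CPU fallback to float32 are assumptions made for illustration.

```python
from typing import Tuple


def choose_device_and_dtype(cuda: bool, mps: bool) -> Tuple[str, str]:
    """Pick a (device, dtype name) pair: CUDA with bf16, MPS with fp16, else CPU with fp32."""
    if cuda:
        return "cuda", "bfloat16"
    if mps:
        return "mps", "float16"
    return "cpu", "float32"


def detect() -> Tuple[str, str]:
    """Query torch for the available backends; torch is imported lazily."""
    import torch

    return choose_device_and_dtype(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
```

Separating the decision from the torch queries keeps the logic easy to check on a machine that has no GPU at all.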
5.3 Deployment
vLLM requires Linux + NVIDIA GPU; RTX 4090 / V100-32GB / A10-24GB are all fine (hardware table in README.md).
6. Pre-Training Self-Check Commands
A self-check before training should report: CUDA True, JSONL ~1000 lines, safetensors present, trl ≥ 1.5.
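Two of these checks, the JSONL record count and the minimum trl version, need nothing beyond the standard library. The sketch below is one possible implementation, not the article's original commands; the function names are hypothetical and the version parser deliberately ignores suffixes such as `+cu124` or `rc1`.

```python
import json
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.5.1+cu124' into (2, 5, 1); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def version_at_least(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def count_jsonl_records(path: str) -> int:
    """Count non-empty lines, raising ValueError on the first line that is not valid JSON."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {number} is not valid JSON: {exc}") from exc
            count += 1
    return count
```

Calling `count_jsonl_records` on the file behind DATA_PATH and `version_at_least` on the installed trl version covers the second and fourth expectations; the CUDA and safetensors checks still need torch and the model directory.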
7. Handling Two Types of Warnings in the Logs
7.1 flash-linear-attention not installed
This warning does not block training; it only makes it slightly slower, and 41 minutes for 750 steps is acceptable. See Article 09 for details.
7.2 Tokenizer special token alignment
The script already sets tokenizer.pad_token = tokenizer.eos_token (train_lora_single.py line 164). The Trainer aligns the model config automatically, so this warning can be ignored.
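For reference, a guarded variant of that assignment is sketched below. It differs from the script, which sets the pad token unconditionally; the helper name `ensure_pad_token` is hypothetical and works on any object with pad_token and eos_token attributes.

```python
def ensure_pad_token(tokenizer) -> bool:
    """Set pad_token to eos_token when it is unset; return True if a change was made."""
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
        return True
    return False
```

The return value makes it easy to log whether the tokenizer was modified.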
8. Pitfalls
Pitfall 1: Symlink pointing to wrong cache path
After a ModelScope update the hub cache path can change, leaving models/Qwen3.5-4B as a dangling symlink, so from_pretrained raises a FileNotFound error. Inspect the link with ls -l models/Qwen3.5-4B.
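The same check can be scripted so it runs before every training job. A minimal sketch, with a made-up helper name, that distinguishes a healthy symlink from a dangling one:

```python
import os


def symlink_status(path: str) -> str:
    """Classify path as 'symlink', 'dangling symlink', 'regular', or 'missing'."""
    if os.path.islink(path):
        # os.path.exists follows the link, so False here means the target is gone.
        return "symlink" if os.path.exists(path) else "dangling symlink"
    return "regular" if os.path.exists(path) else "missing"


if __name__ == "__main__":
    link = "models/Qwen3.5-4B"
    print(link, "->", os.path.realpath(link), f"({symlink_status(link)})")
```

If the status is "dangling symlink", recreate the link against the new cache location before calling from_pretrained.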
Pitfall 2: Cloud image ships with torch, and pip overwrites it with a CPU build
After the overwrite, torch.cuda.is_available() returns False. Always print the CUDA status before training.
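A CPU-only build can be told apart from a CUDA build with no visible GPU by looking at torch.version.cuda, which is None on CPU-only builds. The sketch below wraps that in a small report; the function names and message wording are illustrative.

```python
from typing import Optional


def describe_torch_build(version: str, cuda_version: Optional[str], cuda_available: bool) -> str:
    """Summarise the torch build from values that torch itself reports."""
    if cuda_version is None:
        return f"torch {version} is a CPU-only build; reinstall a CUDA wheel"
    if not cuda_available:
        return f"torch {version} was built for CUDA {cuda_version}, but no GPU is visible"
    return f"torch {version} with CUDA {cuda_version} is ready"


def check_torch() -> str:
    """Collect the values from the installed torch; imported lazily."""
    import torch

    return describe_torch_build(
        torch.__version__, torch.version.cuda, torch.cuda.is_available()
    )
```

Printing `check_torch()` right after any pip install makes an accidental overwrite visible immediately rather than at the start of training.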
Pitfall 3: Trying to pip install vllm on Mac for deployment
Mac does not support CUDA. For deployment, use a Linux cloud host; local machines only run verify_lora.py.
Pitfall 4: Disk space
Budget for the base model (8.7 GB) plus 15 checkpoints (~81 MB each, including optimizer state) plus final_lora. Reserve at least 2 GB in output/; the final LoRA adapter alone is only ~50 MB.
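The arithmetic behind that reservation is 15 × 81 MB + 50 MB = 1265 MB, which leaves headroom under 2 GB. A short sketch that computes the budget and checks free space with the standard library; the function names and defaults are assumptions taken from the figures above:

```python
import shutil


def checkpoint_budget_mb(n_checkpoints: int = 15, checkpoint_mb: int = 81,
                         final_lora_mb: int = 50) -> int:
    """Approximate space needed under output/, in MB."""
    return n_checkpoints * checkpoint_mb + final_lora_mb


def has_room(path: str, required_gb: float = 2.0) -> bool:
    """True if the filesystem holding path has at least required_gb free (1 GB = 1e9 bytes)."""
    return shutil.disk_usage(path).free >= required_gb * 1e9


if __name__ == "__main__":
    print(f"estimated output size: {checkpoint_budget_mb()} MB")
    print(f"enough free space: {has_room('.')}")
```

The base model's 8.7 GB is separate from this figure and needs its own space under models/.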
9. Summary
- Dependencies: torch 2.5 + transformers 5.9 + trl 1.5 + peft 0.19.
- Base model: Download via ModelScope + symlink to ./models/Qwen3.5-4B.
- Training: V100-32GB + bf16 + batch 2×2.
- Verification: Mac MPS works with float16, slightly different from training’s bf16 but tested successfully.
- Deployment: Linux only, see Article 10.
Appendix: Path Constants
MODEL_PATH, DATA_PATH, and OUTPUT_DIR are defined in LoRA_Demo/train_lora_single.py, lines 49–51.
Series Navigation
| Article | Link |
|---|---|
| Previous | 03 · LoRA Principles |
| Next | 05 · SFT in Practice (Part 1) |
| Index | README |