0. Series Closing the Loop

| This Article’s Position | Upstream | This Article’s Deliverable | Downstream |
| --- | --- | --- | --- |
| Article 4/10 | Article 03: Principles | Runnable GPU environment + base model path | Article 05: python train_lora_single.py |

An incorrect environment rarely fails with a clean error and exit. More often it shows up as training that silently slows down because bf16 is unsupported, a model-path symlink that breaks, or a run that fails midway. This article follows the exact combinations that succeeded in real logs.


1. The Actual Problem to Solve

LoRA training depends on:

  1. A CUDA build of PyTorch that can run bf16 (tested on a V100 with 2.5.1+cu124)
  2. Complete Qwen3.5-4B weights at MODEL_PATH
  3. TRL / PEFT / bitsandbytes versions compatible with Transformers 5.x
  4. JSONL at DATA_PATH

Training and verification can run on separate machines: training on an AutoDL V100, verification on a Mac mini with MPS (verify_lora.py was tested there successfully). vLLM is Linux + CUDA only (see Article 10).


2. Implementation Locations

| Path | Description |
| --- | --- |
| LoRA_Demo/README.md lines 31–54 | pip dependencies and ModelScope download |
| LoRA_Demo/train_lora_single.py lines 49–51 | MODEL_PATH / DATA_PATH / OUTPUT_DIR |
| LoRA_Demo/models/Qwen3.5-4B/ | Base model directory (~8.7 GB) |
| LoRA_Demo/.venv/ | Local Python 3.11 virtual environment (optional) |

3. Python Dependencies & Version Pinning

pip install torch transformers peft trl datasets accelerate bitsandbytes tqdm

Versions embedded in checkpoint-750 (written to disk during training):

| Package | Version |
| --- | --- |
| PyTorch | 2.5.1+cu124 |
| transformers | 5.9.0 |
| trl | 1.5.1 |
| peft | 0.19.1 |

Do not use old TRL 0.7 tutorials (TrainingArguments plus a hand-written formatting_func) as a reference for this project. The script uses TRL 1.x’s SFTConfig, whose parameter is named max_length rather than the old max_seq_length. The Trainer may still print max_seq_length=512 internally; the script’s MAX_SEQ_LEN is authoritative.
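The versions in the table above can be compared against the active environment before training. The following is a minimal sketch using only the standard library; the names PINNED, base_version, and check_versions are illustrative and not part of the project's scripts.

```python
from importlib import metadata

# Versions recorded in checkpoint-750 (taken from the table above).
# torch's local build tag (+cu124) is stripped before comparing.
PINNED = {
    "torch": "2.5.1",
    "transformers": "5.9.0",
    "trl": "1.5.1",
    "peft": "0.19.1",
}


def base_version(version: str) -> str:
    """Drop a local build tag such as '+cu124'."""
    return version.split("+", 1)[0]


def check_versions(pinned: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for each package that is missing or differs."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (want {wanted})"
            continue
        if base_version(found) != wanted:
            problems[name] = f"found {found}, want {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_versions(PINNED)
    for name, problem in issues.items():
        print(f"{name}: {problem}")
    if not issues:
        print("all pinned versions match")
```

An exact match is stricter than necessary; a different patch release may work, but it has not been verified against the logs this article relies on.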


4. Obtaining the Base Model

4.1 From ModelScope (China)

pip install modelscope
modelscope download --model Qwen/Qwen3.5-4B

mkdir -p models
ln -sf ~/.cache/modelscope/hub/models/Qwen/Qwen3.5-4B ./models/Qwen3.5-4B

4.2 Verification

test -f models/Qwen3.5-4B/config.json && echo OK
python -c "
from transformers import AutoConfig
c = AutoConfig.from_pretrained('./models/Qwen3.5-4B', trust_remote_code=True)
print(c.model_type, c.hidden_size)
"

train_lora_single.py passes trust_remote_code=True. Qwen3.5 requires custom modeling code from its repository, so do not omit this flag.
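The shell checks above can also be expressed as one Python function that distinguishes a dangling symlink from a missing or incomplete directory. This is a sketch under the assumption that a usable model directory contains config.json and at least one *.safetensors file; check_model_dir is a hypothetical helper, not part of the project.

```python
import os
from pathlib import Path


def check_model_dir(model_path: str) -> list[str]:
    """Return a list of problems; an empty list means the directory looks usable."""
    path = Path(model_path)
    # exists() follows symlinks, so "is a symlink but does not exist"
    # means the link target is gone (a dangling link).
    if path.is_symlink() and not path.exists():
        return [f"dangling symlink -> {os.readlink(path)}"]
    if not path.is_dir():
        return ["model directory not found"]
    problems = []
    if not (path / "config.json").is_file():
        problems.append("config.json missing")
    if not any(path.glob("*.safetensors")):
        problems.append("no *.safetensors weight files")
    return problems


if __name__ == "__main__":
    issues = check_model_dir("./models/Qwen3.5-4B")
    print("OK" if not issues else "\n".join(issues))
```

This only confirms that the expected files are present. It does not verify that the weights are complete or uncorrupted.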


5. Hardware & Precision Strategy

5.1 Training (from real logs)

GPU 0: Tesla V100S-PCIE-32GB  |  显存: 31.7 GB
torch_dtype = bfloat16
bf16 = True
per_device_batch= 2
grad_accum = 2
effective batch= 4
max_seq_length= 512

Source: all_logs.log lines 12–23. (显存 in the log means GPU memory.)
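The batch figures in the log combine multiplicatively. The sketch below reproduces the arithmetic; the single-GPU default and the round figure of 1000 training examples are assumptions for illustration, and the helper names are hypothetical.

```python
import math


def effective_batch(per_device: int, grad_accum: int, n_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_device * grad_accum * n_gpus


def steps_per_epoch(n_examples: int, eff_batch: int) -> int:
    """Optimizer steps needed to see every example once (last step may be partial)."""
    return math.ceil(n_examples / eff_batch)


if __name__ == "__main__":
    eff = effective_batch(per_device=2, grad_accum=2)  # values from the log
    print("effective batch:", eff)
    # Assumed dataset size of 1000 examples, for illustration only.
    print("steps per epoch:", steps_per_epoch(1000, eff))
```

Under that assumed dataset size, the 750 steps of checkpoint-750 would correspond to roughly three passes over the data.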

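The logs show bf16 training running on the V100, so dtype=torch.bfloat16 and bf16=True (train_lora_single.py lines 178 and 210) need no modification. Note that the V100 (Volta, compute capability 7.0) has no hardware bf16 units, which NVIDIA introduced with Ampere (8.0), so bf16 there runs without hardware acceleration and may be slower than on newer GPUs.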
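Whether bf16 is hardware-accelerated depends on the GPU's compute capability: NVIDIA added bf16 hardware with Ampere (compute capability 8.0), while the V100 is 7.0. A minimal sketch for reporting this before training follows; has_native_bf16 and describe_gpu are hypothetical helper names, and the torch import is guarded so the script also runs where torch is absent.

```python
def has_native_bf16(major: int, minor: int = 0) -> bool:
    """NVIDIA GPUs have bf16 hardware from compute capability 8.0 (Ampere) onward."""
    return (major, minor) >= (8, 0)


def describe_gpu() -> str:
    """Summarize GPU 0 and whether its bf16 support is native."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    kind = "native" if has_native_bf16(major, minor) else "no native"
    return f"{name}: compute capability {major}.{minor}, {kind} bf16"


if __name__ == "__main__":
    print(describe_gpu())
```

A "no native bf16" result does not mean training will fail, as the V100 logs show. It is a hint that throughput may be lower than on Ampere or newer cards.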

5.2 Verification (Mac MPS tested)

verify_lora.py lines 101–107:

if torch.backends.mps.is_available():
    return torch.device("mps"), torch.float16  # MPS uses fp16, not bf16

The user's screenshot shows 推理设备: mps | 精度: torch.float16 (inference device: mps | precision: torch.float16), and the LoRA loaded 426 weights successfully. This means verification does not require returning to the V100.
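The MPS branch quoted above is one case in a device fallback chain. The sketch below shows what the full chain might look like. Only the MPS branch comes from verify_lora.py; the function name pick_device_and_dtype, the CUDA branch, and the CPU fallback are assumptions for illustration.

```python
def pick_device_and_dtype(cuda_ok: bool, mps_ok: bool) -> tuple[str, str]:
    """Map backend availability to a (device, dtype name) pair."""
    if cuda_ok:
        return "cuda", "bfloat16"  # assumption: mirror the training dtype on CUDA
    if mps_ok:
        return "mps", "float16"    # from verify_lora.py: MPS uses fp16, not bf16
    return "cpu", "float32"        # assumption: conservative CPU fallback


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch not installed")
    else:
        device, dtype_name = pick_device_and_dtype(
            torch.cuda.is_available(), torch.backends.mps.is_available()
        )
        # Convert the names into the objects that from_pretrained / .to() expect.
        print(torch.device(device), getattr(torch, dtype_name))
```

Keeping the decision in a pure function that takes booleans makes it testable on a machine without any GPU.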

5.3 Deployment

vLLM requires Linux + NVIDIA GPU; RTX 4090 / V100-32GB / A10-24GB are all fine (hardware table in README.md).


6. Pre-Training Self-Check Commands

cd LoRA_Demo
source .venv/bin/activate # if using venv

python -c "import torch; print('CUDA', torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A')"
wc -l data/elderly_chat.jsonl
ls -lh models/Qwen3.5-4B/*.safetensors 2>/dev/null | head -3
python -c "import trl, peft, transformers; print(trl.__version__, peft.__version__, transformers.__version__)"

Expected: CUDA True, JSONL ~1000 lines, safetensors exist, trl ≥ 1.5.
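wc -l only counts lines; it does not show whether each line is valid JSON. A minimal sketch that checks both follows. It makes no assumption about the record schema, and validate_jsonl is a hypothetical helper name.

```python
import json
from pathlib import Path


def validate_jsonl(path: str) -> tuple[int, list[int]]:
    """Return (count of valid records, 1-based line numbers that failed to parse)."""
    valid, bad = 0, []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not counted as errors
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                bad.append(lineno)
    return valid, bad


if __name__ == "__main__":
    data_path = "data/elderly_chat.jsonl"
    if Path(data_path).is_file():
        count, bad_lines = validate_jsonl(data_path)
        print(f"{count} valid records, {len(bad_lines)} bad lines {bad_lines[:10]}")
    else:
        print(f"{data_path} not found")
```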


7. Handling Two Types of Warnings in the Logs

7.1 flash-linear-attention not installed

Falling back to torch implementation.

See Article 09. This warning does not block training; the run is only slightly slower. 41 minutes for 750 steps (about 3.3 s per step) is acceptable.

7.2 Tokenizer special token alignment

Updated tokens: {'eos_token_id': 248046, 'pad_token_id': 248046}

The script already sets tokenizer.pad_token = tokenizer.eos_token (train_lora_single.py line 164). The Trainer automatically aligns config; can be ignored.


8. Pitfalls

Pitfall 1: Symlink pointing to wrong cache path
After a ModelScope update the hub cache path can change, leaving models/Qwen3.5-4B dangling, so from_pretrained fails because it cannot find the model files. Check the link with ls -l models/Qwen3.5-4B.

Pitfall 2: Cloud image includes torch, pip overwrites with CPU version
torch.cuda.is_available() becomes False. Always print CUDA before training.

Pitfall 3: Trying to pip install vllm on Mac for deployment
Mac does not support CUDA. For deployment, use a Linux cloud host; local machines only run verify_lora.py.

Pitfall 4: Disk space
The base model takes ~8.7 GB. output/ holds 15 checkpoints (~81 MB each, including optimizer state, about 1.2 GB in total) plus final_lora (~50 MB). Reserve ≥2 GB for output/.
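The disk figures above can be turned into a quick pre-flight check. This is a sketch only: the default sizes are the approximate values quoted in this article, and output_budget_mb and has_room are hypothetical helper names.

```python
import shutil


def output_budget_mb(
    n_checkpoints: int = 15,
    checkpoint_mb: float = 81.0,
    final_lora_mb: float = 50.0,
) -> float:
    """Approximate space needed in output/, in MB."""
    return n_checkpoints * checkpoint_mb + final_lora_mb


def has_room(path: str, needed_mb: float) -> bool:
    """True if the filesystem containing `path` has at least `needed_mb` free."""
    free_mb = shutil.disk_usage(path).free / 1024**2
    return free_mb >= needed_mb


if __name__ == "__main__":
    needed = output_budget_mb()
    print(f"estimated output size: {needed:.0f} MB")
    # 2048 MB matches the >= 2 GB reservation recommended above.
    print("enough free space:", has_room(".", 2048))
```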


9. Summary

  1. Dependencies: torch 2.5 + transformers 5.9 + trl 1.5 + peft 0.19.
  2. Base model: Download via ModelScope + symlink to ./models/Qwen3.5-4B.
  3. Training: V100-32GB + bf16 + batch 2×2.
  4. Verification: Mac MPS works with float16, slightly different from training’s bf16 but tested successfully.
  5. Deployment: Linux only, see Article 10.

Appendix: Path Constants

# LoRA_Demo/train_lora_single.py lines 49-51
MODEL_PATH = "./models/Qwen3.5-4B"
DATA_PATH = "./data/elderly_chat.jsonl"
OUTPUT_DIR = "./output/lora_elderly_single"
# Change paths only here; BASE_MODEL / LORA_PATH in verify_lora.py must be updated manually
