Single-Card SFT Practical (Part 1): Tokenizer, Base Model, and LoRA Configuration

0. Series Completion

Position in Series	Upstream	This Output	Downstream
Article 5/10	Article 04: Environment Ready	Tokenizer, Base Model, LoraConfig, Dataset	Article 06: SFTTrainer.train()

By the end of this article: The model is on GPU, the LoRA configuration is defined, the data is in {"text": ...} format, but no gradient updates have occurred yet.

1. Practical Issues to Address

Beginners often get stuck when reading training scripts:

What does device_map={"": 0} mean, and why can’t it be written that way for multi-GPU?
Why do you need to write LoraConfig and then pass it to SFTTrainer?
What happens if you swap the two booleans in apply_chat_template?

This article aligns the execution order of the first four steps in main().

2. Implementation Locations

Symbol	Line (approx.)	Purpose
`parse_args()`	72–79	`--gpu_id`
`print_device_info()`	130–144	Print effective batch
`load_jsonl_data()`	102–127	JSONL → Dataset
`main()` Step 1–4	156–200	Scope of this article

Launch:

1 2	`python train_lora_single.py python train_lora_single.py --gpu_id 0`

3. Configuration Zone: Impact of Changes

# LoRA_Demo/train_lora_single.py Lines 49-65
MODEL_PATH = "./models/Qwen3.5-4B"
DATA_PATH = "./data/elderly_chat.jsonl"
OUTPUT_DIR = "./output/lora_elderly_single"

MAX_SEQ_LEN = 512
BATCH_SIZE = 2
LEARNING_RATE = 2e-4
EPOCHS = 3
GRADIENT_ACCUMULATION_STEPS = 2

LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05

Change	Impact
`BATCH_SIZE` / `GRADIENT_ACCUMULATION_STEPS`	GPU memory, effective batch size, total steps
`MAX_SEQ_LEN`	Linearly related to GPU memory, truncates long conversations
`EPOCHS`	Risk of overfitting, training time
`LORA_*`	Adapter size and expressiveness

4. Step 1: Tokenizer

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    padding_side="right",
)
tokenizer.pad_token = tokenizer.eos_token

4.1 Why padding_side=”right”

In causal language model training: the sequence predicts the next token from left to right. Padding on the right aligns valid content on the left, so the attention mask works correctly. Putting padding on the left would treat pad tokens as preceding context, contaminating the loss.

4.2 pad_token = eos_token

The Qwen series often lacks a dedicated pad token. The Trainer needs pad_token_id when batching; reusing eos is standard practice, consistent with the log’s pad_token_id: 248046.

4.3 Relationship with verify

verify_lora.py Line 97–98 also uses from_pretrained(BASE_MODEL, trust_remote_code=True). Do not use a different tokenizer path during verification.

5. Step 2: Base Model and device_map

if torch.cuda.is_available():
    device_map = {"": gpu_id}
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device_map = {"": "mps"}
else:
    device_map = {"": "cpu"}

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    dtype=torch.bfloat16,
    device_map=device_map,
    trust_remote_code=True,
)

5.1 Semantics of device_map={“”: 0}

HuggingFace Accelerate syntax: The key "" means “all modules not otherwise specified”, and the value 0 is the GPU ID. That is, the entire model goes on a single GPU.

Log comparison (all_logs.log Lines 15–16):

1 2	`Parallel strategy = Single GPU (model fully on GPU 0) device_map = {'': 0}`

5.2 dtype=bfloat16

Weights are loaded in bf16, matching the SFTConfig(bf16=True) autocast training in Step 6. V100 supports it; if the GPU does not support bf16, you must switch to fp16 and verify numerical stability experimentally.

5.3 LoRA is Not Injected Yet

If print_trainable_parameters() is called before Step 6, it should show close to 0% trainable parameters. If it shows a high percentage, you have mistakenly called get_peft_model on the model.

6. Step 3: LoraConfig (Definition Only)

lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)

After Step 3, the log prints Configuring LoRA..., then data is loaded—PEFT injection happens later in Step 6 when constructing the Trainer.

Wrong example:

1 2	`from peft import get_peft_model model = get_peft_model(model, lora_config) # ❌ Duplicated with SFTTrainer`

7. Step 4: load_jsonl_data

Full logic is covered in Article 02. Key recap:

text = tokenizer.apply_chat_template(
    obj["messages"],
    tokenize=False,
    add_generation_prompt=False,
)

Loading log (all_logs.log Line 31):

1	`Loading JSONL data: 1000/1000 [00:00<00:00, 5684.83 entries/s]`

After that, TRL will:

Adding EOS to train dataset
Tokenizing train dataset with max_length=512

TrainingProgressCallback

Lines 82–99 define the callback, passed into Step 6. With logging_steps=1, it prints loss step by step—the [Progress xx%] Step ... lines in all_logs.log come from this callback’s on_log.

8. Pitfalls

Pitfall 1: Training Qwen3.5-4B + LoRA on MPS
Although the script includes an MPS branch, training a 4B model on a Mac is extremely slow and prone to OOM. It is recommended to only run verify on Mac and use cloud GPUs for training.

Pitfall 2: Relative DATA_PATH
You must run python train_lora_single.py from the project root; otherwise ./data/... cannot be found.

Pitfall 3: Empty Lines in JSONL
load_jsonl_data skips empty lines; if there are many empty lines, len(dataset) will be less than the file line count, differing from the expected 1000.

Pitfall 4: Changing MAX_SEQ_LEN Without Re-estimating GPU Memory
Going from 512 → 1024 roughly doubles activation memory; on a V100 with batch=2 it may OOM.

9. Summary

Step 1: Tokenizer, right padding, pad=eos.
Step 2: Entire model on a single GPU with device_map={"": gpu_id}, loaded in bf16.
Step 3: Only define LoraConfig; do not manually call get_peft_model.
Step 4: messages → chat_template → Dataset(text=...).
Next step SFTTrainer injects LoRA and starts training (Article 06).

Appendix: Call Chain for First Four Steps in main()

# LoRA_Demo/train_lora_single.py

args = parse_args()
print_device_info(args.gpu_id)

tokenizer = AutoTokenizer.from_pretrained(...)     # Step 1
model = AutoModelForCausalLM.from_pretrained(...)  # Step 2
lora_config = LoraConfig(...)                      # Step 3
dataset = load_jsonl_data(DATA_PATH, tokenizer)    # Step 4
# → Next: SFTConfig + SFTTrainer (Article 06)

Article	Link
Previous	04 · Environment Setup
Next	06 · SFT in Practice (Part 2)
Index	README

← Back to LoRA Elderly Companion Series