0. Series Closure
| Position in Series | Upstream | This Article’s Output | Downstream |
|---|---|---|---|
| Article 2/10 | Article 01: Scenario Definition | Trainable JSONL Specification | Article 05: apply_chat_template · Article 08: SYSTEM_PROMPT |
Data errors are directly reflected in loss, but format errors are more insidious: the model learns the arrangement of special tokens but fails to learn empathy. All rules in this article are aligned with the actual consumption pattern of load_jsonl_data.
1. What Problem to Solve
SFT is not about “stuffing documents into the model,” but about enabling the model to generate token sequences that closely match the labeled assistant text under the (system, user) conditions.
Data challenges in this project:
- Highly consistent style: The assistant generally follows “first empathize + then accompany,” with high repetition in sentence patterns. This is suitable for small-data SFT but can easily be criticized as “template-like”—this is a scenario trade-off, not a data flaw.
- User must be colloquial: Average length ~20 Chinese characters, sounding like an elderly person speaking, not a survey.
- System must be fixed: All 1000 samples share the same system prompt (see JSONL line 1), consistent with
verify_lora.pylines 33–36 and the system prompt in vLLM requests.
2. Implementation Locations
| File | Role |
|---|---|
LoRA_Demo/data/elderly_chat.jsonl |
Training corpus, 1000 lines |
LoRA_Demo/train_lora_single.py → load_jsonl_data() |
Lines 102–127, JSONL → Dataset({"text": ...}) |
LoRA_Demo/verify_lora.py → SYSTEM_PROMPT |
Must match JSONL system prompt during inference |
hexo-cli/docs/xx.jsonl |
Extended set from same source (generated by semantic training set script; MD5 differs from demo file but structure is identical) |
Training logs show loading 1000 entries (all_logs.log line 31 1000/1000). If local wc -l yields 999, check for missing trailing newline to avoid unknowingly missing one training sample.
3. JSONL Single Entry Specification
3.1 Structure
1 | |
3.2 Roles and Responsibilities
| Role | Required | Role in Training |
|---|---|---|
| system | Yes | Global persona; must also be passed during inference |
| user | Yes | Conditional input, simulates elderly person’s venting |
| assistant | Yes | Supervised label; SFT learns its token sequence |
Do not split into two files or use only user-assistant pairs: Qwen3.5’s chat_template expects the full messages; missing system will cause training/inference distribution mismatch.
4. System Prompt Sentence-by-Sentence Design Rationale
Full text (same as repo):
1 | |
| Sentence | Constraint Type | Failure Mode Addressed |
|---|---|---|
| gentle, patient, and empathetic | Personality | Cold, robotic assistant tone |
| slowly and softly | Pacing | Short, imperative commands |
| empathize more, listen more, affirm more | Behavior | Skipping emotions, jumping to solutions |
| do not preach, contradict, or rush | Negative constraint | “You should…”, “Don’t think that way” |
Negative constraints are especially important for general-purpose models: they default to being “helpful assistants” and easily slip into a lecturing mode.
5. Five Psychological Themes and Sample Distribution
| Category | Count | Psychology Covered | User Example Direction |
|---|---|---|---|
| Loneliness & desire for companionship | 200 | Feeling neglected, no one to talk to | “The house is eerily quiet”, “Not even a partner to talk to” |
| Health anxiety & fear of death | 200 | Disease fear, death anxiety | “Blood pressure is up, afraid I might die one day”, “Afraid to look at the medical report” |
| Fear of being a burden & self-blame | 200 | Guilt, afraid of burdening children | “Just causing trouble for the kids”, “Shouldn’t have asked them to take leave” |
| Nostalgia & desire for recognition | 200 | Sense of worth, being seen | “Life was hard before but happy”, “No one wants to hear about my past” |
| Low mood & desire to be needed | 200 | Depressive tendency, feeling useless | “No energy to live”, “No one needs me” |
When expanding data, balance across categories to avoid the model learning only empty phrases like “Don’t be lonely.”
6. Good Samples / Bad Samples (Quality Check)
Qualified assistant (from JSONL line 2):
1 | |
Features: Acknowledge emotion → light open-ended invitation → no imperative sentences.
Unqualified assistant (do not include in training set):
1 | |
Features: Checklist of suggestions, no emotional alignment with “you”, sounds like a health app rather than companionship.
7. How Data Enters the Training Loop
Core code: LoRA_Demo/train_lora_single.py lines 102–127
1 | |
7.1 Template Differences: Training vs Inference
| Phase | add_generation_prompt |
File Location |
|---|---|---|
| Training | False |
train_lora_single.py |
| Inference | True |
verify_lora.py lines 159–164 |
During training, the model must see the complete multi-turn format; during inference, append the assistant start token to let the model begin generation.
7.2 TRL Secondary Processing (Visible in Logs)
all_logs.log lines 32–33:
1 | |
Then truncated by SFTConfig(max_length=512). If a single dialogue exceeds 512 tokens in length, the tail is truncated—usually no impact on this project (short sentence dialogues); if longer responses are added in the future, raise MAX_SEQ_LEN and re-evaluate GPU memory.
8. Data Quantity: Is 1000 Enough?
For this project (narrow-domain style SFT):
- At Step 250, loss dropped from 2.81 to 0.24 (
all_logs.log), indicating 1 epoch is sufficient to learn the main format. - At 3 epochs, loss reaches 0.13, diminishing returns, matching expectations for small data.
If expanding to knowledge-based tasks (medication guidelines, policy interpretation), 1000 is insufficient; if still emotional companionship style, 1000 + balanced five categories is a reasonable starting point.
9. Pitfalls
Pitfall 1: JSONL is valid but UTF-8 BOM causes failure in parsing the first line
The first sample is silently lost; loss still decreases but you won’t notice the missing sample. Ensure no BOM before json.loads, or reconcile log’s “1000” with wc -l.
Pitfall 2: Assistant content copy-pasted causing identical user-assistant pairs
SFT will overfit to repetitive sentences and generalize poorly to unseen users. When expanding data with template combinations, deduplicate (see existing_users set in the hexo-cli expansion script).
Pitfall 3: Mac validation uses system prompt inconsistent with trainingverify_lora.py‘s SYSTEM_PROMPT must match JSONL verbatim; if you change one but forget the other, you’ll mistakenly conclude “fine-tuning didn’t work.”
Pitfall 4: Measured generalization boundary (Mac MPS validation)
For “I’m afraid to look at the medical report, can’t sleep after seeing it,” the LoRA response might reuse similar phrasing like “afraid of burdening others” rather than precisely naming “medical report.” Be honest in product docs: style transfer succeeded ≠ every sentence is precisely customized.
10. Summary
- Data format:
messageswith three roles, one JSONL per line. - The system prompt is the “personality constitution” for the entire series; must be consistent across training, validation, and vLLM.
- Five psychological themes, 200 each, controlling style and coverage.
load_jsonl_data+apply_chat_templateis the training entry point, with parameters opposite to inference.- Quality check focus: colloquial user, empathetic assistant, no lecturing, no repetition.
Appendix: load_jsonl_data Line-by-Line Explanation
1 | |
Series Navigation
| Article | Link |
|---|---|
| Previous | 01 · Why Do This |
| Next | 03 · LoRA Principles |
| Index | README |