0. Series Closure

| Position in Series | Upstream | This Article’s Output | Downstream |
| --- | --- | --- | --- |
| Article 2/10 | Article 01: Scenario Definition | Trainable JSONL Specification | Article 05: apply_chat_template · Article 08: SYSTEM_PROMPT |

Data errors show up directly in the loss, but format errors are more insidious: the model learns the arrangement of special tokens yet fails to learn empathy. Every rule in this article follows how load_jsonl_data actually consumes the data.


1. What Problem to Solve

SFT is not about “stuffing documents into the model,” but about enabling the model to generate token sequences that closely match the labeled assistant text under the (system, user) conditions.

Data challenges in this project:

  1. Highly consistent style: The assistant generally follows “first empathize + then accompany,” with high repetition in sentence patterns. This suits small-data SFT but is easily criticized as “template-like”; that is a scenario trade-off, not a data flaw.
  2. User must be colloquial: Average length ~20 Chinese characters, sounding like an elderly person speaking, not a survey.
  3. System must be fixed: All 1000 samples share the same system prompt (see JSONL line 1), consistent with verify_lora.py lines 33–36 and the system prompt in vLLM requests.

2. Implementation Locations

| File | Role |
| --- | --- |
| LoRA_Demo/data/elderly_chat.jsonl | Training corpus, 1000 lines |
| LoRA_Demo/train_lora_single.py, load_jsonl_data() | Lines 102–127, JSONL → Dataset({"text": ...}) |
| LoRA_Demo/verify_lora.py, SYSTEM_PROMPT | Must match the JSONL system prompt during inference |
| hexo-cli/docs/xx.jsonl | Extended set from the same source (generated by the semantic training set script; MD5 differs from the demo file but the structure is identical) |

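Training logs show 1000 entries loaded (all_logs.log line 31: 1000/1000). If a local wc -l reports 999, first check for a missing trailing newline: wc -l counts newline characters, so an unterminated last line goes uncounted. If the file does end with a newline, one training sample really is missing.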
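The reconciliation between the log count and wc -l can be sketched in a few lines. This is a minimal illustration, not the project's code: `count_jsonl_records` is a hypothetical helper name, and it works on raw bytes so that the trailing-newline case is visible.

```python
def count_jsonl_records(raw: bytes):
    """Return (records, newline_count, ends_with_newline) for JSONL bytes.

    Hypothetical helper for illustration; not part of the project code.
    """
    # Non-empty lines are what a loader that skips blank lines would consume.
    records = sum(1 for line in raw.splitlines() if line.strip())
    # `wc -l` counts newline characters, not lines.
    newline_count = raw.count(b"\n")
    return records, newline_count, raw.endswith(b"\n")
```

For the training file it could be called as `count_jsonl_records(Path("LoRA_Demo/data/elderly_chat.jsonl").read_bytes())` with `Path` imported from `pathlib`. If the first value is 1000 while the second is 999 and the third is False, the only thing missing is the final newline.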


3. JSONL Single Entry Specification

3.1 Structure

{
  "messages": [
    {"role": "system", "content": "You are a gentle, patient, and empathetic elderly emotional companionship assistant. Speak slowly and softly, empathize more, listen more, affirm more; do not preach, contradict, or rush."},
    {"role": "user", "content": "The children are all busy, no one to talk to all day, the house is eerily quiet."},
    {"role": "assistant", "content": "I truly understand that feeling of emptiness in the quietness. You are not alone; I am always here with you. Let's chat slowly, say whatever you like."}
  ]
}

3.2 Roles and Responsibilities

| Role | Required | Role in Training |
| --- | --- | --- |
| system | Yes | Global persona; must also be passed during inference |
| user | Yes | Conditional input; simulates an elderly person’s venting |
| assistant | Yes | Supervised label; SFT learns its token sequence |

Do not split the data into two files or keep only user-assistant pairs: Qwen3.5’s chat_template expects the full messages list, and omitting the system message causes a training/inference distribution mismatch.
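The structure rules above can be checked mechanically before training. The following is a minimal sketch under the assumption that every record is a single-turn dialogue with exactly the three roles in order; `validate_record` and `EXPECTED_ROLES` are hypothetical names, not part of the project code.

```python
import json

# Assumption: every record is system -> user -> assistant, in this order.
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(line: str):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = obj.get("messages") if isinstance(obj, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]
    problems = []
    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"roles are {roles}, expected {EXPECTED_ROLES}")
    for m in messages:
        if isinstance(m, dict):
            content = m.get("content")
            # Empty or whitespace-only content gives the model nothing to learn.
            if not isinstance(content, str) or not content.strip():
                problems.append(f"empty content for role {m.get('role')!r}")
    return problems
```

Running it over every non-empty line and printing the line number alongside any non-empty result is usually enough to catch records that lost their system message.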


4. System Prompt Sentence-by-Sentence Design Rationale

Full text (same as repo):

You are a gentle, patient, and empathetic elderly emotional companionship assistant,
speak slowly and softly,
empathize more, listen more, affirm more,
do not preach, contradict, or rush.

| Sentence | Constraint Type | Failure Mode Addressed |
| --- | --- | --- |
| gentle, patient, and empathetic | Personality | Cold, robotic assistant tone |
| slowly and softly | Pacing | Short, imperative commands |
| empathize more, listen more, affirm more | Behavior | Skipping emotions, jumping to solutions |
| do not preach, contradict, or rush | Negative constraint | “You should…”, “Don’t think that way” |

Negative constraints are especially important for general-purpose models: they default to being “helpful assistants” and easily slip into a lecturing mode.


5. Five Psychological Themes and Sample Distribution

| Category | Count | Psychology Covered | User Example Direction |
| --- | --- | --- | --- |
| Loneliness & desire for companionship | 200 | Feeling neglected, no one to talk to | “The house is eerily quiet”, “Not even a partner to talk to” |
| Health anxiety & fear of death | 200 | Disease fear, death anxiety | “Blood pressure is up, afraid I might die one day”, “Afraid to look at the medical report” |
| Fear of being a burden & self-blame | 200 | Guilt, afraid of burdening children | “Just causing trouble for the kids”, “Shouldn’t have asked them to take leave” |
| Nostalgia & desire for recognition | 200 | Sense of worth, being seen | “Life was hard before but happy”, “No one wants to hear about my past” |
| Low mood & desire to be needed | 200 | Depressive tendency, feeling useless | “No energy to live”, “No one needs me” |

When expanding data, balance across categories to avoid the model learning only empty phrases like “Don’t be lonely.”
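A balance check is easy to automate once each sample carries a category label. The JSONL format described here has no category field, so the sketch below assumes the labels are kept in a separate list aligned with the samples; `category_imbalance` and the 10% tolerance are hypothetical choices for illustration.

```python
from collections import Counter


def category_imbalance(labels, categories, tolerance=0.1):
    """Return {category: count} for categories that deviate from an even split.

    `labels` is one label per sample; `categories` lists every expected
    category, so a category with zero samples is reported as well.
    """
    counts = Counter(labels)
    mean = len(labels) / len(categories)
    return {
        c: counts.get(c, 0)
        for c in categories
        if abs(counts.get(c, 0) - mean) > tolerance * mean
    }
```

An empty result means every category is within the tolerance of an even split; for the current set that would be 200 samples in each of the five categories.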


6. Good Samples / Bad Samples (Quality Check)

Qualified assistant (from JSONL line 2):

user: Now I'm all alone, not even a partner to talk to.
assistant: Being alone for a long time can feel lonely. Would you like to tell me what you usually enjoy doing? I'm all ears.

Features: Acknowledge emotion → light open-ended invitation → no imperative sentences.

Unqualified assistant (do not include in training set):

assistant: I suggest you participate in community activities, use video call software, cultivate hobbies, and seek psychological counseling if necessary.

Features: Checklist of suggestions, no emotional alignment with “you”, sounds like a health app rather than companionship.
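A rough automatic filter can flag candidates like the unqualified sample for manual review. The sketch below is only a heuristic under loud assumptions: the marker phrases are hypothetical English stand-ins that would need replacing for the real corpus, and a long comma-separated run is used as a crude proxy for a checklist. `looks_like_lecturing` is not part of the project code.

```python
import re

# Hypothetical marker phrases; replace them with ones suited to the real corpus.
ADVICE_MARKERS = ("i suggest", "you should", "you need to", "you must")


def looks_like_lecturing(reply: str, max_commas: int = 2) -> bool:
    """Flag replies that contain advice markers or read like a checklist."""
    text = reply.lower()
    has_marker = any(marker in text for marker in ADVICE_MARKERS)
    # Many commas inside one sentence is a rough sign of an enumerated list.
    most_commas = max((s.count(",") for s in re.split(r"[.!?]", text)), default=0)
    return has_marker or most_commas > max_commas
```

A flag is a prompt to reread the sample, not a verdict; empathetic replies can contain commas and a lecturing one can avoid every marker.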


7. How Data Enters the Training Loop

Core code: LoRA_Demo/train_lora_single.py lines 102–127

def load_jsonl_data(file_path, tokenizer):
    # ...
    text = tokenizer.apply_chat_template(
        obj["messages"],
        tokenize=False,  # Only concatenate strings, do not convert to IDs here
        add_generation_prompt=False,  # Training: the full dialogue already contains assistant content
    )
    data.append({"text": text})
    return Dataset.from_list(data)

7.1 Template Differences: Training vs Inference

| Phase | add_generation_prompt | File Location |
| --- | --- | --- |
| Training | False | train_lora_single.py |
| Inference | True | verify_lora.py lines 159–164 |

During training, the model must see the complete dialogue in template format, assistant reply included; during inference, the assistant start token is appended so the model begins generating.
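The effect of the flag is easiest to see on a toy renderer. The sketch below is a simplified ChatML-style stand-in written for illustration; it is not the chat template that ships with the Qwen tokenizer, which may emit additional markers. `render_chatml` is a hypothetical name, and the `<|im_start|>` / `<|im_end|>` markers are assumed here.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified ChatML-style rendering; NOT the real Qwen chat template."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Inference: open an assistant turn and leave it for the model to fill.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


dialogue = [
    {"role": "system", "content": "You are a gentle companion."},
    {"role": "user", "content": "The house is eerily quiet."},
    {"role": "assistant", "content": "I am here with you."},
]

# Training: all three messages, no generation prompt.
train_text = render_chatml(dialogue)
# Inference: only system + user, then the open assistant turn.
infer_text = render_chatml(dialogue[:2], add_generation_prompt=True)
```

In this sketch the training string ends with the closed assistant turn, while the inference string ends with an open one, which is the asymmetry the table describes.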

7.2 TRL Secondary Processing (Visible in Logs)

all_logs.log lines 32–33:

Adding EOS to train dataset: 1000/1000
Tokenizing train dataset: 1000/1000

Sequences are then truncated by SFTConfig(max_length=512): if a single dialogue exceeds 512 tokens, its tail is cut off. This usually has no impact on this project, whose dialogues are short; if longer responses are added in the future, raise MAX_SEQ_LEN and re-evaluate GPU memory.
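Before adding longer responses, it helps to know how many samples would actually be cut. The sketch below takes the tokenizing function as a parameter so that it stays independent of any particular model; `find_overlength` is a hypothetical helper, not part of the project code.

```python
def find_overlength(texts, tokenize, max_length=512):
    """Return (index, token_count) for every text longer than max_length tokens.

    `tokenize` is any callable that maps a string to a sequence of tokens.
    """
    over = []
    for i, text in enumerate(texts):
        n = len(tokenize(text))
        if n > max_length:
            over.append((i, n))
    return over
```

With a Hugging Face tokenizer the callable could be `lambda t: tokenizer(t)["input_ids"]`. The count is a lower bound on what the trainer sees, since the logs show an EOS token being appended afterwards.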


8. Data Quantity: Is 1000 Enough?

For this project (narrow-domain style SFT):

  • By step 250, loss had dropped from 2.81 to 0.24 (all_logs.log), indicating that 1 epoch is sufficient to learn the main format.
  • After 3 epochs, loss reaches 0.13; the diminishing returns match expectations for small data.

If expanding to knowledge-based tasks (medication guidelines, policy interpretation), 1000 is insufficient; if still emotional companionship style, 1000 + balanced five categories is a reasonable starting point.


9. Pitfalls

Pitfall 1: JSONL is valid but UTF-8 BOM causes failure in parsing the first line
The first sample is silently lost; loss still decreases, so you won’t notice the missing sample. Make sure there is no BOM before calling json.loads (for example, open the file with encoding="utf-8-sig"), or reconcile the log’s “1000” against wc -l.
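The BOM case can be reproduced and fixed in isolation. In Python, json.loads raises a JSONDecodeError on a string that starts with the BOM character, so the sample disappears only if a loader catches that error and moves on. The sketch below shows detection and a BOM-tolerant decode; `has_utf8_bom` and `parse_first_line` are hypothetical helper names.

```python
import codecs
import json


def has_utf8_bom(raw: bytes) -> bool:
    """True if the file content starts with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)


def parse_first_line(raw: bytes):
    """Decode with utf-8-sig, which drops a leading BOM, then parse line 1."""
    text = raw.decode("utf-8-sig")
    return json.loads(text.splitlines()[0])
```

The same effect is available when reading from disk by passing `encoding="utf-8-sig"` to `open`; files without a BOM decode identically under that codec.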

Pitfall 2: Assistant content copy-pasted causing identical user-assistant pairs
SFT will overfit to repetitive sentences and generalize poorly to unseen users. When expanding data with template combinations, deduplicate (see existing_users set in the hexo-cli expansion script).
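Deduplication on the user utterance can be sketched as follows. This is an illustration of the idea only, not the hexo-cli expansion script itself; `dedupe_by_user` is a hypothetical name, and the `existing_users` parameter mirrors the set mentioned above by assumption.

```python
def dedupe_by_user(records, existing_users=None):
    """Keep the first record for each distinct user utterance.

    `existing_users` holds utterances already present in the corpus, so that
    newly generated records do not repeat them.
    """
    seen = set(existing_users or ())
    kept = []
    for rec in records:
        user_text = next(
            (m["content"].strip() for m in rec["messages"] if m["role"] == "user"),
            None,
        )
        # Skip records without a user message and exact repeats.
        if user_text is None or user_text in seen:
            continue
        seen.add(user_text)
        kept.append(rec)
    return kept
```

Exact matching only catches verbatim repeats; near-duplicates produced by template combinations would need a fuzzier comparison.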

Pitfall 3: Mac validation uses system prompt inconsistent with training
verify_lora.py’s SYSTEM_PROMPT must match the JSONL verbatim; if you change one but forget the other, you’ll mistakenly conclude “fine-tuning didn’t work.”
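A verbatim comparison is cheap to automate. The sketch below assumes the records have already been parsed into dictionaries and that the expected prompt is passed in as a plain string; `system_prompt_mismatches` is a hypothetical helper, not part of verify_lora.py.

```python
def system_prompt_mismatches(records, expected_prompt):
    """Return indices of records whose system message differs from expected_prompt.

    A record with no system message, or with more than one, also counts as a
    mismatch. The comparison is exact, including whitespace and punctuation.
    """
    bad = []
    for i, rec in enumerate(records):
        system = [m["content"] for m in rec["messages"] if m["role"] == "system"]
        if system != [expected_prompt]:
            bad.append(i)
    return bad
```

An empty list means all samples agree with the prompt used at validation time; any index returned points at a record to inspect before blaming the fine-tuning.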

Pitfall 4: Measured generalization boundary (Mac MPS validation)
For “I’m afraid to look at the medical report, can’t sleep after seeing it,” the LoRA response might reuse similar phrasing like “afraid of burdening others” rather than precisely naming “medical report.” Be honest in product docs: style transfer succeeded ≠ every sentence is precisely customized.


10. Summary

  1. Data format: messages with three roles, one JSON object per line.
  2. The system prompt is the “personality constitution” for the entire series; must be consistent across training, validation, and vLLM.
  3. Five psychological themes, 200 each, controlling style and coverage.
  4. load_jsonl_data + apply_chat_template is the training entry point; add_generation_prompt is set opposite to inference.
  5. Quality check focus: colloquial user, empathetic assistant, no lecturing, no repetition.

Appendix: load_jsonl_data Line-by-Line Explanation

# Path: LoRA_Demo/train_lora_single.py

for line in tqdm(lines, desc="Loading JSONL data", unit="entry"):
    line = line.strip()
    if not line:
        continue  # Skip empty lines to avoid json.loads errors
    obj = json.loads(line)
    text = tokenizer.apply_chat_template(
        obj["messages"],
        tokenize=False,  # SFTTrainer will tokenize uniformly later
        add_generation_prompt=False,  # Key: training mode, do not add "please speak, assistant" marker
    )
    data.append({"text": text})  # TRL defaults to reading dataset_text_field="text"

