Training Set Design: 1000 JSONL and Elderly Psychological Model

0. Position in the Series

Position in Series	Upstream	This Article’s Output	Downstream
Article 2/10	Article 01: Scenario Definition	Trainable JSONL Specification	Article 05: `apply_chat_template` · Article 08: `SYSTEM_PROMPT`

Data errors show up directly in the loss. Format errors are more insidious: the model learns the arrangement of special tokens but fails to learn empathy. Every rule in this article follows how load_jsonl_data actually consumes the file.

1. What Problem to Solve

SFT is not about “stuffing documents into the model,” but about enabling the model to generate token sequences that closely match the labeled assistant text under the (system, user) conditions.

Data challenges in this project:

Highly consistent style: The assistant generally follows “first empathize + then accompany,” with high repetition in sentence patterns. This is suitable for small-data SFT but can easily be criticized as “template-like”—this is a scenario trade-off, not a data flaw.
User must be colloquial: Average length ~20 Chinese characters, sounding like an elderly person speaking, not a survey.
System must be fixed: All 1000 samples share the same system prompt (see JSONL line 1), consistent with verify_lora.py lines 33–36 and the system prompt in vLLM requests.

2. Implementation Locations

File	Role
`LoRA_Demo/data/elderly_chat.jsonl`	Training corpus, 1000 lines
`LoRA_Demo/train_lora_single.py` → `load_jsonl_data()`	Lines 102–127, JSONL → `Dataset({"text": ...})`
`LoRA_Demo/verify_lora.py` → `SYSTEM_PROMPT`	Must match JSONL system prompt during inference
`hexo-cli/docs/xx.jsonl`	Extended set from same source (generated by semantic training set script; MD5 differs from demo file but structure is identical)

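The training log shows 1000 entries loaded (all_logs.log line 31: 1000/1000). If a local wc -l reports 999, first check whether the last line lacks a trailing newline: wc -l counts newline characters, so that alone explains the difference. If it does not, one training sample is genuinely missing.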
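
The difference between the two counts is easy to reproduce. A minimal sketch, assuming the file fits in memory; `count_records` is a hypothetical helper and not part of the repo:

```python
def count_records(raw: bytes) -> tuple[int, int]:
    """Return (newline_count, non_empty_line_count) for raw JSONL bytes."""
    newline_count = raw.count(b"\n")  # this is the number `wc -l` reports
    records = sum(1 for line in raw.splitlines() if line.strip())
    return newline_count, records


# Two records, but the last line has no trailing newline.
sample = b'{"messages": []}\n{"messages": []}'
print(count_records(sample))  # (1, 2)
```

For the real file, pass `open(path, "rb").read()`. When the two numbers differ by exactly one, the missing trailing newline is the likely cause.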

3. JSONL Single Entry Specification

3.1 Structure

{
  "messages": [
    {"role": "system", "content": "You are a gentle, patient, and empathetic elderly emotional companionship assistant. Speak slowly and softly, empathize more, listen more, affirm more; do not preach, contradict, or rush."},
    {"role": "user", "content": "The children are all busy, no one to talk to all day, the house is eerily quiet."},
    {"role": "assistant", "content": "I truly understand that feeling of emptiness in the quietness. You are not alone; I am always here with you. Let's chat slowly, say whatever you like."}
  ]
}

3.2 Roles and Responsibilities

Role	Required	Role in Training
system	Yes	Global persona; must also be passed during inference
user	Yes	Conditional input, simulates elderly person’s venting
assistant	Yes	Supervised label; SFT learns its token sequence

Do not split into two files or use only user-assistant pairs: Qwen3.5’s chat_template expects the full messages; missing system will cause training/inference distribution mismatch.
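
The three-role requirement can be checked mechanically before training. A minimal sketch, assuming single-turn samples in the order shown in 3.1; `validate_record` and `EXPECTED_ROLES` are hypothetical names and not part of the repo:

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = obj.get("messages") if isinstance(obj, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]
    problems = []
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    for m in messages:
        # A message must be an object with non-blank content.
        if not isinstance(m, dict) or not str(m.get("content") or "").strip():
            problems.append("empty or malformed message")
    return problems
```

Running this over every line and printing the line numbers that return a non-empty list catches a missing system message before it reaches the trainer.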

4. System Prompt Sentence-by-Sentence Design Rationale

Full text (same as repo):

You are a gentle, patient, and empathetic elderly emotional companionship assistant.
Speak slowly and softly,
empathize more, listen more, affirm more;
do not preach, contradict, or rush.

Sentence	Constraint Type	Failure Mode Addressed
gentle, patient, and empathetic	Personality	Cold, robotic assistant tone
slowly and softly	Pacing	Short, imperative commands
empathize more, listen more, affirm more	Behavior	Skipping emotions, jumping to solutions
do not preach, contradict, or rush	Negative constraint	“You should…”, “Don’t think that way”

Negative constraints are especially important for general-purpose models: they default to being “helpful assistants” and easily slip into a lecturing mode.

5. Five Psychological Themes and Sample Distribution

Category	Count	Psychology Covered	User Example Direction
Loneliness & desire for companionship	200	Feeling neglected, no one to talk to	“The house is eerily quiet”, “Not even a partner to talk to”
Health anxiety & fear of death	200	Disease fear, death anxiety	“Blood pressure is up, afraid I might die one day”, “Afraid to look at the medical report”
Fear of being a burden & self-blame	200	Guilt, afraid of burdening children	“Just causing trouble for the kids”, “Shouldn’t have asked them to take leave”
Nostalgia & desire for recognition	200	Sense of worth, being seen	“Life was hard before but happy”, “No one wants to hear about my past”
Low mood & desire to be needed	200	Depressive tendency, feeling useless	“No energy to live”, “No one needs me”

When expanding data, balance across categories to avoid the model learning only empty phrases like “Don’t be lonely.”
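
A balance check can be sketched as follows. The JSONL format in this article has no category field, so this assumes the labels are kept in a separate list (one label per sample); `category_imbalance` and its `tolerance` parameter are hypothetical:

```python
from collections import Counter


def category_imbalance(labels: list[str], tolerance: float = 0.1) -> dict[str, int]:
    """Return categories whose count deviates from the mean by more than `tolerance`."""
    counts = Counter(labels)
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    return {
        name: n for name, n in counts.items()
        if abs(n - mean) > tolerance * mean
    }
```

An empty result means every observed category is within 10% of the mean. A category with no samples at all does not appear in the labels, so it has to be checked against the list of five themes separately.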

6. Good Samples / Bad Samples (Quality Check)

Qualified assistant (from JSONL line 2):

`user: Now I'm all alone, not even a partner to talk to.`
`assistant: Being alone for a long time can feel lonely. Would you like to tell me what you usually enjoy doing? I'm all ears.`

Features: Acknowledge emotion → light open-ended invitation → no imperative sentences.

Unqualified assistant (do not include in training set):

`assistant: I suggest you participate in community activities, use video call software, cultivate hobbies, and seek psychological counseling if necessary.`

Features: Checklist of suggestions, no emotional alignment with “you”, sounds like a health app rather than companionship.
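
A first-pass filter for such replies can be a plain phrase match. This is a rough heuristic sketch, not the project's quality check: `ADVICE_MARKERS` is an illustrative English list based on the examples in this article, and a real corpus would need markers in its own language plus manual review of whatever is flagged.

```python
ADVICE_MARKERS = ("i suggest", "you should", "you need to", "don't think that way")


def looks_like_lecturing(reply: str) -> bool:
    """Heuristic: True if the reply contains an advice-style marker phrase."""
    lowered = reply.lower().replace("’", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in ADVICE_MARKERS)
```

The unqualified sample above is flagged because it opens with "I suggest", while the qualified sample from section 6 passes.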

7. How Data Enters the Training Loop

Core code: LoRA_Demo/train_lora_single.py lines 102–127

def load_jsonl_data(file_path, tokenizer):
    # ... (opening the file and reading `lines` is elided here)
    for line in lines:
        # ... (blank-line skipping elided; see the appendix)
        obj = json.loads(line)
        text = tokenizer.apply_chat_template(
            obj["messages"],
            tokenize=False,              # Only build the string; do not convert to IDs here
            add_generation_prompt=False, # Training: the full dialogue already contains assistant content
        )
        data.append({"text": text})
    return Dataset.from_list(data)

7.1 Template Differences: Training vs Inference

Phase	`add_generation_prompt`	File Location
Training	`False`	`train_lora_single.py`
Inference	`True`	`verify_lora.py` lines 159–164

During training, the model must see the complete dialogue, including the assistant reply it is supervised on; during inference, the template appends the assistant start marker so that the model begins generating the reply.
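
The effect of the flag is easiest to see on the rendered strings. The sketch below uses a toy ChatML-style renderer as a stand-in for `tokenizer.apply_chat_template`; `render_chatml` is hypothetical, and the real template ships with the tokenizer and may differ in detail.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool) -> str:
    """Toy ChatML-style renderer; a stand-in for tokenizer.apply_chat_template."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"  # open header; the model writes the rest
    return text


system = {"role": "system", "content": "You are a gentle companion."}
user = {"role": "user", "content": "The house is so quiet."}
reply = {"role": "assistant", "content": "I'm here with you."}

# Training: full dialogue, no trailing generation header.
train_text = render_chatml([system, user, reply], add_generation_prompt=False)
# Inference: no assistant message yet, header left open for generation.
infer_text = render_chatml([system, user], add_generation_prompt=True)
```

In this toy version `train_text` starts with `infer_text`, which is the property that matters: the inference prompt is a prefix of what the model saw during training.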

7.2 TRL Secondary Processing (Visible in Logs)

all_logs.log lines 32–33:

`Adding EOS to train dataset: 1000/1000`
`Tokenizing train dataset: 1000/1000`

Then truncated by SFTConfig(max_length=512). If a single dialogue exceeds 512 tokens in length, the tail is truncated—usually no impact on this project (short sentence dialogues); if longer responses are added in the future, raise MAX_SEQ_LEN and re-evaluate GPU memory.
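
Whether any sample would be truncated can be checked before training. A minimal sketch; `find_overlong` is a hypothetical helper, and `encode` is any callable that maps text to a token sequence (for a Hugging Face tokenizer, `tokenizer.encode`). Counts may differ slightly from what TRL sees, for example because of the appended EOS.

```python
from typing import Callable, Sequence


def find_overlong(
    texts: Sequence[str],
    encode: Callable[[str], Sequence[int]],
    max_length: int = 512,
) -> list[tuple[int, int]]:
    """Return (index, token_count) for every text longer than max_length tokens."""
    overlong = []
    for i, text in enumerate(texts):
        n_tokens = len(encode(text))
        if n_tokens > max_length:
            overlong.append((i, n_tokens))
    return overlong
```

An empty result for the rendered "text" column means the 512-token limit is not cutting any assistant reply.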

8. Data Quantity: Is 1000 Enough?

For this project (narrow-domain style SFT):

By step 250 the loss had dropped from 2.81 to 0.24 (all_logs.log), which indicates that 1 epoch is enough to learn the main format.
After 3 epochs the loss reaches 0.13; the returns diminish, as expected for small data.

If expanding to knowledge-based tasks (medication guidelines, policy interpretation), 1000 is insufficient; if still emotional companionship style, 1000 + balanced five categories is a reasonable starting point.

9. Pitfalls

Pitfall 1: JSONL is valid but UTF-8 BOM causes failure in parsing the first line
If the loader catches the resulting parse error and skips the line, the first sample is silently lost: loss still decreases and nothing points to the missing sample. Strip the BOM before calling json.loads, or reconcile the “1000” in the log with wc -l.
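
The failure and the fix can be reproduced on a few bytes. A minimal sketch using only the standard library; the sample content is made up for illustration:

```python
import json

raw = b'\xef\xbb\xbf{"messages": []}\n{"messages": []}\n'  # file bytes with a UTF-8 BOM

# Plain utf-8 keeps the BOM as U+FEFF at the start of the first line,
# and str.strip() does not remove it, so json.loads rejects the line.
first_line = raw.decode("utf-8").splitlines()[0].strip()
try:
    json.loads(first_line)
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

# utf-8-sig drops a leading BOM if present and is harmless otherwise.
records = [
    json.loads(line)
    for line in raw.decode("utf-8-sig").splitlines()
    if line.strip()
]
```

When reading from disk, the equivalent is `open(path, encoding="utf-8-sig")`.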

Pitfall 2: Assistant content copy-pasted causing identical user-assistant pairs
SFT will overfit to repetitive sentences and generalize poorly to unseen users. When expanding data with template combinations, deduplicate (see existing_users set in the hexo-cli expansion script).
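
Deduplication on the (user, assistant) pair can be sketched as below. This is not the expansion script's implementation, which according to the text tracks an existing_users set; `dedupe_pairs` is a hypothetical helper that assumes every message has "role" and "content" keys.

```python
def dedupe_pairs(records: list[dict]) -> list[dict]:
    """Keep the first record for each distinct (user, assistant) text pair."""
    seen = set()
    unique = []
    for rec in records:
        by_role = {m["role"]: m["content"].strip() for m in rec["messages"]}
        key = (by_role.get("user"), by_role.get("assistant"))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Keying on the user text alone, as the existing_users set suggests, is stricter: it also drops samples that pair the same user sentence with different replies.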

Pitfall 3: Mac validation uses system prompt inconsistent with training
verify_lora.py’s SYSTEM_PROMPT must match JSONL verbatim; if you change one but forget the other, you’ll mistakenly conclude “fine-tuning didn’t work.”
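
The verbatim match can be asserted instead of eyeballed. A minimal sketch; `mismatched_system_prompts` is a hypothetical helper, and `expected` would be the SYSTEM_PROMPT constant imported from the validation script:

```python
def mismatched_system_prompts(records: list[dict], expected: str) -> list[int]:
    """Return indices of records whose system prompt differs from `expected`."""
    bad = []
    for i, rec in enumerate(records):
        system = next(
            (m["content"] for m in rec["messages"] if m["role"] == "system"), None
        )
        if system != expected:  # exact comparison: whitespace and punctuation count
            bad.append(i)
    return bad
```

An empty list means training and validation use the same prompt string. Records with no system message are reported as mismatches as well.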

Pitfall 4: Measured generalization boundary (Mac MPS validation)
For “I’m afraid to look at the medical report, can’t sleep after seeing it,” the LoRA response might reuse similar phrasing like “afraid of burdening others” rather than precisely naming “medical report.” Be honest in product docs: style transfer succeeded ≠ every sentence is precisely customized.

10. Summary

Data format: messages with three roles, one JSON object per line.
The system prompt is the “personality constitution” for the entire series; must be consistent across training, validation, and vLLM.
Five psychological themes, 200 each, controlling style and coverage.
load_jsonl_data + apply_chat_template is the training entry point; add_generation_prompt is False in training and True in inference.
Quality check focus: colloquial user, empathetic assistant, no lecturing, no repetition.

Appendix: `load_jsonl_data` Line-by-Line Explanation

# Path: LoRA_Demo/train_lora_single.py

for line in tqdm(lines, desc="Loading JSONL data", unit="entry"):
    line = line.strip()
    if not line:
        continue                    # Skip empty lines to avoid json.loads errors
    obj = json.loads(line)
    text = tokenizer.apply_chat_template(
        obj["messages"],
        tokenize=False,              # SFTTrainer will tokenize uniformly later
        add_generation_prompt=False, # Key: training mode, do not add "please speak, assistant" marker
    )
    data.append({"text": text})    # TRL defaults to reading dataset_text_field="text"

Article	Link
Previous	01 · Why Do This
Next	03 · LoRA Principles
Index	README
