0. Series Loop (Follow Along Without Opening the Source Code)

End-to-End Chain: Vue frontend → api/routes/chat.py → Guide multi-turn SSE → run_analysis_pipeline (parse→analyze→match→report) → tools/pdf_exporter PDF.
This article: 7/17 · Error Handling Loop · No Crash

| Stage | User Visible | Code Entry | Article |
|---|---|---|---|
| Create session | Welcome message | POST /api/sessions | 09 |
| Multi-turn conversation | SSE streaming | chat/stream → run_guide_single_turn | 06, 14 |
| Info sufficient | Start analysis | _run_analysis_background | 05, 07 |
| Resume parsing | Progress 30% | run_resume_parser | 12 |
| Profile/RIASEC | Progress 50% | run_profile_analyzer | 03, 13 |
| Career matching | Progress 70% | run_career_matcher | 02 |
| Report | Progress 90% | run_reporter | 11 |
| Download PDF | File | GET …/report/pdf | 11, 15 |

| | Description |
|---|---|
| Before reading | Article 05 routing, Article 08 LLM calls |
| After reading | List fallback chains when Ollama is unavailable |
| Next loop | Article 09: Background task _run_analysis_background (Article 8) |

Full series loop index: SERIES-LOOP.md

1. What Problem to Solve

The iCan top-level workflow chains five LLM-dependent nodes (Guide → ResumeParser → ProfileAnalyzer → CareerMatcher → Reporter). If any step times out or returns invalid JSON, or if the Ollama/cloud API goes down, then without isolation the entire analysis returns a 500 and the conversation the user has already entered is wasted.

The project implements error handling at three levels:

  1. Before invocation: llm/providers.py’s check_ollama_available probes LLM reachability;
  2. During invocation: invoke_llm / invoke_llm_with_json apply a 60s timeout, and llm/parsers.py performs multi-strategy JSON extraction;
  3. After invocation: try/except on each node in workflow.py, plus run_analysis_pipeline’s phased catch and the _generate_fallback_report rule-engine fallback.

[Figure: Three-layer error handling architecture]

2. Strategy 1: Health Check + 30-Second Cache

Before running the four analysis agents, run_analysis_pipeline first calls check_ollama_available() (function name is legacy; it actually probes the OpenAI-compatible /chat/completions at settings.LLM_BASE_URL, not limited to Ollama).

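# llm/providers.py
import time as _time  # implied by the _time.time() call below

# Optimistic default: assume the LLM is reachable until a probe says otherwise.
_ollama_cache = {"available": True, "last_check": 0}

async def check_ollama_available() -> bool:
    now = _time.time()
    # Serve the cached result if the last probe was less than 30 seconds ago.
    if now - _ollama_cache["last_check"] < 30:
        return _ollama_cache["available"]

    _ollama_cache["last_check"] = now
    base_url = settings.LLM_BASE_URL.rstrip("/")
    # ... (client setup elided in the original excerpt)
    resp = await client.post(
        f"{base_url}/chat/completions",
        json={
            "model": settings.LLM_MODEL_CHAT,
            "messages": [{"role": "user", "content": "hi"}],
            "max_tokens": 5,  # keep the probe as cheap as possible
        },
    )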

Design highlights:

  • 30-second cache: Avoids sending a probe request for every session, which reduces latency and quota overhead;
  • max_tokens=5: Minimizes probe cost;
  • A failed probe caches False: For the next 30 seconds, requests fall back immediately instead of each waiting for its own timeout.

When the LLM is unavailable, workflow.py’s run_analysis_pipeline skips the four LLM agents, switches to _regex_quick_profile + _generate_fallback_report, and marks ollama_unavailable: True in the DB.
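The caching behavior described above can be sketched in isolation. This is a minimal illustration, not the project's code: CachedHealthCheck, its probe argument, and the injectable clock are hypothetical names introduced here so the TTL logic can be exercised without a real LLM endpoint.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class CachedHealthCheck:
    """Cache the result of an async probe for `ttl` seconds (hypothetical helper)."""

    def __init__(
        self,
        probe: Callable[[], Awaitable[bool]],
        ttl: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl
        self._clock = clock
        self._available = True  # optimistic default, as in _ollama_cache
        self._last_check: Optional[float] = None  # None means "never probed"

    async def check(self) -> bool:
        now = self._clock()
        if self._last_check is not None and now - self._last_check < self._ttl:
            return self._available  # fresh enough: skip the network round trip
        # Stamp the time before probing, so a failing probe is also cached.
        self._last_check = now
        try:
            self._available = bool(await self._probe())
        except Exception:
            # Any probe error counts as "unavailable" for the next ttl seconds.
            self._available = False
        return self._available
```

Because the timestamp is written before the probe runs, a failed probe is cached like a successful one, which is what lets later requests within the window fall back without waiting for another timeout.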

3. Strategy 2: asyncio.wait_for Hard Timeout

llm/providers.py’s invoke_llm wraps all Chat calls with a 60-second upper limit:

response = await asyncio.wait_for(model.ainvoke(processed, **kwargs), timeout=60)

On timeout, it raises TimeoutError("AI model response timeout, please retry later"). get_chat_model() also sets request_timeout=90 at the HTTP layer, so the 60s application-layer limit cuts the call off first.

The API layer in api/routes/chat.py wraps another 90-second wait_for around run_guide_chat, offering users a friendlier “please retry later” message instead of a bare 500.

Empirical ranges (not hard rules): normal replies 2–5s, ProfileAnalyzer 10–30s, Reporter chapter generation may take 30–50s; over 60s is treated as abnormal.
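The two timeout layers can be sketched as follows. This is an illustration under assumptions: invoke_with_timeout and reply_or_retry_hint are hypothetical names, and the coroutine factory stands in for the real model call; only asyncio.wait_for and the error message come from the text above.

```python
import asyncio
from typing import Any, Awaitable, Callable

LLM_TIMEOUT_SECONDS = 60  # application-layer limit quoted in the article


async def invoke_with_timeout(
    call: Callable[[], Awaitable[Any]],
    timeout: float = LLM_TIMEOUT_SECONDS,
) -> Any:
    """Run one LLM call under a hard deadline and raise a readable error."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout)
    except asyncio.TimeoutError as exc:
        # Re-raise with a message that is safe to surface to callers.
        raise TimeoutError("AI model response timeout, please retry later") from exc


async def reply_or_retry_hint(
    call: Callable[[], Awaitable[str]],
    timeout: float = LLM_TIMEOUT_SECONDS,
) -> str:
    """API-layer wrapper: turn a timeout into a friendly reply instead of a 500."""
    try:
        return await invoke_with_timeout(call, timeout)
    except TimeoutError:
        return "The model is taking too long, please retry later."
```

wait_for cancels the pending call when the deadline passes, so a stuck request does not keep running in the background after the user has already been told to retry.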

4. Strategy 3: JSON Four-Level Degradation Parsing

Structured agents (ResumeParser, CareerMatcher, etc.) go through invoke_llm_with_json: it first tries response_format=json_object, falls back to plain text if that is unsupported, then uses llm/parsers.py’s parse_json_from_text:

Strategy 1: ```json ... ``` code block
↓ fail
Strategy 2: normal ``` ... ``` (starts with { or [)
↓ fail
Strategy 3: regex match outermost { ... }
↓ fail
Strategy 4: json.loads full text
↓ fail
Return {} (no exception thrown)

parse_json_from_text catches any JSONDecodeError and returns {}, ensuring the upstream always gets a dict. invoke_llm_with_json still raises ValueError when it receives {}: it serves “business must have JSON” scenarios, a different responsibility from the parser’s “try to extract”.
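The four strategies can be sketched as a chain of candidates tried in order. This is a reconstruction from the description above, not the contents of llm/parsers.py: the regular expressions and the _candidates helper are assumptions, and the fence marker is built at runtime only so the example contains no literal code fence.

```python
import json
import re
from typing import Iterator, Union

FENCE = "`" * 3  # three backticks, built at runtime


def _candidates(text: str) -> Iterator[str]:
    """Yield substrings that might be JSON, most specific strategy first."""
    # Strategy 1: a fenced block tagged "json".
    match = re.search(FENCE + r"json\s*(.*?)" + FENCE, text, re.DOTALL)
    if match:
        yield match.group(1)
    # Strategy 2: any fenced block whose body starts with { or [.
    match = re.search(FENCE + r"\s*(.*?)" + FENCE, text, re.DOTALL)
    if match and match.group(1).lstrip().startswith(("{", "[")):
        yield match.group(1)
    # Strategy 3: the outermost { ... } span anywhere in the text.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        yield match.group(0)
    # Strategy 4: the full text as-is.
    yield text


def parse_json_from_text(text: str) -> Union[dict, list]:
    """Return the first candidate that parses to a dict or list, else {}."""
    for candidate in _candidates(text):
        try:
            parsed = json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue  # degrade to the next strategy
        if isinstance(parsed, (dict, list)):
            return parsed
    return {}  # never raises; the caller decides whether {} is acceptable
```

Keeping the "is {} acceptable?" decision out of the parser mirrors the split of responsibilities described above: the parser only extracts, and the caller enforces the business requirement.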

5. Strategy 4: Node-Level Exception Isolation

In workflow.py, the five top-level nodes each have try/except. On failure, they don’t raise but write safe defaults, allowing LangGraph to continue (or at least return a displayable state):

| Node | Returns on exception |
|---|---|
| guide_node | Keep original conversation_history, needs_more_info=True |
| resume_parser_node | structured_profile={} |
| profile_analyzer_node | personal_profile={} |
| career_matcher_node | career_matches=[] |
| reporter_node | Fixed Markdown failure text |

Example of reporter_node fallback:

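except Exception as e:
    # Log message: "[reporter_node] report output node raised an exception"
    logger.error("[reporter_node] 报告输出节点执行异常: %s", e, exc_info=True)
    return {
        # Fixed Markdown: "# iCan Career Planning Report / Report generation failed, please retry later."
        "final_report": "# iCan 职业规划报告\n\n报告生成失败,请稍后重试。",
        "current_agent": "reporter",
        # Message: "Report output node exception: <error>"
        "workflow_messages": [f"报告输出节点异常: {str(e)}"],
    }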

Comparison: Without isolation, Reporter error → whole graph ainvoke fails → CLI/API 500; with isolation, users at least see failure description or partial sections.

When route_after_guide fails, it returns resume_parser_node. This is a “fail-open advance” at the routing layer, in contrast to the guide node’s fail-closed behavior (keep asking for info), because the routing layer fears dead loops more.
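The per-node pattern can also be expressed once as a decorator. This is a sketch of the idea only: the project writes an explicit try/except in each node, and the isolated decorator and the failing career_matcher_node body below are hypothetical.

```python
import asyncio
import functools
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger("workflow_sketch")

State = Dict[str, Any]
Node = Callable[[State], Awaitable[State]]


def isolated(node_name: str, defaults: State) -> Callable[[Node], Node]:
    """Wrap an async node so any exception becomes a safe default update."""

    def decorator(fn: Node) -> Node:
        @functools.wraps(fn)
        async def wrapper(state: State) -> State:
            try:
                return await fn(state)
            except Exception as exc:
                logger.error("[%s] node failed: %s", node_name, exc, exc_info=True)
                return {
                    **defaults,  # e.g. career_matches=[] from the table above
                    "current_agent": node_name,
                    "workflow_messages": [f"{node_name} node error: {exc}"],
                }

        return wrapper

    return decorator


@isolated("career_matcher", {"career_matches": []})
async def career_matcher_node(state: State) -> State:
    # Stand-in body that always fails, to show the fallback path.
    raise RuntimeError("LLM returned invalid JSON")
```

The error text is carried in workflow_messages, so a caller can tell a degraded result apart from a genuinely empty one.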

6. Strategy 5: run_analysis_pipeline Phased Error Handling

Online report generation mainly goes through run_analysis_pipeline (called by api/routes/chat.py, upload.py, report_gen.py), not through the top-level LangGraph guide loop. Its error handling is “each phase independent try, continue with empty data on failure”:

# workflow.py (simplified flow)
try:
    parser_result = await run_resume_parser(parser_state)
    structured_profile = parser_result.get("structured_profile", {})
except Exception as parser_err:
    structured_profile = {}

if not structured_profile:
    structured_profile = {"basic_info": {"raw_text": combined_text[:500], "source": "fallback"}}

try:
    analyzer_result = await run_profile_analyzer(analyzer_state)
except Exception as analyzer_err:
    analyzer_result = {}

# matcher, reporter similarly...

When the Reporter phase fails, it doesn’t return an empty string; instead, it constructs Markdown containing a summary of personal_profile JSON and appends reporter_err at the end for easier OPS log correlation.

When LLM is completely unavailable, the entire LLM chain is skipped, and _generate_fallback_report outputs a rule-engine report with a ⚠️ note:

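# Note text: "⚠️ Note: the AI model is temporarily unavailable; this report was generated quickly by the rule engine."
sections.append("> ⚠️ 注意:AI 模型暂不可用,本报告基于规则引擎快速生成。")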

An outer catch still exists: it logs the error, calls ws_manager.send_error to notify the frontend, then re-raises. It is meant for DB/session-level disasters, not single-agent failures.
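The difference between the inner phase-level catches and the outer catch can be sketched as below. All names here (run_phase, run_pipeline, and the parse, report, save, notify_error callables) are hypothetical stand-ins; notify_error plays the role that ws_manager.send_error plays in the project.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("pipeline_sketch")


async def run_phase(
    name: str,
    phase: Callable[[Any], Awaitable[Any]],
    arg: Any,
    fallback: Any,
) -> Any:
    """Inner layer: one phase fails, the pipeline continues with `fallback`."""
    try:
        return await phase(arg)
    except Exception as exc:
        logger.error("[%s] failed, continuing with fallback: %s", name, exc)
        return fallback


async def run_pipeline(text, parse, report, save, notify_error) -> str:
    """Outer layer: anything not isolated is logged, reported, and re-raised."""
    try:
        profile = await run_phase("parser", parse, text, fallback={})
        final_report = await run_phase(
            "reporter", report, profile,
            fallback="# Report\n\nReport generation failed.",
        )
        await save(final_report)  # infrastructure call: deliberately not isolated
        return final_report
    except Exception as exc:
        logger.error("[pipeline] unrecoverable error: %s", exc)
        await notify_error(str(exc))  # tell the frontend before propagating
        raise
```

An agent failure degrades the result, while an infrastructure failure still surfaces as an error, because continuing after a failed save would silently lose the report.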

7. Interaction with Loop Limits (Article 5)

Error handling also includes guards against infinite loops (see Article 5 for details):

  • agents/guide.py should_continue: loop_count >= 8;
  • workflow.py route_after_guide: user_msg_count >= 3;
  • recursion_limit: subgraph 15, full workflow 50.

Exceeding a loop limit is essentially “forced advancement”: it prevents error + retry from forming a logical dead loop in the graph.
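A routing function that combines the message-count limit with the fail-open behavior might look like the sketch below. The threshold of 3 and the resume_parser_node target come from the text; the "guide_node" return value, the state keys, and the way user messages are counted are assumptions made for illustration.

```python
from typing import Any, Dict

MAX_USER_MESSAGES = 3  # threshold quoted above for route_after_guide


def route_after_guide(state: Dict[str, Any]) -> str:
    """Return the name of the next node after the guide step."""
    try:
        history = state.get("conversation_history", [])
        user_msg_count = sum(1 for msg in history if msg.get("role") == "user")
        if user_msg_count >= MAX_USER_MESSAGES:
            return "resume_parser_node"  # forced advancement: stop asking
        if state.get("needs_more_info", True):
            return "guide_node"  # assumed name for "keep asking"
        return "resume_parser_node"
    except Exception:
        # Fail-open: a broken state must not trap the user in the guide loop.
        return "resume_parser_node"
```

The count check comes before the needs_more_info check, so a guide node that keeps failing closed (needs_more_info=True) still cannot hold the conversation past the limit.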

8. Error Handling Layer Overview

Request enters run_analysis_pipeline / run_workflow

[1] check_ollama_available → unavailable → _regex_quick_profile + _generate_fallback_report

[2] invoke_llm wait_for 60s → TimeoutError → caught by node/API layer

[3] parse_json_from_text four layers → fail → {}

[4] Each workflow node try/except → safe defaults

[5] Pipeline phased try → empty dict/list continue + reporter summary fallback

[6] Loop/recursion_limit → forced handoff / resume_parser

Return final_report (complete, partial, or rule-engine version)

9. Pitfalls and Edge Cases

  1. Misleading name check_ollama_available
    It probes the current LLM_BASE_URL (could be DeepSeek, OpenAI, Ollama), not only Ollama. After switching to cloud in .env, if Ollama is down but cloud is up, it still caches True/False based on cloud result.

  2. Health check default _ollama_cache["available"] = True
    On process startup, before first probe, the first pipeline assumes available; if actually unavailable, it waits for the first POST failure to cache False. For high-availability scenarios, consider warm-up probing at startup.

  3. Node isolation “empty dict continue” yields thin reports
    When profile_analyzer fails, personal_profile has many empty fields, but Reporter still runs — users see “a report but content is thin”, better than 500, but should be distinguished via workflow_messages or progress prompts in frontend.

  4. run_guide_chat exception has separate fallback
    Returns fixed message “Sorry, something went wrong. Could you say that again?”, with is_info_sufficient=False, won’t accidentally trigger run_analysis_pipeline.

  5. Reporter chapter generation uses get_chat_model()
    Different from get_light_model(); do not assume Reporter has switched to mini model based on old comments (see Article 8 call table).
    On the error-handling path, Reporter may still be the slowest and most timeout-prone node; rule-engine degradation only covers “entire LLM unavailable”, not “only Reporter timed out”.
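For pitfall 3, one way to let the frontend tell a thin report from a complete one is to report which analyzer fields came back empty. This is a sketch only: the field names are taken from the personal_profile assembly in the appendix, while missing_analysis_fields, degradation_notice, and the shape of the returned dict are hypothetical.

```python
from typing import Any, Dict, List, Optional

# Fields that run_analysis_pipeline copies from the analyzer result.
ANALYZER_FIELDS = (
    "ability_model",
    "work_style",
    "personality_traits",
    "career_values",
    "riasec_scores",
    "strengths",
    "weaknesses",
)


def missing_analysis_fields(personal_profile: Dict[str, Any]) -> List[str]:
    """List analyzer fields that are absent or empty in the profile."""
    return [name for name in ANALYZER_FIELDS if not personal_profile.get(name)]


def degradation_notice(personal_profile: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a small payload for the frontend, or None if nothing is missing."""
    missing = missing_analysis_fields(personal_profile)
    if not missing:
        return None
    return {"degraded": True, "missing_fields": missing}
```

A payload like this could be sent with the progress prompts or stored next to final_report, so the UI can label the report as partial instead of presenting it as complete.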

10. Summary

  • Before invocation: llm/providers.py cached health check, workflow.py rule-engine report when unavailable.
  • During invocation: 60s timeout + llm/parsers.py multi-strategy JSON extraction.
  • After invocation: five workflow nodes isolated individually; run_analysis_pipeline phased catch, Reporter failure still produces summary version.
  • The goal is not “never fail”, but failures that are visible, degradable, and do not crash the whole graph.
  • Next article (Article 8) expands on get_chat_model / get_light_model and unified LLM calling interface.

Appendix: Key Source Code (Line-by-Line Annotations)

The following code is excerpted from the iCan implementation. Each line has a comment above it so you can follow along without the repository.
Generation command: python3 bin/build-ican-annotated-snippets.py

guide_node exception return

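# ========== guide_node exception return ==========
# Source file: workflow.py, lines 107-114

# L107: catch the exception so the whole graph/request does not crash
except Exception as e:
    # L108: log the error ("[guide_node] conversation guide node raised an exception") to help troubleshoot node inputs/outputs in production
    logger.error("[guide_node] 对话引导节点执行异常: %s", e, exc_info=True)
    # L109: return the fields this node merges into state (LangGraph merges them)
    return {
        # L110: multi-turn conversation list; each element is {role, content}
        "conversation_history": state.get("conversation_history", []),
        # L111: mark guide as the current agent
        "current_agent": "guide",
        # L112: whether to stay in the Guide loop; False means the flow may move on to resume_parser
        "needs_more_info": True,
        # L113: record the failure ("conversation guide node exception: <error>")
        "workflow_messages": [f"对话引导节点异常: {str(e)}"],
    # L114: end of the returned dict
    }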

Ollama unavailable → rule-based report

# ========== Ollama unavailable → rule-based report ==========
# Source file: workflow.py, lines 734-767

# L734: import the health-check helper
from ican.llm.providers import check_ollama_available
# L735: probe the LLM service; on failure, fall back to the rule-engine report
ollama_ok = await check_ollama_available()
# L736: branch on the probe result
if not ollama_ok:
    # L737: warn "Ollama unavailable, using the quick rule engine to generate the report"
    logger.warning("[run_analysis_pipeline] Ollama 不可用,使用快速规则引擎生成报告")
    # L738: build a quick profile from the combined text
    structured_profile = _regex_quick_profile(combined_text)
    # L739: generate the rule-engine report
    final_report = _generate_fallback_report(structured_profile, combined_text)
    # L740: log "quick report generated" together with its length
    logger.info("[run_analysis_pipeline] 快速报告生成完成,长度=%d", len(final_report))
    # L741: start a try block; the except below is the safety net
    try:
        # L742: import the DB session factory
        from ican.db.session import get_db_session
        # L743: import the session repository
        from ican.db.repository import SessionRepository
        # L744: obtain a DB session
        db = next(get_db_session())
        # L745: inner try so the session is always closed
        try:
            # L746: create the repository
            repo = SessionRepository(db)
            # L747: persist the session
            repo.save_session(
                # L748: session identifier
                session_id=session_id,
                # L749: fall back to "system" when there is no user ID
                user_id=user_id or "system",
                # L750: mark the session as completed
                status="completed",
                # L751: the current stage is the report
                current_stage="report",
                # L752: JSON field that stores conversation history, intermediate results, final_report, etc.
                workflow_data={
                    # L753: the quick profile
                    "structured_profile": structured_profile,
                    # L754: the rule-engine report
                    "final_report": final_report,
                    # L755: flag that this report was produced without the LLM
                    "ollama_unavailable": True,
                # L756: end of workflow_data
                },
            # L757: end of the save_session call
            )
        # L758: cleanup that runs whether or not the save succeeded
        finally:
            # L759: close the DB session
            db.close()
    # L760: catch the exception so the whole graph/request does not crash
    except Exception as db_err:
        # L761: log "failed to save the quick report"
        logger.error("[run_analysis_pipeline] 保存快速报告失败: %s", db_err)
    # L762: return the fallback result early; the LLM phases are skipped
    return {
        # L763: the quick profile
        "structured_profile": structured_profile,
        # L764: no analyzer output on the fallback path
        "personal_profile": {},
        # L765: no career matches on the fallback path
        "career_matches": [],
        # L766: the rule-engine report
        "final_report": final_report,
    # L767: end of the returned dict
    }

Pipeline phased try/except

# ========== pipeline phased try/except ==========
# Source file: workflow.py, lines 769-818

# L769: build the parser input from the combined text
parser_state = {"raw_input": combined_text, "input_type": "text"}
# L770: start a try block; the except below is the safety net
try:
    # L771: run the resume parser
    parser_result = await run_resume_parser(parser_state)
    # L772: take the structured profile, defaulting to {}
    structured_profile = parser_result.get("structured_profile", {})
# L773: catch the exception so the whole graph/request does not crash
except Exception as parser_err:
    # L774: log "resume parsing failed, continuing with empty data"
    logger.error("[run_analysis_pipeline] 简历解析失败,使用空数据继续: %s", parser_err)
    # L775: continue with an empty profile
    structured_profile = {}

# L777: branch: the structured profile is empty
if not structured_profile or len(structured_profile) == 0:
    # L778: warn "structured profile is empty, trying to build basic data from the raw text"
    logger.warning("[run_analysis_pipeline] 结构化画像为空,尝试从原始文本构建基础数据")
    # L779: minimal fallback profile holding the first 500 characters of the raw text
    structured_profile = {"basic_info": {"raw_text": combined_text[:500], "source": "fallback"}}

# L781: start a try block; the except below is the safety net
try:
    # L782: build the analyzer input
    analyzer_state = {"structured_profile": structured_profile}
    # L783: run the profile analyzer
    analyzer_result = await run_profile_analyzer(analyzer_state)
# L784: catch the exception so the whole graph/request does not crash
except Exception as analyzer_err:
    # L785: log "profile analysis failed, continuing with empty data"
    logger.error("[run_analysis_pipeline] 个人分析失败,使用空数据继续: %s", analyzer_err)
    # L786: continue with an empty analyzer result
    analyzer_result = {}

# L788: assemble personal_profile; every analyzer field has a safe default
personal_profile = {
    # L789: the structured profile from the parser phase
    "structured_profile": structured_profile,
    # L790: ability model, {} if missing
    "ability_model": analyzer_result.get("ability_model", {}),
    # L791: work style, {} if missing
    "work_style": analyzer_result.get("work_style", {}),
    # L792: personality traits, {} if missing
    "personality_traits": analyzer_result.get("personality_traits", {}),
    # L793: career values, {} if missing
    "career_values": analyzer_result.get("career_values", {}),
    # L794: RIASEC scores, {} if missing
    "riasec_scores": analyzer_result.get("riasec_scores", {}),
    # L795: strengths, [] if missing
    "strengths": analyzer_result.get("strengths", []),
    # L796: weaknesses, [] if missing
    "weaknesses": analyzer_result.get("weaknesses", []),
    # L797: overall summary read from the analyzer's nested structured_profile, "" if missing
    "overall_summary": analyzer_result.get("structured_profile", {}).get("overall_summary", ""),
# L798: end of personal_profile
}

# L800: start a try block; the except below is the safety net
try:
    # L801: build the matcher input
    matcher_state = {"personal_profile": personal_profile}
    # L802: run the career matcher
    matcher_result = await run_career_matcher(matcher_state)
    # L803: take the recommended paths, defaulting to []
    career_matches = matcher_result.get("recommended_paths", [])
# L804: catch the exception so the whole graph/request does not crash
except Exception as matcher_err:
    # L805: log "career matching failed, continuing with empty data"
    logger.error("[run_analysis_pipeline] 职业匹配失败,使用空数据继续: %s", matcher_err)
    # L806: continue with no matches
    career_matches = []

# L808: build the reporter input
reporter_state = {
    # L809: the assembled profile
    "personal_profile": personal_profile,
    # L810: the matches (possibly empty)
    "career_matches": career_matches,
    # L811: an empty action plan
    "action_plan": {},
# L812: end of reporter_state
}
# L813: start a try block; the except below is the safety net
try:
    # L814: run the reporter
    reporter_result = await run_reporter(reporter_state)
    # L815: take the final report, defaulting to ""
    final_report = reporter_result.get("final_report", "")
# L816: catch the exception so the whole graph/request does not crash
except Exception as reporter_err:
    # L817: log "report generation failed"
    logger.error("[run_analysis_pipeline] 报告生成失败: %s", reporter_err)
    # L818: fallback Markdown: title "Career Planning Report", a note that generation hit a problem, a "Personal Profile Summary" section with the first 2000 characters of the personal_profile JSON, and the error ("Full report generation failed: <error>") at the end
    final_report = f"# 职业规划报告\n\n基于您的简历分析,报告生成过程中遇到问题。\n\n## 个人画像摘要\n\n{json.dumps(personal_profile, ensure_ascii=False, default=str)[:2000]}\n\n*完整报告生成失败: {reporter_err}*"

Series Navigation

| Article | Topic |
|---|---|
| 1 | System Overview |
| 2 | Five-Agent Collaboration |
| 3 | RIASEC Holland Codes |
| 4–7 | State · Routing · Nesting · 7 Error Handling (This Article) |
| 8–11 | LLM Layer · SSE/WS · DB Migration · PDF |
| 12–14 | JSON Prompt · RIASEC Prompt · Guide Prompt |
| 15–17 | Docker · Middleware · Configuration |
