LangGraph Error Handling and Fault Tolerance Design: 5 Strategies to Prevent AI Agent System Crashes

0. Series Loop (Follow Along Without Open Source Code)

End-to-End Chain: Vue frontend → api/routes/chat.py → Guide multi-turn SSE → run_analysis_pipeline (parse→analyze→match→report) → tools/pdf_exporter PDF.
This article: 7/17 · Error Handling Loop · No Crash

Stage	User Visible	Code Entry	Article
Create session	Welcome message	POST /api/sessions	09
Multi-turn conversation	SSE streaming	chat/stream → run_guide_single_turn	06, 14
Info sufficient	Start analysis	_run_analysis_background	05, 07
Resume parsing	Progress 30%	run_resume_parser	12
Profile/RIASEC	Progress 50%	run_profile_analyzer	03, 13
Career matching	Progress 70%	run_career_matcher	02
Report	Progress 90%	run_reporter	11
Download PDF	File	GET …/report/pdf	11, 15

	Description
Before reading	Article 05 routing, Article 08 LLM calls
After reading	List fallback chains when Ollama is unavailable
Next loop	Article 09: Background task _run_analysis_background (Article 8)

Full series loop index: SERIES-LOOP.md

1. What Problem to Solve

The iCan top-level workflow chains 5 LLM-dependent nodes (Guide → ResumeParser → ProfileAnalyzer → CareerMatcher → Reporter). If any step times out, returns invalid JSON, or the Ollama/cloud API goes down, then without isolation the entire analysis returns a 500 and the conversation the user has already entered is wasted.

The project implements error handling at three levels:

Before invocation: llm/providers.py's check_ollama_available probes LLM reachability;
During invocation: invoke_llm / invoke_llm_with_json with 60s timeout + llm/parsers.py multi-strategy JSON extraction;
After invocation: try/except on each node in workflow.py, plus run_analysis_pipeline's phased catch and the _generate_fallback_report rule-engine fallback.

Three-layer error handling architecture

2. Strategy 1: Health Check + 30-Second Cache

Before running the four analysis agents, run_analysis_pipeline first calls check_ollama_available() (function name is legacy; it actually probes the OpenAI-compatible /chat/completions at settings.LLM_BASE_URL, not limited to Ollama).

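# llm/providers.py
import time as _time  # alias implied by the _time.time() call below

# Process-wide cache; last_check=0 means the first call always probes
_ollama_cache = {"available": True, "last_check": 0}

async def check_ollama_available() -> bool:
    now = _time.time()
    # Within 30 seconds of the last probe, reuse the cached result
    if now - _ollama_cache["last_check"] < 30:
        return _ollama_cache["available"]

    _ollama_cache["last_check"] = now
    base_url = settings.LLM_BASE_URL.rstrip("/")
    # ...
    # Smallest possible chat completion, used as the liveness probe
    resp = await client.post(
        f"{base_url}/chat/completions",
        json={
            "model": settings.LLM_MODEL_CHAT,
            "messages": [{"role": "user", "content": "hi"}],
            "max_tokens": 5,
        },
    )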

Design highlights:

30-second cache: Avoids sending a probe request for every session, reducing latency and quota overhead;
max_tokens=5: Minimizes probe cost;
A failure caches False: For the next 30 seconds, calls fall back immediately instead of waiting on repeated timeouts.

When the LLM is unavailable, run_analysis_pipeline in workflow.py skips the four LLM agents, switches to _regex_quick_profile + _generate_fallback_report, and marks ollama_unavailable: True in the DB.
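The caching behaviour described above can be sketched independently of any HTTP client. This is a minimal illustration, not the project's code: the class name CachedHealthCheck, the injectable probe and clock parameters, and the None sentinel for "never probed" are all assumptions introduced here so the logic can be exercised without a live LLM endpoint.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional

CACHE_TTL_SECONDS = 30  # the 30-second window described in the article


class CachedHealthCheck:
    """Hypothetical helper: caches the result of an async probe for a fixed window."""

    def __init__(
        self,
        probe: Callable[[], Awaitable[bool]],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl
        self._clock = clock
        self._available = True  # optimistic default, as in the article
        self._last_check: Optional[float] = None  # None means "never probed"

    async def is_available(self) -> bool:
        now = self._clock()
        # Inside the window: answer from the cache without touching the network.
        if self._last_check is not None and now - self._last_check < self._ttl:
            return self._available
        self._last_check = now
        try:
            self._available = bool(await self._probe())
        except Exception:
            # Any probe error counts as "unavailable" and is cached too,
            # so callers fall back quickly for the rest of the window.
            self._available = False
        return self._available
```

A caller would pass in a coroutine function that performs the real request (for example the max_tokens=5 chat completion) and returns True on success. Injecting the clock keeps the time window testable without sleeping.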

3. Strategy 2: `asyncio.wait_for` Hard Timeout

invoke_llm in llm/providers.py wraps all Chat calls with a 60-second upper limit:

`response = await asyncio.wait_for(model.ainvoke(processed, **kwargs), timeout=60)`

On timeout, it raises TimeoutError("AI model response timeout, please retry later"). get_chat_model() also sets request_timeout=90 at the HTTP layer, so the 60s limit is the earlier cutoff, enforced at the application layer.

The API layer in api/routes/chat.py wraps another 90-second wait_for around run_guide_chat, offering users a friendlier “please retry later” message instead of a bare 500.

Empirical ranges (not hard rules): normal replies 2–5s, ProfileAnalyzer 10–30s, Reporter chapter generation may take 30–50s; over 60s is treated as abnormal.
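The two timeout layers described above can be sketched as follows. This is an illustration under assumptions: call_with_timeout and guarded_reply are hypothetical names, and the friendly message text is a placeholder; only asyncio.wait_for and the 60s/90s split come from the article.

```python
import asyncio
from typing import Any, Awaitable

FRIENDLY_MESSAGE = "The model is taking too long, please retry later."  # placeholder text


async def call_with_timeout(awaitable: Awaitable[Any], timeout: float = 60.0) -> Any:
    """Application-layer cutoff: turn a wait_for timeout into a readable TimeoutError."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise TimeoutError("AI model response timeout, please retry later") from exc


async def guarded_reply(awaitable: Awaitable[str], timeout: float = 90.0) -> str:
    """API-layer cutoff: never let a timeout surface as a bare 500."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except (asyncio.TimeoutError, TimeoutError):
        return FRIENDLY_MESSAGE
```

Keeping the inner limit shorter than the outer one means the inner layer normally fires first, so the outer wait_for acts as a safety net rather than the primary cutoff.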

4. Strategy 3: JSON Four-Level Degradation Parsing

Structured agents (ResumeParser, CareerMatcher, etc.) go through invoke_llm_with_json: it first tries response_format=json_object, falls back to plain text if unsupported, then uses parse_json_from_text from llm/parsers.py:

Strategy 1: ```json ... ``` code block
  ↓ fail
Strategy 2: normal ``` ... ``` (starts with { or [)
  ↓ fail
Strategy 3: regex match outermost { ... }
  ↓ fail
Strategy 4: json.loads full text
  ↓ fail
Return {} (no exception thrown)

parse_json_from_text catches any JSONDecodeError and returns {}, so the caller always gets a dict back instead of an exception. invoke_llm_with_json still raises ValueError on {}: that covers the “business must have JSON” scenario, a different responsibility from the parser's “try to extract”.
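The four-level chain above can be sketched as a list of candidate strings tried in order. This is not the code from llm/parsers.py: parse_json_sketch, require_json, and the regular expressions are assumptions written for illustration, and the greedy brace match is only one simple way to approximate "outermost { ... }".

```python
import json
import re
from typing import Any, List

FENCE = "`" * 3  # three backticks, built this way to keep the example fence-safe
_JSON_FENCE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
_ANY_FENCE = re.compile(FENCE + r"\s*(.*?)" + FENCE, re.DOTALL)
_OUTER_BRACES = re.compile(r"\{.*\}", re.DOTALL)  # greedy: first "{" to last "}"


def _try_load(candidate: str) -> Any:
    """Return the parsed value, or None when the candidate is not valid JSON."""
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def parse_json_sketch(text: str) -> Any:
    """Try each extraction strategy in order; return {} if none of them parses."""
    candidates: List[str] = []
    match = _JSON_FENCE.search(text)            # strategy 1: json-tagged code block
    if match:
        candidates.append(match.group(1).strip())
    for match in _ANY_FENCE.finditer(text):     # strategy 2: plain code block
        body = match.group(1).strip()
        if body.startswith(("{", "[")):
            candidates.append(body)
    match = _OUTER_BRACES.search(text)          # strategy 3: outermost braces
    if match:
        candidates.append(match.group(0))
    candidates.append(text.strip())             # strategy 4: the full text
    for candidate in candidates:
        parsed = _try_load(candidate)
        if parsed is not None:
            return parsed
    return {}


def require_json(text: str) -> Any:
    """Stricter wrapper for callers that cannot proceed without JSON."""
    data = parse_json_sketch(text)
    if not data:
        raise ValueError("LLM reply contained no usable JSON")
    return data
```

The split mirrors the responsibilities described above: the parser never raises, while the stricter wrapper decides that an empty result is an error.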

5. Strategy 4: Node-Level Exception Isolation

In workflow.py, the five top-level nodes each have try/except. On failure, they don’t raise but write safe defaults, allowing LangGraph to continue (or at least return a displayable state):

Node	Returns on exception
`guide_node`	Keep original `conversation_history`, `needs_more_info=True`
`resume_parser_node`	`structured_profile={}`
`profile_analyzer_node`	`personal_profile={}`
`career_matcher_node`	`career_matches=[]`
`reporter_node`	Fixed Markdown failure text

Example of reporter_node fallback:

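except Exception as e:
    # Log message in English: "report output node raised an exception"
    logger.error("[reporter_node] 报告输出节点执行异常: %s", e, exc_info=True)
    return {
        # Report text in English: "iCan Career Planning Report / Report generation failed, please retry later."
        "final_report": "# iCan 职业规划报告\n\n报告生成失败，请稍后重试。",
        "current_agent": "reporter",
        # Message in English: "report output node error: <exception text>"
        "workflow_messages": [f"报告输出节点异常: {str(e)}"],
    }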

Comparison: Without isolation, Reporter error → whole graph ainvoke fails → CLI/API 500; with isolation, users at least see failure description or partial sections.
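The article's nodes write their try/except inline. As an alternative illustration of the same isolation idea, the pattern can be factored into a decorator. Everything here is an assumption made for the sketch: the name isolated, the English message format, and the two toy nodes are hypothetical and are not taken from workflow.py.

```python
import asyncio
import copy
import functools
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger("workflow_sketch")

State = Dict[str, Any]
Node = Callable[[State], Awaitable[State]]


def isolated(agent: str, defaults: State) -> Callable[[Node], Node]:
    """Wrap a node so that any exception becomes a safe partial state update."""

    def decorator(node: Node) -> Node:
        @functools.wraps(node)
        async def wrapper(state: State) -> State:
            try:
                return await node(state)
            except Exception as exc:
                logger.error("[%s] node failed: %s", agent, exc, exc_info=True)
                # Deep copy so mutable defaults are never shared between calls.
                update = copy.deepcopy(defaults)
                update["current_agent"] = agent
                update["workflow_messages"] = [f"{agent} node error: {exc}"]
                return update

        return wrapper

    return decorator


@isolated("career_matcher", {"career_matches": []})
async def career_matcher_node(state: State) -> State:
    # Toy node that always fails, to show the fallback path.
    raise RuntimeError("model returned malformed output")


@isolated("resume_parser", {"structured_profile": {}})
async def resume_parser_node(state: State) -> State:
    # Toy node that succeeds, to show the wrapper is transparent on success.
    return {"structured_profile": {"name": state["raw_input"]}}
```

Calling asyncio.run(career_matcher_node({})) returns the default career_matches list plus an error message instead of raising, which matches the "Returns on exception" column in the table above.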

When route_after_guide fails, it returns resume_parser_node. This is a “fail-open advance” at the routing layer, in contrast to the guide node's fail-closed behavior (keep asking for info): at the routing layer a dead loop is the greater risk, so the safe default is to move forward.
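The contrast between fail-open routing and a fail-closed node can be shown in a few lines. This is a simplified sketch, not the real route_after_guide: the function name, the constants, and the decision to read only needs_more_info are assumptions made for illustration.

```python
from typing import Any

GUIDE = "guide_node"
RESUME_PARSER = "resume_parser_node"


def route_after_guide_sketch(state: Any) -> str:
    """Fail-open router: any error while reading state advances the graph."""
    try:
        return GUIDE if state["needs_more_info"] else RESUME_PARSER
    except Exception:
        # A broken or incomplete state must not send the graph back into
        # the guide loop, so the router moves forward by default.
        return RESUME_PARSER
```

A missing key or a state that is not a mapping both end in resume_parser_node, whereas the guide node's own fallback sets needs_more_info=True and keeps the user in the conversation.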

6. Strategy 5: `run_analysis_pipeline` Phased Error Handling

Online report generation mainly goes through run_analysis_pipeline (called by api/routes/chat.py, upload.py, report_gen.py), not through the top-level LangGraph guide loop. Its error handling is “each phase independent try, continue with empty data on failure”:

# workflow.py — simplified flow
try:
    parser_result = await run_resume_parser(parser_state)
    structured_profile = parser_result.get("structured_profile", {})
except Exception as parser_err:
    structured_profile = {}

if not structured_profile:
    structured_profile = {"basic_info": {"raw_text": combined_text[:500], "source": "fallback"}}

try:
    analyzer_result = await run_profile_analyzer(analyzer_state)
except Exception as analyzer_err:
    analyzer_result = {}

# matcher, reporter similarly...

When the Reporter phase fails, it doesn’t return an empty string; instead, it constructs Markdown containing a summary of personal_profile JSON and appends reporter_err at the end for easier OPS log correlation.
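The "independent try per phase" pattern and the Reporter summary fallback can be sketched with a small helper. This is an illustration under assumptions: run_phase, summary_report, demo_pipeline, and the degraded list are hypothetical names, and the English report text is a placeholder rather than the project's actual wording.

```python
import asyncio
import json
import logging
from typing import Any, Awaitable, Callable, Dict, List

logger = logging.getLogger("pipeline_sketch")

Phase = Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]]


async def run_phase(name: str, phase: Phase, state: Dict[str, Any],
                    degraded: List[str]) -> Dict[str, Any]:
    """Run one phase; on failure log it, record its name, and return {}."""
    try:
        return await phase(state)
    except Exception as exc:
        logger.error("[pipeline] %s failed, continuing with empty data: %s", name, exc)
        degraded.append(name)
        return {}


def summary_report(personal_profile: Dict[str, Any], err: Exception) -> str:
    """Reporter fallback: a truncated JSON summary plus the error text."""
    summary = json.dumps(personal_profile, ensure_ascii=False, default=str)[:2000]
    return (
        "# Career Planning Report\n\n"
        "## Personal Profile Summary\n\n"
        f"{summary}\n\n"
        f"*Full report generation failed: {err}*"
    )


async def demo_pipeline(parser: Phase, analyzer: Phase, text: str) -> Dict[str, Any]:
    """Two-phase toy pipeline showing how empty results flow forward."""
    degraded: List[str] = []
    parsed = await run_phase("resume_parser", parser, {"raw_input": text}, degraded)
    profile = parsed.get("structured_profile") or {
        "basic_info": {"raw_text": text[:500], "source": "fallback"}
    }
    analysis = await run_phase(
        "profile_analyzer", analyzer, {"structured_profile": profile}, degraded
    )
    return {"structured_profile": profile, "analysis": analysis, "degraded": degraded}
```

Recording the names of degraded phases is an addition of this sketch; it gives the frontend something concrete to show when a report is produced from partial data.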

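When the LLM is completely unavailable, the entire LLM chain is skipped, and _generate_fallback_report outputs a rule-engine report with a ⚠️ note (in English: "Note: the AI model is temporarily unavailable; this report was generated quickly by the rule engine."):

`sections.append("> ⚠️ 注意：AI 模型暂不可用，本报告基于规则引擎快速生成。")`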

An outer catch still exists: it logs the error, calls ws_manager.send_error to notify the frontend, then re-raises. That path is for DB/session-level disasters, not single-agent failures.
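The outer "log, notify, re-raise" guard can be sketched as a wrapper around the whole pipeline. This is an illustration only: with_outer_guard and the notify_error callback are hypothetical stand-ins (the callback plays the role that ws_manager.send_error plays in the article), and treating notification as best-effort is a design choice of this sketch.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("pipeline_sketch")


async def with_outer_guard(
    run: Callable[[], Awaitable[Any]],
    notify_error: Callable[[str], Awaitable[None]],
) -> Any:
    """Run the pipeline; on an unrecoverable error log, notify, and re-raise."""
    try:
        return await run()
    except Exception as exc:
        logger.error("[pipeline] unrecoverable failure: %s", exc, exc_info=True)
        try:
            await notify_error(str(exc))
        except Exception as notify_exc:
            # Notification is best-effort and must not mask the original error.
            logger.warning("[pipeline] could not notify client: %s", notify_exc)
        raise  # re-raises the original pipeline exception
```

Unlike the per-phase handlers, this layer does not substitute a default: the caller still sees the exception, but only after the client has been told something went wrong.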

7. Interaction with Loop Limits (Article 5)

Error handling also includes guards against infinite loops (see Article 5 for details):

agents/guide.py should_continue: loop_count >= 8;
workflow.py route_after_guide: user_msg_count >= 3;
recursion_limit: subgraph 15, full workflow 50.

Exceeding a loop limit is essentially “forced advancement”: it prevents error + retry from forming a logical dead loop in the graph.
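The two counters listed above can be sketched as plain functions over the state dict. The thresholds 8 and 3 come from the article; the function names, the "continue"/"end" labels, and the exact state keys are assumptions made for illustration and may differ from agents/guide.py and workflow.py.

```python
from typing import Any, Dict

MAX_GUIDE_LOOPS = 8      # loop_count >= 8 in should_continue
MAX_USER_MESSAGES = 3    # user_msg_count >= 3 in route_after_guide


def should_continue_sketch(state: Dict[str, Any]) -> str:
    """Guide subgraph: stop looping once the counter reaches the cap."""
    if state.get("loop_count", 0) >= MAX_GUIDE_LOOPS:
        return "end"  # forced advancement, regardless of needs_more_info
    return "continue" if state.get("needs_more_info", True) else "end"


def route_after_guide_limited(state: Dict[str, Any]) -> str:
    """Top-level routing: advance once the user has sent enough messages."""
    history = state.get("conversation_history", [])
    user_msg_count = sum(1 for msg in history if msg.get("role") == "user")
    if user_msg_count >= MAX_USER_MESSAGES:
        return "resume_parser_node"
    return "guide_node" if state.get("needs_more_info", True) else "resume_parser_node"
```

In both functions the counter check comes first, so a node that keeps failing into its fail-closed default (needs_more_info=True) still cannot hold the graph in place indefinitely.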

8. Error Handling Layer Overview

Request enters run_analysis_pipeline / run_workflow
  ↓
[1] check_ollama_available → unavailable → _regex_quick_profile + _generate_fallback_report
  ↓
[2] invoke_llm wait_for 60s → TimeoutError → caught by node/API layer
  ↓
[3] parse_json_from_text four layers → fail → {}
  ↓
[4] Each workflow node try/except → safe defaults
  ↓
[5] Pipeline phased try → empty dict/list continue + reporter summary fallback
  ↓
[6] Loop/recursion_limit → forced handoff / resume_parser
  ↓
Return final_report (complete, partial, or rule-engine version)

9. Pitfalls and Edge Cases

Misleading name check_ollama_available
It probes the current LLM_BASE_URL (could be DeepSeek, OpenAI, Ollama), not only Ollama. After switching to cloud in .env, if Ollama is down but cloud is up, it still caches True/False based on cloud result.
Health check default _ollama_cache["available"] = True
On process startup, before first probe, the first pipeline assumes available; if actually unavailable, it waits for the first POST failure to cache False. For high-availability scenarios, consider warm-up probing at startup.
Node isolation “empty dict continue” yields thin reports
When profile_analyzer fails, personal_profile has many empty fields, but Reporter still runs — users see “a report but content is thin”, better than 500, but should be distinguished via workflow_messages or progress prompts in frontend.
run_guide_chat exception has separate fallback
Returns fixed message “Sorry, something went wrong. Could you say that again?”, with is_info_sufficient=False, won’t accidentally trigger run_analysis_pipeline.
Reporter chapter generation uses get_chat_model()
Different from get_light_model(); do not assume Reporter has switched to mini model based on old comments (see Article 8 call table).
On the error-handling path, Reporter may still be the slowest and most timeout-prone phase; rule-engine degradation only covers “entire LLM unavailable”, not “only Reporter timed out”.
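The warm-up probing suggested for the optimistic cache default could look like the sketch below. It is hypothetical: warm_up_probe is not part of the project, and the probe argument is assumed to be an uncached check (a cached one would return the same answer on every retry within its 30-second window).

```python
import asyncio
from typing import Awaitable, Callable


async def warm_up_probe(
    probe: Callable[[], Awaitable[bool]],
    attempts: int = 3,
    delay: float = 1.0,
) -> bool:
    """Probe the LLM at startup, retrying a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            if await probe():
                return True
        except Exception:
            # A raising probe is treated the same as an unavailable service.
            pass
        if attempt < attempts:
            await asyncio.sleep(delay)
    return False
```

The result could be written into the health-check cache before the first request arrives, so the first pipeline run no longer relies on the optimistic default.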

10. Summary

Before invocation: llm/providers.py cached health check, workflow.py rule-engine report when unavailable.
During invocation: 60s timeout + llm/parsers.py multi-strategy JSON extraction.
After invocation: five workflow nodes isolated individually; run_analysis_pipeline phased catch, Reporter failure still produces summary version.
The goal is not “never fail”, but to make failures perceptible and degradable, without crashing the whole graph.
Next article (Article 8) expands on get_chat_model / get_light_model and unified LLM calling interface.

Appendix: Key Source Code (Line-by-Line Annotations)

The following code is excerpted from the iCan implementation. Each line has a comment above it so you can follow along without the repository.
Generation command: python3 bin/build-ican-annotated-snippets.py

guide_node exception return

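# ========== guide_node exception return ==========
# Source file: workflow.py  lines 107-114

# L107: Catch the exception so the whole graph/request does not crash
    except Exception as e:
# L108: Log it for production troubleshooting (message: "guide node raised an exception")
        logger.error("[guide_node] 对话引导节点执行异常: %s", e, exc_info=True)
# L109: Return the fields this node merges into state (LangGraph merges them)
        return {
# L110: Multi-turn conversation list; each element is {role, content}
            "conversation_history": state.get("conversation_history", []),
# L111: Mark guide as the agent that produced this update
            "current_agent": "guide",
# L112: Whether to stay in the Guide loop; False means resume_parser may start
            "needs_more_info": True,
# L113: Record the error text (message: "guide node error: <exception text>")
            "workflow_messages": [f"对话引导节点异常: {str(e)}"],
# L114: End of the returned dict
        }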

Ollama unavailable → rule-based report

# ========== Ollama unavailable → rule-based report ==========
# Source file: workflow.py  lines 734-767

# L734: Import the health check
        from ican.llm.providers import check_ollama_available
# L735: Probe the LLM service; on failure, fall back to the rule-engine report
        ollama_ok = await check_ollama_available()
# L736: Branch: the LLM is unreachable
        if not ollama_ok:
# L737: Log a warning (message: "Ollama unavailable, generating the report with the quick rule engine")
            logger.warning("[run_analysis_pipeline] Ollama 不可用，使用快速规则引擎生成报告")
# L738: Build a basic profile from the raw text with regexes
            structured_profile = _regex_quick_profile(combined_text)
# L739: Generate the rule-engine report from that profile
            final_report = _generate_fallback_report(structured_profile, combined_text)
# L740: Log completion and the report length (message: "quick report generated, length=%d")
            logger.info("[run_analysis_pipeline] 快速报告生成完成，长度=%d", len(final_report))
# L741: Start a try block; the except below handles DB failures
            try:
# L742: Import the DB session factory
                from ican.db.session import get_db_session
# L743: Import the session repository
                from ican.db.repository import SessionRepository
# L744: Obtain a DB session from the generator
                db = next(get_db_session())
# L745: Inner try so the session is always closed
                try:
# L746: Create the repository on this session
                    repo = SessionRepository(db)
# L747: Persist the session record
                    repo.save_session(
# L748: Session identifier
                        session_id=session_id,
# L749: User identifier, defaulting to "system"
                        user_id=user_id or "system",
# L750: Mark the session as completed
                        status="completed",
# L751: The current stage is the report
                        current_stage="report",
# L752: JSON field: stores conversation history, intermediate results, final_report, etc.
                        workflow_data={
# L753: Regex-derived profile
                            "structured_profile": structured_profile,
# L754: Rule-engine report text
                            "final_report": final_report,
# L755: Flag that this report was produced without the LLM
                            "ollama_unavailable": True,
# L756: End of workflow_data
                        },
# L757: End of the save_session call
                    )
# L758: Cleanup that runs whether or not the save succeeded
                finally:
# L759: Close the DB session
                    db.close()
# L760: Catch DB errors so a failed save does not fail the request
            except Exception as db_err:
# L761: Log the save failure (message: "failed to save quick report")
                logger.error("[run_analysis_pipeline] 保存快速报告失败: %s", db_err)
# L762: Return the pipeline result to the caller
            return {
# L763: Regex-derived profile
                "structured_profile": structured_profile,
# L764: No LLM analysis was run, so the personal profile is empty
                "personal_profile": {},
# L765: No career matching was run
                "career_matches": [],
# L766: Rule-engine report text
                "final_report": final_report,
# L767: End of the returned dict
            }

Pipeline phased try/except

# ========== pipeline phased try/except ==========
# Source file: workflow.py  lines 769-818

# L769: Build the parser input from the combined text
        parser_state = {"raw_input": combined_text, "input_type": "text"}
# L770: Start a try block; the except below provides the fallback
        try:
# L771: Run the resume parser
            parser_result = await run_resume_parser(parser_state)
# L772: Extract the structured profile, defaulting to {}
            structured_profile = parser_result.get("structured_profile", {})
# L773: Catch the exception so the whole graph/request does not crash
        except Exception as parser_err:
# L774: Log the failure (message: "resume parsing failed, continuing with empty data")
            logger.error("[run_analysis_pipeline] 简历解析失败，使用空数据继续: %s", parser_err)
# L775: Continue with an empty profile
            structured_profile = {}

# L777: Branch: the structured profile is empty
        if not structured_profile or len(structured_profile) == 0:
# L778: Log a warning (message: "structured profile is empty, building basic data from the raw text")
            logger.warning("[run_analysis_pipeline] 结构化画像为空，尝试从原始文本构建基础数据")
# L779: Fall back to the first 500 characters of the raw text
            structured_profile = {"basic_info": {"raw_text": combined_text[:500], "source": "fallback"}}

# L781: Start a try block; the except below provides the fallback
        try:
# L782: Build the analyzer input
            analyzer_state = {"structured_profile": structured_profile}
# L783: Run the profile analyzer
            analyzer_result = await run_profile_analyzer(analyzer_state)
# L784: Catch the exception so the whole graph/request does not crash
        except Exception as analyzer_err:
# L785: Log the failure (message: "profile analysis failed, continuing with empty data")
            logger.error("[run_analysis_pipeline] 个人分析失败，使用空数据继续: %s", analyzer_err)
# L786: Continue with an empty analysis result
            analyzer_result = {}

# L788: Assemble personal_profile, with a default for every field
        personal_profile = {
# L789: Parsed (or fallback) structured profile
            "structured_profile": structured_profile,
# L790: Ability model, defaulting to {}
            "ability_model": analyzer_result.get("ability_model", {}),
# L791: Work style, defaulting to {}
            "work_style": analyzer_result.get("work_style", {}),
# L792: Personality traits, defaulting to {}
            "personality_traits": analyzer_result.get("personality_traits", {}),
# L793: Career values, defaulting to {}
            "career_values": analyzer_result.get("career_values", {}),
# L794: RIASEC scores, defaulting to {}
            "riasec_scores": analyzer_result.get("riasec_scores", {}),
# L795: Strengths, defaulting to []
            "strengths": analyzer_result.get("strengths", []),
# L796: Weaknesses, defaulting to []
            "weaknesses": analyzer_result.get("weaknesses", []),
# L797: Overall summary, read from the analyzer's structured_profile, defaulting to ""
            "overall_summary": analyzer_result.get("structured_profile", {}).get("overall_summary", ""),
# L798: End of personal_profile
        }

# L800: Start a try block; the except below provides the fallback
        try:
# L801: Build the matcher input
            matcher_state = {"personal_profile": personal_profile}
# L802: Run the career matcher
            matcher_result = await run_career_matcher(matcher_state)
# L803: Extract the recommended paths, defaulting to []
            career_matches = matcher_result.get("recommended_paths", [])
# L804: Catch the exception so the whole graph/request does not crash
        except Exception as matcher_err:
# L805: Log the failure (message: "career matching failed, continuing with empty data")
            logger.error("[run_analysis_pipeline] 职业匹配失败，使用空数据继续: %s", matcher_err)
# L806: Continue with no matches
            career_matches = []

# L808: Build the reporter input
        reporter_state = {
# L809: Assembled personal profile
            "personal_profile": personal_profile,
# L810: Career matches (possibly empty)
            "career_matches": career_matches,
# L811: Empty action plan
            "action_plan": {},
# L812: End of reporter_state
        }
# L813: Start a try block; the except below provides the fallback
        try:
# L814: Run the reporter
            reporter_result = await run_reporter(reporter_state)
# L815: Extract the final report, defaulting to ""
            final_report = reporter_result.get("final_report", "")
# L816: Catch the exception so the whole graph/request does not crash
        except Exception as reporter_err:
# L817: Log the failure (message: "report generation failed")
            logger.error("[run_analysis_pipeline] 报告生成失败: %s", reporter_err)
# L818: Build the summary fallback: a heading, the first 2000 characters of the personal_profile JSON, and the error text
            final_report = f"# 职业规划报告\n\n基于您的简历分析，报告生成过程中遇到问题。\n\n## 个人画像摘要\n\n{json.dumps(personal_profile, ensure_ascii=False, default=str)[:2000]}\n\n*完整报告生成失败: {reporter_err}*"

Article	Topic
1	System Overview
2	Five-Agent Collaboration
3	RIASEC Holland Codes
4–7	State · Routing · Nesting · 7 Error Handling (This Article)
8–11	LLM Layer · SSE/WS · DB Migration · PDF
12–14	JSON Prompt · RIASEC Prompt · Guide Prompt
15–17	Docker · Middleware · Configuration
