Subagent stall protection in practice: closing the runTimeout blind spot with a `stallTimeoutSeconds` approach
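Problem/scenario: a subagent that stops making tool calls can hang indefinitely, and the total-duration limit `runTimeoutSeconds` alone cannot tell a slow task from a stalled one. Prerequisites: you can modify the subagent default configuration and monitor `sessions_history`. Implementation steps: 1) measure the interval since the last tool call across recent stall samples; 2) choose a trial threshold (e.g. 120-300s) and enable it in staging; 3) keep a higher threshold or an allowlist for long-running tasks; 4) on trigger, raise an alert automatically and reclaim the session. Key configuration: `agents.defaults.subagents.stallTimeoutSeconds` (proposed). Verification: when a stall with no tool calls is simulated, the subagent terminates automatically within the threshold and the main session receives a notice. Risks and limits: this setting comes from a feature proposal, so the exact field name and behavior need verification. Source attribution: Issue #39305.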
GitHub | Discovered 2026-03-08 | Author: emcervini
Prerequisites
- You can edit OpenClaw subagent defaults and observe session execution telemetry.
- You have historical runs to estimate normal tool-call gaps for long jobs.
Steps
- Measure baseline idle gaps from completed subagent jobs to avoid too-aggressive timeout values.
- Enable a stall-timeout policy in staging (or an equivalent watchdog) and keep runTimeout as the hard cap on total run duration.
- Define exception classes for known long-idle workloads (e.g., external API backoff windows).
- When the stall timeout triggers, auto-kill the subagent and emit a parent-session summary that includes the last tool-call timestamp.
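The steps above can be sketched as a small watchdog check. This is a minimal illustration, not OpenClaw's implementation: `SubagentRun`, `check_run`, `parent_summary`, and the `stall_exempt` flag are hypothetical names introduced here, and the stall timeout itself is still only a proposal. The sketch shows the intended ordering: the hard cap always applies, exempt workloads skip only the stall check, and idle time is measured from the last tool call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubagentRun:
    """Hypothetical view of one subagent run; all timestamps are in seconds."""
    started_at: float
    last_tool_call_at: Optional[float]  # None if no tool call has happened yet
    stall_exempt: bool = False          # known long-idle workload (e.g. API backoff)


def check_run(run: SubagentRun, now: float,
              stall_timeout_s: float, run_timeout_s: float) -> str:
    """Return 'ok', 'stalled', or 'run_timeout' for a single run."""
    # The hard cap always applies, even to stall-exempt workloads.
    if now - run.started_at >= run_timeout_s:
        return "run_timeout"
    if run.stall_exempt:
        return "ok"
    # Idle time is measured from the last tool call, or from start if none yet.
    if run.last_tool_call_at is not None:
        last_activity = run.last_tool_call_at
    else:
        last_activity = run.started_at
    if now - last_activity >= stall_timeout_s:
        return "stalled"
    return "ok"


def parent_summary(run: SubagentRun, now: float, reason: str) -> str:
    """Build the message surfaced to the parent session after a kill."""
    if run.last_tool_call_at is None:
        last = "never"
    else:
        last = f"{run.last_tool_call_at:.0f}s"
    return f"subagent terminated ({reason}); last tool call at {last}, now {now:.0f}s"
```

A caller would poll `check_run` for each active run and, on any result other than `ok`, kill the run and pass `parent_summary` to the parent session. Keeping the check a pure function of timestamps makes it easy to test with injected stalls.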
Commands
openclaw gateway status
openclaw help
openclaw gateway restart
Verify
Injected stalled runs are terminated within threshold while healthy long tasks still complete.
Caveats
- The feature is currently a proposal; production rollout should wait for upstream implementation details (needs verification).
- Too-low thresholds can kill legitimate tasks during quiet phases.
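One way to reduce the risk of a too-low threshold is to derive it from the idle gaps seen in healthy historical runs. The sketch below is an assumption-laden illustration: the helper names, the 99th-percentile choice, and the 1.5x safety margin are hypothetical, and the 120s floor simply reuses the lower end of the trial range mentioned above. Input is assumed to be one sorted list of tool-call timestamps (in seconds) per completed run.

```python
import math
from typing import List, Sequence


def max_idle_gaps(runs: Sequence[Sequence[float]]) -> List[float]:
    """For each completed run (sorted tool-call timestamps), return its largest gap."""
    gaps = []
    for timestamps in runs:
        if len(timestamps) < 2:
            continue  # a single tool call has no gap to measure
        gaps.append(max(b - a for a, b in zip(timestamps, timestamps[1:])))
    return gaps


def suggest_stall_timeout(gaps: Sequence[float], percentile: float = 0.99,
                          margin: float = 1.5, floor_s: float = 120.0) -> float:
    """Nearest-rank percentile of observed gaps, times a margin, never below floor_s."""
    if not gaps:
        return floor_s
    ordered = sorted(gaps)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return max(floor_s, ordered[rank - 1] * margin)
```

For example, if the longest healthy gap observed is 200s, the suggestion is 300s. Workloads whose legitimate gaps sit far above the rest are better handled as exceptions than by raising the threshold for every subagent.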
Source attribution
This tip is aggregated from community/public sources and preserved with attribution.