Subagent Stall Protection in Practice: Using a `stallTimeoutSeconds` Approach to Cover the runTimeout Blind Spot

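Problem/scenario: A subagent that stops making tool calls may hang for a long time, and the total-duration limit `runTimeoutSeconds` alone cannot distinguish a slow task from a stalled one. Prerequisites: you can modify the subagent default configuration and monitor `sessions_history`. Implementation steps: 1) measure the interval since the "last tool call" in recent stall samples; 2) pick a trial threshold (for example 120-300s) and enable it in staging; 3) keep a higher threshold or an allowlist for long-running tasks; 4) on trigger, alert automatically and reclaim the session. Key configuration: `agents.defaults.subagents.stallTimeoutSeconds` (proposal). Verification: when a stall with no tool calls is simulated, the subagent terminates automatically within the threshold and the main session receives a notice. Risks and limits: this setting comes from a feature proposal, so the exact field name and behavior need verification. Source attribution: Issue #39305.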

Source: GitHub | Discovered: 2026-03-08 | Author: emcervini

Prerequisites

You can edit OpenClaw subagent defaults and observe session execution telemetry.
You have historical runs to estimate normal tool-call gaps for long jobs.
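
One way to turn historical runs into a starting threshold is sketched below. It is only an illustration: `idle_gaps`, `suggest_stall_threshold`, the percentile, and the safety factor are hypothetical choices made for this example, not part of OpenClaw, and the input is assumed to be a list of tool-call timestamps (in seconds) that you have already extracted per run. The 120 s floor mirrors the low end of the trial range mentioned in the summary.

```python
from statistics import quantiles


def idle_gaps(tool_call_times):
    """Return the gaps (seconds) between consecutive tool-call timestamps."""
    ordered = sorted(tool_call_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def suggest_stall_threshold(runs, percentile=99, safety_factor=2.0,
                            floor_seconds=120.0):
    """Suggest a stall threshold from completed (healthy) runs.

    Hypothetical heuristic: take a high percentile of observed idle gaps,
    multiply by a safety factor, and never go below the floor.
    `percentile` must be an integer from 1 to 99.
    """
    gaps = [gap for run in runs for gap in idle_gaps(run)]
    if len(gaps) < 2:
        # Not enough data to estimate a percentile; fall back to the floor.
        return floor_seconds
    cut_points = quantiles(gaps, n=100, method="inclusive")
    return max(floor_seconds, cut_points[percentile - 1] * safety_factor)


# Example input: two completed runs, timestamps in seconds since run start.
history = [[0, 10, 20, 200], [0, 5, 400]]
threshold = suggest_stall_threshold(history)
```

A threshold derived this way is only a first guess for staging; workloads with known long quiet phases still need their own exceptions.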

Steps

Measure baseline idle gaps (the time between consecutive tool calls) from completed subagent jobs so the timeout value is not too aggressive.
Enable a stall-timeout policy (or an equivalent watchdog) in staging, and keep `runTimeoutSeconds` as the hard cap.
Define exception classes for known long-idle workloads (e.g., external API backoff windows).
On a stall-timeout trigger, automatically kill the subagent and emit a parent-session summary that includes the last tool-call timestamp.
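
Because the stall timeout is still a proposal, the steps above can also be approximated with an external watchdog. The following is a minimal sketch of the decision logic only, under stated assumptions: `SubagentRun`, `check_run`, `parent_summary`, the workload class names, and the default values are all hypothetical and do not reflect any OpenClaw API. How the run data is obtained and how a session is actually killed are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SubagentRun:
    """Hypothetical snapshot of one subagent run (times in seconds)."""
    session_id: str
    started_at: float
    last_tool_call_at: Optional[float]  # None if no tool call happened yet
    workload_class: str = "default"


def check_run(run: SubagentRun, now: float, stall_timeout: float = 180.0,
              run_timeout: float = 3600.0,
              stall_overrides: Optional[Dict[str, float]] = None) -> str:
    """Return "ok", "kill:stalled", or "kill:run_timeout" for a run."""
    overrides = stall_overrides or {}
    # The total-duration limit stays in place as the hard cap.
    if now - run.started_at >= run_timeout:
        return "kill:run_timeout"
    # Exception classes get their own (usually higher) stall limit.
    limit = overrides.get(run.workload_class, stall_timeout)
    last_activity = (run.last_tool_call_at
                     if run.last_tool_call_at is not None
                     else run.started_at)
    if now - last_activity >= limit:
        return "kill:stalled"
    return "ok"


def parent_summary(run: SubagentRun, reason: str) -> str:
    """Build the notice for the parent session, including the last tool time."""
    last = "never" if run.last_tool_call_at is None else f"t={run.last_tool_call_at:.0f}s"
    return f"Subagent {run.session_id} terminated ({reason}); last tool call: {last}"


# Example: a run whose last tool call was 300 s ago is flagged as stalled.
stalled = SubagentRun("s1", started_at=0.0, last_tool_call_at=100.0)
verdict = check_run(stalled, now=400.0)
notice = parent_summary(stalled, verdict)
```

Checking the hard cap first keeps the existing `runTimeoutSeconds` semantics unchanged; the stall check only adds an earlier exit for runs that have gone quiet.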

Commands

openclaw gateway status

openclaw help

openclaw gateway restart

Verify

Injected stalled runs are terminated within the threshold, while healthy long-running tasks still complete.

Caveats

The feature is currently a proposal; production rollout should wait for upstream implementation details (needs verification).
Too-low thresholds can kill legitimate tasks during quiet phases.

Source attribution

This tip is aggregated from community/public sources and preserved with attribution.
