Task Runs and Leases
How agents claim work, renew leases, finish runs, and release session-bound ownership safely.
- Audience: Operators running durable agent work
- Focus: Autonomy guidance shaped for scanability, day-two clarity, and operator context.
Task runs are the durable execution records for tasks. A run becomes claimable only after publish, start, approval, UI start, automation approval, or an equivalent API enqueues it. The task service is the only authority for run ownership and terminal state.
Claiming the next run
A managed agent session claims work either through the dedicated autonomy tool family or the parallel CLI. Both routes call the same task service writers and obey the same session-bound contract.
Tool path:
agh__task_run_claim_next { "wait": true, "lease_seconds": 300 }
CLI path:
agh task next --wait --lease-seconds 300 -o json
The claim is atomic. AGH selects one eligible queued run, binds it to the current managed session, sets a lease deadline, and returns a synchronous claim response that includes:
- task summary
- run summary
- safe lease summary
- claim_token_hash for observability
- coordination channel metadata when the run has a bound channel
The raw bearer lease token is internal to AGH. Public CLI, HTTP, UDS, native-tool, web, stream,
log, channel, and memory payloads use the calling session plus run_id and expose at most
claim_token_hash. Tools belonging to the agh__autonomy toolset reject any input or response
field that would carry a raw claim token.
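The atomic claim described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions, not AGH's implementation: the `TaskService` and `Run` classes, the single lock, and the use of SHA-256 for `claim_token_hash` are all illustrative choices that the source does not confirm.

```python
import hashlib
import secrets
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Run:
    run_id: str
    status: str = "queued"              # queued -> claimed -> terminal
    session_id: Optional[str] = None
    lease_deadline: Optional[float] = None


class TaskService:
    """Toy stand-in for the task service (hypothetical, for illustration)."""

    def __init__(self, runs: List[Run]) -> None:
        self._runs = list(runs)
        self._tokens: Dict[str, str] = {}   # raw tokens stay internal
        self._lock = threading.Lock()

    def claim_next(self, session_id: str, lease_seconds: int = 300,
                   now: float = 0.0) -> Optional[dict]:
        # Holding one lock makes "select, bind, set deadline" a single step.
        with self._lock:
            for run in self._runs:
                if run.status == "claimed" and run.session_id == session_id:
                    raise RuntimeError("AUTONOMY_LEASE_ALREADY_HELD")
            for run in self._runs:
                if run.status != "queued":
                    continue
                run.status = "claimed"
                run.session_id = session_id
                run.lease_deadline = now + lease_seconds
                token = secrets.token_hex(16)
                self._tokens[run.run_id] = token
                # Only a digest of the token leaves the service.
                digest = hashlib.sha256(token.encode()).hexdigest()
                return {
                    "run_id": run.run_id,
                    "lease_deadline": run.lease_deadline,
                    "claim_token_hash": digest,
                }
        return None  # nothing eligible to claim
```

In this model a second session that calls `claim_next` receives the next queued run, and the response never carries the raw token, which mirrors the public-surface rule above.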
Lease rules
The MVP lease contract is intentionally narrow:
| Rule | Behavior |
|---|---|
| One owner | Exactly one managed session may own a non-terminal run lease. |
| One active lease per session | A managed session may hold at most one active task-run lease in the MVP. |
| Session fencing | Heartbeat, complete, fail, and release resolve the internal lease from the caller session and run_id. |
| Bounded renewal | --lease-seconds must be zero or positive and is capped by the task service. Omitted or zero uses the service default. |
| Expiry recovery | Expired leases are recovered by boot recovery or scheduler sweeps through the task service. |
| Stale holders fail | A stale heartbeat or late complete after recovery fails explicitly; it does not extend or overwrite a newer claim. |
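The bounded-renewal and session-fencing rows can be sketched as two small functions. The default of 300 seconds, the cap of 900 seconds, the `leases` dictionary keyed by session, and the `AutonomyError` class are assumptions made for this example; the source only states that a default and a cap exist.

```python
from typing import Dict, Optional

SERVICE_DEFAULT_SECONDS = 300   # hypothetical service default
SERVICE_CAP_SECONDS = 900       # hypothetical service cap


class AutonomyError(Exception):
    """Carries one of the deterministic reason codes."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def resolve_lease_seconds(requested: Optional[int]) -> int:
    """Omitted or zero uses the default; larger values are capped."""
    if requested is None or requested == 0:
        return SERVICE_DEFAULT_SECONDS
    if requested < 0:
        raise ValueError("lease_seconds must be zero or positive")
    return min(requested, SERVICE_CAP_SECONDS)


def heartbeat(leases: Dict[str, dict], session_id: str, run_id: str,
              now: float, requested: Optional[int] = None) -> float:
    """Extend the lease only for the current, unexpired owner."""
    lease = leases.get(session_id)          # resolved from the caller session
    if lease is None:
        raise AutonomyError("AUTONOMY_NO_ACTIVE_LEASE")
    if lease["run_id"] != run_id:
        raise AutonomyError("AUTONOMY_FOREIGN_RUN")
    if now >= lease["deadline"]:
        # A stale holder fails explicitly and never extends the lease.
        raise AutonomyError("AUTONOMY_LEASE_EXPIRED")
    lease["deadline"] = now + resolve_lease_seconds(requested)
    return lease["deadline"]
```

The caller passes only its session and a `run_id`; the lease itself is looked up on the service side, which is the fencing behavior the table describes.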
Heartbeat
Use heartbeat when work is still active and the session still owns the run.
Tool path:
agh__task_run_heartbeat { "run_id": "run-123", "lease_seconds": 300 }
CLI path:
agh task heartbeat run-123 --lease-seconds 300
Heartbeats are task-service operations. Do not mirror routine heartbeats into coordination channel messages unless a human-readable update is useful.
Completing a run
Use complete for successful terminal state.
Tool path:
agh__task_run_complete { "run_id": "run-123", "result": { "summary": "tests passed" } }
CLI path:
agh task complete run-123 \
--result '{"summary":"tests passed"}'
The optional result JSON must not contain raw lease credentials.
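A caller can guard against leaking credentials before it builds the `--result` payload. The sketch below is a hypothetical client-side check: the key names in `FORBIDDEN_KEYS` are assumptions, since the source does not list which fields AGH rejects, and `claim_token_hash` is deliberately left allowed because it is the documented public field.

```python
import json
from typing import Any, Iterator

# Hypothetical key names; the source does not enumerate rejected fields.
FORBIDDEN_KEYS = {"claim_token", "lease_token"}


def forbidden_paths(value: Any, path: str = "$") -> Iterator[str]:
    """Yield JSON paths whose key looks like a raw lease credential."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            if str(key).lower() in FORBIDDEN_KEYS:
                yield child_path
            yield from forbidden_paths(child, child_path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from forbidden_paths(child, f"{path}[{index}]")


def encode_result(result: dict) -> str:
    """Serialize a result for --result, refusing suspicious keys."""
    bad = list(forbidden_paths(result))
    if bad:
        raise ValueError("result carries lease credentials at: " + ", ".join(bad))
    return json.dumps(result, separators=(",", ":"))
```

The same check would apply equally to the failure metadata passed to the fail path.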
Auto-enqueue on ready
Dependency-driven auto-enqueue is opt-in per task. When a task carries
auto_enqueue_on_ready, AGH enqueues its next run automatically as soon as a blocking dependency
completes and the task becomes ready, so no manual enqueue is required to advance the DAG.
Set it at create time, or toggle it later on an assembled tree:
agh task create --scope global --title "Deploy" --auto-enqueue-on-ready
agh task update task-123 --auto-enqueue-on-ready # turn on
agh task update task-123 --auto-enqueue-on-ready=false # turn off
The behavior is deliberately conservative:
- Readiness gates it. Only a successful completion reconciles dependents to ready; a failed or expired blocker does not satisfy a blocks edge, so it never triggers a premature enqueue.
- Paused dependents are skipped. An effectively paused dependent is left untouched until it resumes.
- At most one open run. Enqueue reuses the canonical task_runs path. The store's queued-run reservation rejects a second open run, so concurrent completions of different blockers, or a retried completion, converge on exactly one queued run, never duplicates.
- Completion never rolls back. Auto-enqueue runs after the completion has durably committed and is best-effort: it survives request cancellation and a failed enqueue is logged, not propagated back to the completing caller.
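The first three guarantees in the list above can be modeled in a few lines. This is a simplified sketch, not the AGH store: the `Task` and `Store` classes, the `blocked_by` edge set, and the `reserve_queued_run` method are names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Task:
    task_id: str
    blocked_by: Set[str] = field(default_factory=set)  # ids of blocking tasks
    auto_enqueue_on_ready: bool = False
    paused: bool = False


class Store:
    """Hypothetical store that allows at most one open run per task."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}
        self.completed: Set[str] = set()
        self.open_runs: Dict[str, str] = {}   # task_id -> run_id
        self._seq = 0

    def reserve_queued_run(self, task_id: str) -> bool:
        if task_id in self.open_runs:
            return False                      # second open run is rejected
        self._seq += 1
        self.open_runs[task_id] = f"run-{self._seq}"
        return True


def on_blocker_finished(store: Store, blocker_id: str, succeeded: bool) -> List[str]:
    """Return the task ids that received a newly queued run."""
    if not succeeded:
        return []          # failed or expired blockers never satisfy the edge
    store.completed.add(blocker_id)
    enqueued: List[str] = []
    for task in store.tasks.values():
        if blocker_id not in task.blocked_by:
            continue
        ready = task.blocked_by <= store.completed
        if not ready or not task.auto_enqueue_on_ready or task.paused:
            continue
        if store.reserve_queued_run(task.task_id):
            enqueued.append(task.task_id)
    return enqueued
```

Calling `on_blocker_finished` twice for the same successful blocker leaves exactly one entry in `open_runs`, which is the convergence property the reservation is meant to provide.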
Failing a claimed run
Use the session-bound fail path when the current claim cannot complete successfully. This path is
token-fenced server-side: the caller supplies a run_id, and AGH resolves the managed session's
active lease before mutating the run.
Tool path:
agh__task_run_fail { "run_id": "run-123", "error": "tests failed", "metadata": { "command": "make test" } }
CLI path:
agh task run fail run-123 \
--error "tests failed" \
--metadata '{"command":"make test"}'
Failure metadata must not contain raw lease credentials.
Releasing a claimed run
Use session-bound release when the current managed session should give up ownership without making
the run terminal. The autonomy tool resolves the internal lease from the caller session and
run_id; raw claim tokens never cross the public surface.
Tool path:
agh__task_run_release { "run_id": "run-123", "reason": "handoff" }
Release is also used by daemon-owned cleanup. For example, if a spawned child reaches TTL or its
parent stops, AGH releases active child leases with structured reasons such as ttl_expired or
parent_stopped before stopping the child session.
Force operations
Use force operations when a run needs operator recovery and the normal session-bound lease path is
not available or should not be trusted. Force operations still mutate task_runs only through
task.Service; they do not accept raw claim tokens, and they apply compare-and-swap state
preconditions before committing.
| Operation | CLI path | API route | Valid source state |
|---|---|---|---|
| Force release | agh task release run-123 --reason handoff | POST /api/runs/{id}/release | claimed |
| Force fail | agh task fail run-123 --reason "recovery" | POST /api/runs/{id}/fail | queued or claimed |
| Retry | agh task retry run-123 | POST /api/runs/{id}/retry | failed |
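The valid source states in the table act as compare-and-swap preconditions. The following sketch shows the idea for force release and force fail; the `Run` class and helper names are hypothetical, and a real store would express the check as one conditional update rather than a read followed by a write.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Run:
    run_id: str
    status: str
    session_id: Optional[str] = None
    failure_kind: Optional[str] = None


class PreconditionFailed(Exception):
    """The run was not in a valid source state for the operation."""


def compare_and_swap(run: Run, expected: Set[str], new_status: str) -> None:
    # Commit only if the run is still in one of the expected states.
    if run.status not in expected:
        raise PreconditionFailed(
            f"{run.run_id}: status {run.status!r} not in {sorted(expected)}"
        )
    run.status = new_status


def force_release(run: Run) -> None:
    compare_and_swap(run, {"claimed"}, "queued")   # back to the queue
    run.session_id = None


def force_fail(run: Run, reason: str) -> None:
    if not reason.strip():
        raise ValueError("force fail requires a non-empty reason")
    compare_and_swap(run, {"queued", "claimed"}, "failed")
    run.session_id = None
    run.failure_kind = "operator_forced"
```

A second `force_release` on the same run raises `PreconditionFailed` because the run is already queued, so repeated operator actions cannot silently overwrite newer state.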
Bulk force release and force fail use bounded batches of run IDs:
agh task release run-123 run-456 --reason handoff -o json
agh task fail run-123 run-456 --reason "provider credentials revoked" -o json
The HTTP and UDS bulk routes are POST /api/runs/bulk/release and
POST /api/runs/bulk/fail. Bulk responses report one row per run so an agent can retry only the
failed rows.
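Retrying only the failed rows might look like the sketch below. The response shape is an assumption: the source says only that bulk responses report one row per run, so the `results`, `run_id`, and `ok` field names here are hypothetical and should be checked against the actual payload. The command line mirrors the bulk CLI form shown above.

```python
import json
from typing import List


def failed_run_ids(bulk_json: str) -> List[str]:
    """Collect run ids whose row did not succeed (assumed row shape)."""
    payload = json.loads(bulk_json)
    return [
        row["run_id"]
        for row in payload["results"]        # hypothetical field name
        if not row.get("ok", False)          # hypothetical field name
    ]


def bulk_argv(verb: str, run_ids: List[str], reason: str) -> List[str]:
    """Build argv for `agh task release|fail <ids...> --reason ... -o json`."""
    if verb not in {"release", "fail"}:
        raise ValueError("bulk force operations are release and fail")
    return ["agh", "task", verb, *run_ids, "--reason", reason, "-o", "json"]
```

Passing the argument list to a process runner without a shell avoids quoting problems in the reason string.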
Force fail requires a non-empty reason and records failure_kind = "operator_forced" on the
run. Retry creates a new queued run linked through previous_run_id and refuses chains deeper
than the runtime cap. Force release and force fail invalidate queued input generations for the
previously bound session when that session exists, so stale prompts cannot be delivered after a
recovery action.
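The retry chain can be illustrated with a dictionary of runs linked through `previous_run_id`. The cap value of 3 is invented for the example; the source states only that a runtime cap exists.

```python
from typing import Dict

MAX_RETRY_DEPTH = 3  # hypothetical; the source does not give the cap


def chain_depth(runs: Dict[str, dict], run_id: str) -> int:
    """Count how many previous_run_id links sit behind a run."""
    depth = 0
    current = runs[run_id]
    while current.get("previous_run_id"):
        depth += 1
        current = runs[current["previous_run_id"]]
    return depth


def retry(runs: Dict[str, dict], run_id: str, new_run_id: str) -> dict:
    """Create a new queued run linked to a failed source run."""
    source = runs[run_id]
    if source["status"] != "failed":
        raise ValueError("retry is only valid from the failed state")
    if chain_depth(runs, run_id) + 1 > MAX_RETRY_DEPTH:
        raise RuntimeError("retry chain would exceed the runtime cap")
    runs[new_run_id] = {"status": "queued", "previous_run_id": run_id}
    return runs[new_run_id]
```

The source run is left untouched, so the chain remains a readable history of each attempt.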
task_runs.failure_kind is task-run recovery metadata. In v1 its public value is
"operator_forced"; provider authentication failures are recorded on the session failure fields
and provider diagnostics instead of on task-run rows.
Every force operation emits canonical audit events:
| Event | When AGH emits it |
|---|---|
| task.run_released | A claimed run is force released back to the queue. |
| task.run_operator_forced_fail | A queued or claimed run is force failed with operator evidence. |
| task.run_operator_retry | A failed run creates a new queued retry linked to the source run. |
Agents may call these surfaces when [task.recovery].allow_agent_force = true. Set it to
false when only non-agent operator identities should perform recovery.
Scheduler and task pause
Pause controls stop new scheduler claims without freezing active ownership. In-flight runs keep heartbeating, completing, failing, releasing, and expiring through the normal lease recovery path. Use scheduler-wide pause when dispatch must stop globally, and task pause when one task or a task subtree must stop receiving new claims.
| Operation | CLI path | API route | Effect |
|---|---|---|---|
| Scheduler status | agh scheduler status | GET /api/scheduler | Shows pause state and queue pressure. |
| Scheduler pause | agh scheduler pause --reason "incident" | POST /api/scheduler/pause | Stops new dispatch and claim eligibility. |
| Scheduler resume | agh scheduler resume | POST /api/scheduler/resume | Re-enables new dispatch and claims. |
| Scheduler drain | agh scheduler drain --timeout 30s | POST /api/scheduler/drain | Pauses dispatch and waits for active claims. |
| Scheduler backlog | agh scheduler backlog --include-paused | GET /api/scheduler/backlog | Lists queued runs in scheduler order. |
| Task pause | agh task pause task-123 --reason "incident" | POST /api/tasks/{id}/pause | Stops new claims for that task subtree. |
| Task resume | agh task resume task-123 | POST /api/tasks/{id}/resume | Clears the direct task pause. |
Task pause is inherited by descendants through typed task columns. Backlog responses expose
effective_paused and paused_by_task_id so agents can tell whether a queued run is blocked by
its own task or by an ancestor. Scheduler backlog excludes paused tasks by default; pass
include_paused=true when diagnosing why a queued run is not claimable.
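The relationship between a direct pause and the two backlog fields can be sketched by walking a parent chain. This is an illustration only: AGH stores the inherited state in typed task columns rather than computing it this way, and the `paused` and `parent_id` keys, as well as the choice to report the nearest paused ancestor, are assumptions.

```python
from typing import Dict, Optional


def effective_pause(tasks: Dict[str, dict], task_id: str) -> dict:
    """Derive effective_paused and paused_by_task_id for one task."""
    current: Optional[str] = task_id
    seen = set()
    while current is not None and current not in seen:
        seen.add(current)                     # guard against cyclic data
        node = tasks[current]
        if node["paused"]:
            return {"effective_paused": True, "paused_by_task_id": current}
        current = node["parent_id"]
    return {"effective_paused": False, "paused_by_task_id": None}
```

When `paused_by_task_id` differs from the task's own id, resuming the task itself would not help; the ancestor is the one holding the pause.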
Scheduler drain returns a final JSON result over CLI, HTTP, and UDS. The v1 runtime does not expose a scheduler-drain SSE progress stream; clients should poll scheduler status or backlog separately when they need an independent progress view.
Every pause mutation records actor evidence and emits canonical events:
| Event | When AGH emits it |
|---|---|
| scheduler.paused | Scheduler dispatch was paused or drain was requested. |
| scheduler.resumed | Scheduler dispatch was resumed. |
| scheduler.drain_started | Drain requested a scheduler pause and began waiting. |
| scheduler.drain_completed | Drain reached zero active claims or timed out. |
| task.paused | A task was paused for future scheduler claims. |
| task.resumed | A task's direct pause was cleared. |
Task inspect diagnostics
Use inspect when an operator or agent needs a read-only triage snapshot before mutating a task run. The CLI auto-detects task and run IDs:
agh task inspect task-123 -o json
agh task inspect run-123 -o json
The same snapshot is available over HTTP and UDS:
| Target | Route |
|---|---|
| Task | GET /api/tasks/{id}/inspect |
| Run | GET /api/runs/{id}/inspect |
Inspect reads task runs, bound-session summary, scheduler pause state, and recent event summaries.
It does not read transcripts and it does not expose raw claim tokens. Claim evidence is limited to
the 8-character claim_token_hash_truncated field.
The response includes task, current_run, bound_session, recent_runs,
recent_events, scheduler, diagnostics, next_action, and as_of. The web task and
run detail surfaces render the same diagnostics card as the CLI/API payload.
| Diagnostic code | Meaning |
|---|---|
| task_run_stuck | A claimed run has a stale heartbeat and may need release. |
| task_run_orphan | A claimed run points at a missing or terminal session. |
| task_run_stranded | A queued run is old, the scheduler is active, and no eligible session is visible. |
| task_run_crashed | The latest run failed without a later retry. |
| task_run_stale_lease | The run still looks claimed after its lease deadline. |
next_action is a derived enum for agents: claim_available, waiting_for_session,
stranded, running, recovery_required, or terminal. It is guidance, not a writer; use
the task service commands to release, fail, retry, pause, or resume.
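An agent consuming the inspect snapshot could parse `next_action` into a typed enum before deciding anything. In this sketch the enum values come from the list above, while the shape of `diagnostics` (a list of objects with a `code` key) and the choice to flag `stranded` and `recovery_required` for operator attention are assumptions.

```python
from enum import Enum
from typing import List


class NextAction(str, Enum):
    CLAIM_AVAILABLE = "claim_available"
    WAITING_FOR_SESSION = "waiting_for_session"
    STRANDED = "stranded"
    RUNNING = "running"
    RECOVERY_REQUIRED = "recovery_required"
    TERMINAL = "terminal"


# Assumed policy: these two values suggest a human or recovery agent should look.
NEEDS_ATTENTION = {NextAction.STRANDED, NextAction.RECOVERY_REQUIRED}


def triage(snapshot: dict) -> dict:
    """Summarize an inspect snapshot without mutating anything."""
    action = NextAction(snapshot["next_action"])   # ValueError on unknown values
    codes: List[str] = [d["code"] for d in snapshot.get("diagnostics", [])]
    return {
        "action": action,
        "needs_attention": action in NEEDS_ATTENTION,
        "codes": codes,
    }
```

Because `next_action` is guidance rather than a writer, the result of `triage` should only select which task service command to consider next.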
Deterministic autonomy errors
Lease writers and the autonomy tool bridge return the same deterministic reason codes:
| Code | When it fires |
|---|---|
| AUTONOMY_SESSION_REQUIRED | The caller had no session scope; tool/CLI/HTTP/UDS reject the call before lease lookup. |
| AUTONOMY_NO_ACTIVE_LEASE | The session does not currently own a run lease; nothing to extend or finalize. |
| AUTONOMY_FOREIGN_RUN | The supplied run_id does not match the session's active lease. |
| AUTONOMY_LEASE_EXPIRED | The lookup found a stale or expired lease and refused to mutate it. |
| AUTONOMY_LEASE_ALREADY_HELD | claim_next was called while the session already owns an active lease. |
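Because the codes are deterministic, a client can map each one to a fixed reaction instead of parsing messages. The reactions below are one possible policy, offered as an assumption rather than AGH guidance; only the code names come from the table.

```python
from enum import Enum


class Reaction(Enum):
    FIX_CALLER = "fix_caller"          # the request itself is wrong
    STOP_WORK = "stop_work"            # ownership is gone; do not keep mutating
    FINISH_CURRENT = "finish_current"  # complete or release the held run first
    ESCALATE = "escalate"              # unknown code, surface to an operator


# Hypothetical client policy keyed by the documented reason codes.
REACTIONS = {
    "AUTONOMY_SESSION_REQUIRED": Reaction.FIX_CALLER,
    "AUTONOMY_FOREIGN_RUN": Reaction.FIX_CALLER,
    "AUTONOMY_NO_ACTIVE_LEASE": Reaction.STOP_WORK,
    "AUTONOMY_LEASE_EXPIRED": Reaction.STOP_WORK,
    "AUTONOMY_LEASE_ALREADY_HELD": Reaction.FINISH_CURRENT,
}


def react(code: str) -> Reaction:
    """Pick a reaction for a reason code, escalating anything unknown."""
    return REACTIONS.get(code, Reaction.ESCALATE)
```

Treating an expired lease as a stop signal matches the lease rule that a stale holder must not extend or overwrite a newer claim.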
Lease credentials and channels
Never send raw lease credentials through agh ch send, agh ch reply, agh__network_send,
network envelopes, logs, or memory. If another participant needs to know progress, send a
coordination message. If a session needs to prove ownership, it calls one of the agh__autonomy
tools or the matching agh task command for the owned run_id; AGH resolves the internal lease
server-side.
Related pages
- Coordination Channels explains channel metadata and message kinds.
- Task CLI Reference lists exact command flags for the parallel CLI surface.
- Tool CLI Reference lists the operator commands for inspecting the registry, including the agh__autonomy family.
- Hooks lists task-run hook events.