
Task Runs and Leases

How agents claim work, renew leases, finish runs, and release session-bound ownership safely.

Audience: Operators running durable agent work
Focus: Autonomy guidance shaped for scanability, day-two clarity, and operator context.

Task runs are the durable execution records for tasks. A run becomes claimable only after something enqueues it: publish, start, approval, UI start, automation approval, or an equivalent API call. The task service is the only authority for run ownership and terminal state.

Claiming the next run

A managed agent session claims work either through the dedicated autonomy tool family or the parallel CLI. Both routes call the same task service writers and obey the same session-bound contract.

Tool path:

agh__task_run_claim_next  { "wait": true, "lease_seconds": 300 }

CLI path:

agh task next --wait --lease-seconds 300 -o json

The claim is atomic. AGH selects one eligible queued run, binds it to the current managed session, sets a lease deadline, and returns a synchronous claim response that includes:

  • task summary
  • run summary
  • safe lease summary
  • claim_token_hash for observability
  • coordination channel metadata when the run has a bound channel
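
The atomic claim can be modeled in a few lines. The sketch below is an in-memory illustration in Python, not AGH's implementation: `RunStore`, `Run`, the lock, and the choice of SHA-256 are all assumptions made for the example. Only the observable behavior (one eligible queued run is selected, bound to the session, given a lease deadline, and reported with a hashed token) comes from this page. The one-active-lease-per-session rule is left out to keep the example short.

```python
import hashlib
import secrets
import threading
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    run_id: str
    state: str = "queued"  # queued -> claimed -> terminal
    session_id: Optional[str] = None
    lease_deadline: Optional[float] = None
    # Raw token stays inside the store; it is never returned to callers.
    _token: Optional[str] = field(default=None, repr=False)


class RunStore:
    """Hypothetical in-memory stand-in for the task service."""

    def __init__(self, runs):
        self._runs = list(runs)
        self._lock = threading.Lock()

    def claim_next(self, session_id, lease_seconds, now=None):
        now = time.time() if now is None else now
        with self._lock:  # selection and binding happen as one step
            for run in self._runs:
                if run.state == "queued":
                    run.state = "claimed"
                    run.session_id = session_id
                    run.lease_deadline = now + lease_seconds
                    run._token = secrets.token_hex(16)
                    digest = hashlib.sha256(run._token.encode()).hexdigest()
                    return {
                        "run_id": run.run_id,
                        "lease_deadline": run.lease_deadline,
                        "claim_token_hash": digest,  # hash only, for observability
                    }
        return None  # no eligible queued run
```

Because the select-and-bind step runs under one lock, two sessions calling `claim_next` at the same time can never receive the same run.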

The raw bearer lease token is internal to AGH. Public CLI, HTTP, UDS, native-tool, web, stream, log, channel, and memory payloads use the calling session plus run_id and expose at most claim_token_hash. Tools belonging to the agh__autonomy toolset reject any input or response field that would carry a raw claim token.

Lease rules

The MVP lease contract is intentionally narrow:

| Rule | Behavior |
| --- | --- |
| One owner | Exactly one managed session may own a non-terminal run lease. |
| One active lease per session | A managed session may hold at most one active task-run lease in the MVP. |
| Session fencing | Heartbeat, complete, fail, and release resolve the internal lease from the caller session and run_id. |
| Bounded renewal | --lease-seconds must be zero or positive and is capped by the task service. Omitted or zero uses the service default. |
| Expiry recovery | Expired leases are recovered by boot recovery or scheduler sweeps through the task service. |
| Stale holders fail | A stale heartbeat or late complete after recovery fails explicitly; it does not extend or overwrite a newer claim. |
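
The bounded-renewal rule amounts to a small normalization step. The following Python sketch shows one way to express it; the function name and the `default` and `cap` values are placeholders chosen for illustration, since this page does not state the service's actual default or cap.

```python
def effective_lease_seconds(requested, default=300, cap=3600):
    """Resolve a requested lease length.

    `default` and `cap` are placeholder values, not AGH's real limits.
    """
    if requested is None or requested == 0:
        return default  # omitted or zero uses the service default
    if requested < 0:
        raise ValueError("lease_seconds must be zero or positive")
    return min(requested, cap)  # the service caps long requests
```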

Heartbeat

Use heartbeat when work is still active and the session still owns the run.

Tool path:

agh__task_run_heartbeat  { "run_id": "run-123", "lease_seconds": 300 }

CLI path:

agh task heartbeat run-123 --lease-seconds 300

Heartbeats are task-service operations. Do not mirror routine heartbeats into coordination channel messages unless a human-readable update is useful.

Completing a run

Use complete for successful terminal state.

Tool path:

agh__task_run_complete  { "run_id": "run-123", "result": { "summary": "tests passed" } }

CLI path:

agh task complete run-123 \
  --result '{"summary":"tests passed"}'

The optional result JSON must not contain raw lease credentials.

Auto-enqueue on ready

Dependency-driven auto-enqueue is opt-in per task. When a task carries auto_enqueue_on_ready, AGH enqueues its next run automatically as soon as a blocking dependency completes and the task becomes ready; no manual enqueue is required to advance the DAG.

Set it at create time, or toggle it later on an assembled tree:

agh task create --scope global --title "Deploy" --auto-enqueue-on-ready
agh task update task-123 --auto-enqueue-on-ready          # turn on
agh task update task-123 --auto-enqueue-on-ready=false    # turn off

The behavior is deliberately conservative:

  • Readiness gates it. Only a successful completion reconciles dependents to ready; a failed or expired blocker does not satisfy a blocks edge, so it never triggers a premature enqueue.
  • Paused dependents are skipped. An effectively paused dependent is left untouched until it resumes.
  • At most one open run. Enqueue reuses the canonical task_runs path. The store's queued-run reservation rejects a second open run, so concurrent completions of different blockers, or a retried completion, converge on exactly one queued run and never produce duplicates.
  • Completion never rolls back. Auto-enqueue runs after the completion has durably committed and is best-effort: it survives request cancellation, and a failed enqueue is logged rather than propagated back to the completing caller.
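
The gating and the at-most-one-open-run rule can be sketched as follows. This is an illustrative Python model, not AGH's store: `TaskRuns`, `on_blocker_completed`, the dictionary field names, and the assumption that "open" means queued or claimed are all hypothetical.

```python
import threading

OPEN_STATES = {"queued", "claimed"}  # assumption: what counts as an open run


class TaskRuns:
    """Hypothetical store with a one-open-run-per-task reservation."""

    def __init__(self):
        self._lock = threading.Lock()
        self.runs = []

    def enqueue(self, task_id):
        """Return (run, created); an existing open run is reused."""
        with self._lock:
            for run in self.runs:
                if run["task_id"] == task_id and run["state"] in OPEN_STATES:
                    return run, False
            run = {
                "run_id": f"run-{len(self.runs) + 1}",
                "task_id": task_id,
                "state": "queued",
            }
            self.runs.append(run)
            return run, True


def on_blocker_completed(store, dependent, blocker_states):
    """Call only after the blocker's completion has committed."""
    if not dependent["auto_enqueue_on_ready"] or dependent["effective_paused"]:
        return None
    if any(state != "completed" for state in blocker_states):
        return None  # failed, expired, or unfinished blockers do not count
    run, _created = store.enqueue(dependent["task_id"])
    return run
```

Calling `on_blocker_completed` twice for the same dependent returns the same queued run both times, which is the convergence behavior described in the list.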

Failing a claimed run

Use the session-bound fail path when the current claim cannot complete successfully. This path is token-fenced server-side: the caller supplies a run_id, and AGH resolves the managed session's active lease before mutating the run.

Tool path:

agh__task_run_fail  { "run_id": "run-123", "error": "tests failed", "metadata": { "command": "make test" } }

CLI path:

agh task run fail run-123 \
  --error "tests failed" \
  --metadata '{"command":"make test"}'

Failure metadata must not contain raw lease credentials.

Releasing a claimed run

Use session-bound release when the current managed session should give up ownership without making the run terminal. The autonomy tool resolves the internal lease from the caller session and run_id; raw claim tokens never cross the public surface.

Tool path:

agh__task_run_release  { "run_id": "run-123", "reason": "handoff" }

Release is also used by daemon-owned cleanup. For example, if a spawned child reaches TTL or its parent stops, AGH releases active child leases with structured reasons such as ttl_expired or parent_stopped before stopping the child session.

Force operations

Use force operations when a run needs operator recovery and the normal session-bound lease path is not available or should not be trusted. Force operations still mutate task_runs only through task.Service; they do not accept raw claim tokens, and they apply compare-and-swap state preconditions before committing.

| Operation | CLI path | API route | Valid source state |
| --- | --- | --- | --- |
| Force release | agh task release run-123 --reason handoff | POST /api/runs/{id}/release | claimed |
| Force fail | agh task fail run-123 --reason "recovery" | POST /api/runs/{id}/fail | queued or claimed |
| Retry | agh task retry run-123 | POST /api/runs/{id}/retry | failed |
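
A compare-and-swap precondition of this kind might look like the Python sketch below. The valid source states are taken from the table; the function name, the status strings, and the dictionary-based run are assumptions for illustration only.

```python
VALID_SOURCE_STATES = {
    "force_release": {"claimed"},
    "force_fail": {"queued", "claimed"},
    "retry": {"failed"},
}

# Retry creates a new run instead of moving the source run,
# so only release and fail have a target state here.
TARGET_STATE = {"force_release": "queued", "force_fail": "failed"}


def force_transition(run, operation, observed_state):
    """Commit only if the run is still in the state the caller observed."""
    if operation not in TARGET_STATE:
        raise ValueError(f"unsupported operation: {operation}")
    if observed_state not in VALID_SOURCE_STATES[operation]:
        return "invalid_source_state"
    if run["state"] != observed_state:
        return "conflict"  # another writer changed the run first
    run["state"] = TARGET_STATE[operation]
    return "committed"
```

The second check is what makes the operation safe to repeat: a duplicate force release finds the run already queued and reports a conflict instead of mutating it again.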

Bulk force release and force fail use bounded batches of run IDs:

agh task release run-123 run-456 --reason handoff -o json
agh task fail run-123 run-456 --reason "provider credentials revoked" -o json

The HTTP and UDS bulk routes are POST /api/runs/bulk/release and POST /api/runs/bulk/fail. Bulk responses report one row per run so an agent can retry only the failed rows.
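
Because each run gets its own row, a caller can collect just the runs that need another attempt. The Python sketch below assumes a response shape with a `results` list whose rows carry `run_id`, `ok`, and `error`; that shape is hypothetical and should be replaced with the real payload fields.

```python
import json


def failed_run_ids(bulk_response_json):
    """Return the run IDs whose row reports a failure.

    Assumes a hypothetical shape:
    {"results": [{"run_id": "...", "ok": true|false, "error": "..."}]}
    """
    rows = json.loads(bulk_response_json)["results"]
    return [row["run_id"] for row in rows if not row["ok"]]
```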

Force fail requires a non-empty reason and records failure_kind = "operator_forced" on the run. Retry creates a new queued run linked through previous_run_id and refuses chains deeper than the runtime cap. Force release and force fail invalidate queued input generations for the previously bound session when that session exists, so stale prompts cannot be delivered after a recovery action.
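
The retry chain rule can be illustrated by walking previous_run_id links. In this Python sketch the cap of 3, the function names, and the dictionary layout are assumptions; the page only states that a runtime cap exists.

```python
def chain_depth(runs_by_id, run_id):
    """Count how many previous_run_id links precede this run."""
    depth = 0
    current = runs_by_id[run_id]
    while current.get("previous_run_id"):
        depth += 1
        current = runs_by_id[current["previous_run_id"]]
    return depth


def can_retry(runs_by_id, run_id, max_depth=3):
    """max_depth is a placeholder; the real runtime cap is not documented here."""
    source = runs_by_id[run_id]
    if source["state"] != "failed":
        return False
    return chain_depth(runs_by_id, run_id) + 1 <= max_depth


def retry(runs_by_id, run_id, new_run_id, max_depth=3):
    """Create a queued run linked to the failed source run."""
    if not can_retry(runs_by_id, run_id, max_depth):
        raise ValueError("retry refused: source not failed or chain too deep")
    new_run = {"run_id": new_run_id, "state": "queued", "previous_run_id": run_id}
    runs_by_id[new_run_id] = new_run
    return new_run
```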

task_runs.failure_kind is task-run recovery metadata. In v1 its public value is "operator_forced"; provider authentication failures are recorded on the session failure fields and provider diagnostics instead of on task-run rows.

Every force operation emits canonical audit events:

| Event | When AGH emits it |
| --- | --- |
| task.run_released | A claimed run is force released back to the queue. |
| task.run_operator_forced_fail | A queued or claimed run is force failed with operator evidence. |
| task.run_operator_retry | A failed run creates a new queued retry linked to the source run. |

Agents may call these surfaces when [task.recovery].allow_agent_force = true. Set it to false when only non-agent operator identities should perform recovery.

Scheduler and task pause

Pause controls stop new scheduler claims without freezing active ownership. In-flight runs keep heartbeating, completing, failing, releasing, and expiring through the normal lease recovery path. Use scheduler-wide pause when dispatch must stop globally, and task pause when one task or a task subtree must stop receiving new claims.

| Operation | CLI path | API route | Effect |
| --- | --- | --- | --- |
| Scheduler status | agh scheduler status | GET /api/scheduler | Shows pause state and queue pressure. |
| Scheduler pause | agh scheduler pause --reason "incident" | POST /api/scheduler/pause | Stops new dispatch and claim eligibility. |
| Scheduler resume | agh scheduler resume | POST /api/scheduler/resume | Re-enables new dispatch and claims. |
| Scheduler drain | agh scheduler drain --timeout 30s | POST /api/scheduler/drain | Pauses dispatch and waits for active claims. |
| Scheduler backlog | agh scheduler backlog --include-paused | GET /api/scheduler/backlog | Lists queued runs in scheduler order. |
| Task pause | agh task pause task-123 --reason "incident" | POST /api/tasks/{id}/pause | Stops new claims for that task subtree. |
| Task resume | agh task resume task-123 | POST /api/tasks/{id}/resume | Clears the direct task pause. |

Task pause is inherited by descendants through typed task columns. Backlog responses expose effective_paused and paused_by_task_id so agents can tell whether a queued run is blocked by its own task or by an ancestor. Scheduler backlog excludes paused tasks by default; pass include_paused=true when diagnosing why a queued run is not claimable.
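
The relationship between a direct pause and the two backlog fields can be shown with a small Python sketch. It walks parent links at read time purely for illustration; AGH stores the inherited state in typed task columns, and the `paused` and `parent_id` field names here are assumptions.

```python
def effective_pause(tasks, task_id):
    """Return (effective_paused, paused_by_task_id) for a task.

    Walks from the task up through its ancestors and reports the
    nearest task that carries a direct pause.
    """
    current = task_id
    while current is not None:
        task = tasks[current]
        if task["paused"]:
            return True, current
        current = task["parent_id"]
    return False, None
```

A queued run whose task reports its own ID is blocked by a direct pause; any other ID points at the ancestor to resume.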

Scheduler drain returns a final JSON result over CLI, HTTP, and UDS. The v1 runtime does not expose a scheduler-drain SSE progress stream; clients should poll scheduler status or backlog separately when they need an independent progress view.

Every pause mutation records actor evidence and emits canonical events:

| Event | When AGH emits it |
| --- | --- |
| scheduler.paused | Scheduler dispatch was paused or drain was requested. |
| scheduler.resumed | Scheduler dispatch was resumed. |
| scheduler.drain_started | Drain requested a scheduler pause and began waiting. |
| scheduler.drain_completed | Drain reached zero active claims or timed out. |
| task.paused | A task was paused for future scheduler claims. |
| task.resumed | A task's direct pause was cleared. |

Task inspect diagnostics

Use inspect when an operator or agent needs a read-only triage snapshot before mutating a task run. The CLI auto-detects task and run IDs:

agh task inspect task-123 -o json
agh task inspect run-123 -o json

The same snapshot is available over HTTP and UDS:

| Target | Route |
| --- | --- |
| Task | GET /api/tasks/{id}/inspect |
| Run | GET /api/runs/{id}/inspect |

Inspect reads task runs, bound-session summary, scheduler pause state, and recent event summaries. It does not read transcripts and it does not expose raw claim tokens. Claim evidence is limited to the 8-character claim_token_hash_truncated field.

The response includes task, current_run, bound_session, recent_runs, recent_events, scheduler, diagnostics, next_action, and as_of. The web task and run detail surfaces render the same diagnostics card as the CLI/API payload.

| Diagnostic code | Meaning |
| --- | --- |
| task_run_stuck | A claimed run has a stale heartbeat and may need release. |
| task_run_orphan | A claimed run points at a missing or terminal session. |
| task_run_stranded | A queued run is old, the scheduler is active, and no eligible session is visible. |
| task_run_crashed | The latest run failed without a later retry. |
| task_run_stale_lease | The run still looks claimed after its lease deadline. |
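
To make the table concrete, here is a rough Python sketch of how such codes could be derived from a run snapshot. The thresholds (`heartbeat_stale_after`, `queued_old_after`), the field names, and the function itself are assumptions; AGH's real derivation and cutoffs are not documented on this page.

```python
def diagnose(run, *, now, session_state, scheduler_paused, eligible_sessions,
             has_later_retry, heartbeat_stale_after=120.0, queued_old_after=600.0):
    """Return diagnostic codes for one run snapshot (illustrative only).

    session_state is None when the bound session is missing.
    """
    codes = []
    if run["state"] == "claimed":
        if now - run["last_heartbeat_at"] > heartbeat_stale_after:
            codes.append("task_run_stuck")
        if session_state in (None, "terminal"):
            codes.append("task_run_orphan")
        if now > run["lease_deadline"]:
            codes.append("task_run_stale_lease")
    elif run["state"] == "queued":
        old = now - run["queued_at"] > queued_old_after
        if old and not scheduler_paused and eligible_sessions == 0:
            codes.append("task_run_stranded")
    elif run["state"] == "failed" and not has_later_retry:
        codes.append("task_run_crashed")
    return codes
```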

next_action is a derived enum for agents: claim_available, waiting_for_session, stranded, running, recovery_required, or terminal. It is guidance, not a writer; use the task service commands to release, fail, retry, pause, or resume.

Deterministic autonomy errors

Lease writers and the autonomy tool bridge return the same deterministic reason codes:

| Code | When it fires |
| --- | --- |
| AUTONOMY_SESSION_REQUIRED | The caller had no session scope; tool/CLI/HTTP/UDS reject the call before lease lookup. |
| AUTONOMY_NO_ACTIVE_LEASE | The session does not currently own a run lease; nothing to extend or finalize. |
| AUTONOMY_FOREIGN_RUN | The supplied run_id does not match the session's active lease. |
| AUTONOMY_LEASE_EXPIRED | The lookup found a stale or expired lease and refused to mutate it. |
| AUTONOMY_LEASE_ALREADY_HELD | claim_next was called while the session already owns an active lease. |
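
One way to picture how these codes arise is a lookup that resolves the lease from the caller session and run_id before any mutation. The Python sketch below is a model under assumptions: the order of checks, the `(lease, code)` return convention, and the in-memory `leases_by_session` map are illustrative, not AGH's actual writer.

```python
def resolve_lease(leases_by_session, session_id, run_id, now):
    """Return (lease, None) on success or (None, reason_code) on refusal."""
    if not session_id:
        return None, "AUTONOMY_SESSION_REQUIRED"
    lease = leases_by_session.get(session_id)
    if lease is None:
        return None, "AUTONOMY_NO_ACTIVE_LEASE"
    if lease["run_id"] != run_id:
        return None, "AUTONOMY_FOREIGN_RUN"
    if now >= lease["deadline"]:
        return None, "AUTONOMY_LEASE_EXPIRED"
    return lease, None


def heartbeat(leases_by_session, session_id, run_id, lease_seconds, now):
    """Extend the deadline only when the caller still owns a live lease."""
    lease, code = resolve_lease(leases_by_session, session_id, run_id, now)
    if code is not None:
        return code  # a stale holder fails explicitly and extends nothing
    lease["deadline"] = now + lease_seconds
    return "ok"


def claim_guard(leases_by_session, session_id):
    """Reject claim_next when the session already holds a lease."""
    if not session_id:
        return "AUTONOMY_SESSION_REQUIRED"
    if session_id in leases_by_session:
        return "AUTONOMY_LEASE_ALREADY_HELD"
    return None
```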

Lease credentials and channels

Never send raw lease credentials through agh ch send, agh ch reply, agh__network_send, network envelopes, logs, or memory. If another participant needs to know progress, send a coordination message. If a session needs to prove ownership, it calls one of the agh__autonomy tools or the matching agh task command for the owned run_id; AGH resolves the internal lease server-side.
