Edge Fleet Operations

Run tools across many edge hosts — rolling restarts, ad-hoc commands, redeploys — read streamed per-node results, and cancel runs in flight.

This guide is the operational counterpart to the fleet operations concept page. Where that page describes the model, this page walks through the actual commands you run, the output you read, and the patterns that work in practice.

By the end you'll know how to:

  1. Run an ad-hoc command across many hosts.
  2. Roll a configuration change in waves with halt-on-failure.
  3. Read streaming per-node results.
  4. Cancel a runaway run.
  5. Inspect run history.

Prerequisites

  • At least two enrolled edge hosts in your tenant. See edge host setup.
  • Tags or groups defined for the targeting you want. See tag and group management.
  • A user with permission to call the system-tier aegis.edge.fleet.* MCP tools (operator or tenant-admin).

The shape of a fleet run command

aegis edge fleet run \
  --target <expr> \
  --tool <tool-name> [--arg key=value]... \
  [--mode parallel|sequential|rolling=N] \
  [--max-concurrency N] \
  [--on-error fail-fast|continue|stop-after=N] \
  [--require-min N] \
  [--deadline 60s]

Per-node results stream live; when every per-node call has terminated, a final summary is printed.
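If you drive fleet runs from scripts, it can help to assemble the argument list programmatically rather than by string concatenation. The following is a minimal sketch, not part of the Aegis tooling: build_fleet_run_argv is a hypothetical helper name, and it emits only the flags shown in the synopsis above.

```python
from typing import List, Mapping, Optional


def build_fleet_run_argv(
    target: str,
    tool: str,
    args: Optional[Mapping[str, str]] = None,
    mode: Optional[str] = None,
    max_concurrency: Optional[int] = None,
    on_error: Optional[str] = None,
    require_min: Optional[int] = None,
    deadline: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `aegis edge fleet run` (hypothetical helper)."""
    argv = ["aegis", "edge", "fleet", "run", "--target", target, "--tool", tool]
    # Each tool argument becomes its own `--arg key=value` pair.
    for key, value in (args or {}).items():
        argv += ["--arg", f"{key}={value}"]
    # Optional flags are emitted only when a value was supplied.
    optional = [
        ("--mode", mode),
        ("--max-concurrency", max_concurrency),
        ("--on-error", on_error),
        ("--require-min", require_min),
        ("--deadline", deadline),
    ]
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv
```

Because the result is a list, it can be handed to subprocess.run(argv) directly, which avoids shell-quoting problems with values such as cmd="uname -r".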

<expr> — target shorthand

Form                               Meaning
@<node-id>                         Single node.
group:<name>                       Saved group.
tags=a,b labels=k=v tools=docker   Ad-hoc selector.
all                                Every Connected edge of your tenant.
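To make the four forms concrete, here is a rough sketch of how a script might classify a target expression before passing it on. It is an illustration only: classify_target is a hypothetical name, and the real grammar accepted by the CLI may be richer than what the table shows (this sketch assumes only the tags, labels, and tools keys exist).

```python
def classify_target(expr: str) -> dict:
    """Classify a target shorthand into one of the four documented forms."""
    expr = expr.strip()
    if expr == "all":
        return {"kind": "all"}
    if expr.startswith("@"):
        return {"kind": "node", "node_id": expr[1:]}
    if expr.startswith("group:"):
        return {"kind": "group", "name": expr[len("group:"):]}
    # Otherwise treat it as an ad-hoc selector: whitespace-separated
    # key=value clauses, where the value is a comma-separated list.
    selector: dict = {}
    for clause in expr.split():
        key, sep, value = clause.partition("=")
        if not sep or key not in ("tags", "labels", "tools"):
            raise ValueError(f"unrecognized selector clause: {clause!r}")
        selector.setdefault(key, []).extend(value.split(","))
    return {"kind": "selector", "selector": selector}
```

Note that partition splits on the first "=" only, so labels=k=v keeps k=v intact as a single label.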

Recipe 1: ad-hoc parallel command

You want to know which kernel every Linux host in your fleet is running. Read-only, idempotent — parallel and continue-on-error are the right defaults.

aegis edge fleet run \
  --target tags=linux \
  --tool cmd.run --arg cmd="uname -r" \
  --mode parallel \
  --on-error continue \
  --deadline 10s

Streamed output (each line tagged with the originating node):

[n-7a3b2f workstation-east] 6.8.0-31-generic
[n-1c8d4e workstation-west] 6.8.0-31-generic
[n-9f2a31 db-mirror-1]      6.5.0-15-generic
[n-4e7c80 db-mirror-2]      6.5.0-15-generic
[n-3b1d9f bastion]          6.1.0-18-amd64

✔ Fleet run a3f4...c8d2 complete
  ok=5  err=0  timed_out=0
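If you capture that streamed output to a file, a few lines of Python can answer the original question ("which kernels are in use, and where?"). This is a sketch that assumes the line layout shown above, with node IDs starting with n-; group_by_output is a hypothetical helper, not an Aegis API.

```python
import re
from collections import defaultdict

# Matches lines like "[n-7a3b2f workstation-east] 6.8.0-31-generic".
# Assumption: node IDs start with "n-", as in the sample output.
LINE_RE = re.compile(r"^\[(?P<node_id>n-\S+) (?P<host>[^\]]+)\]\s*(?P<output>.*)$")


def group_by_output(stream_text: str) -> dict:
    """Map each distinct output line to the hosts that produced it."""
    groups = defaultdict(list)
    for line in stream_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # summary and blank lines do not match and are skipped
            groups[match.group("output").strip()].append(match.group("host"))
    return dict(groups)


SAMPLE = """\
[n-7a3b2f workstation-east] 6.8.0-31-generic
[n-1c8d4e workstation-west] 6.8.0-31-generic
[n-9f2a31 db-mirror-1]      6.5.0-15-generic
[n-4e7c80 db-mirror-2]      6.5.0-15-generic
[n-3b1d9f bastion]          6.1.0-18-amd64

✔ Fleet run a3f4...c8d2 complete
  ok=5  err=0  timed_out=0
"""

for kernel, hosts in group_by_output(SAMPLE).items():
    print(kernel, "->", ", ".join(hosts))
```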

Recipe 2: rolling restart with halt-on-first-failure

You want to restart nginx across the web tier, in waves of five, and stop immediately if any host reports a failure.

aegis edge fleet run \
  --target group:web-tier \
  --tool service.restart --arg name=nginx \
  --mode rolling=5 \
  --on-error stop-after=1 \
  --deadline 30s

Streamed output:

[wave 1] starting 5 of 12 nodes
  [n-7a3b2f web-east-1]   restarting nginx... ok (exit=0, 1.2s)
  [n-1c8d4e web-east-2]   restarting nginx... ok (exit=0, 1.4s)
  [n-9f2a31 web-east-3]   restarting nginx... ok (exit=0, 1.1s)
  [n-4e7c80 web-east-4]   restarting nginx... ok (exit=0, 1.5s)
  [n-3b1d9f web-east-5]   restarting nginx... ok (exit=0, 1.3s)
[wave 2] starting 5 of 12 nodes
  [n-2a8c14 web-west-1]   restarting nginx... ok (exit=0, 1.2s)
  [n-5e1f93 web-west-2]   restarting nginx... ERR (exit=3, 0.4s) "service file not found"
✖ stop-after threshold reached (1/1); cancelling in-flight, halting waves

✔ Fleet run b8e1...d4a7 halted
  ok=6  err=1  cancelled=2  not_started=3

The summary makes the halt reason and the per-node breakdown explicit. cancelled=2 counts the wave 2 calls that were still in flight when the threshold tripped; not_started=3 counts the nodes that were never dispatched.
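The interaction between rolling=N and stop-after=N can be modelled in a few lines. The sketch below is a deliberately simplified model, not the dispatcher's actual algorithm: it lets each wave run to completion before checking the threshold, so it never produces a cancelled count the way the real in-flight cancellation does. The function names and the outcomes mapping are hypothetical.

```python
from typing import Dict, List


def plan_waves(nodes: List[str], size: int) -> List[List[str]]:
    """Split the target list into consecutive waves of at most `size` nodes."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def simulate_rolling(
    nodes: List[str], size: int, stop_after: int, outcomes: Dict[str, bool]
) -> Dict[str, int]:
    """Tally a rolling run; `outcomes` maps node -> True if its call succeeds.

    Simplification: a wave always finishes before the error threshold is
    checked, so nothing is ever counted as cancelled in this model.
    """
    tally = {"ok": 0, "err": 0, "not_started": 0}
    waves = plan_waves(nodes, size)
    for index, wave in enumerate(waves):
        for node in wave:
            tally["ok" if outcomes[node] else "err"] += 1
        if tally["err"] >= stop_after:
            # Halt: every node in the remaining waves is never dispatched.
            tally["not_started"] = sum(len(w) for w in waves[index + 1:])
            break
    return tally
```

With 12 nodes, size=5, stop_after=1, and a single failure in the second wave, this model halts after wave 2 and leaves the last two nodes untouched.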


Recipe 3: refuse to dispatch unless N hosts match

For safety-critical operations you may want a hard floor: "if fewer than 3 hosts match, don't run at all."

aegis edge fleet run \
  --target tags=db,prod \
  --tool cmd.run --arg cmd="systemctl status postgres" \
  --require-min 3

If only 2 hosts match, the dispatch is refused up front and no per-node call is made:

✖ require-min not satisfied: matched 2, required 3
  Resolved nodes:
    n-9f2a31  db-mirror-1   tags=[prod,db]   Connected
    n-4e7c80  db-mirror-2   tags=[prod,db]   Connected

Recipe 4: preview before destructive fan-out

Before running anything destructive, verify the resolved target set:

aegis edge fleet preview --target tags=prod

Resolved 8 nodes (skipped: 1)
  ✓ n-7a3b2f  web-east-1     linux/x86_64  Connected     tags=[prod,web]
  ✓ n-1c8d4e  web-east-2     linux/x86_64  Connected     tags=[prod,web]
  ...
  ⊗ n-9f2a31  db-mirror-1    linux/x86_64  Disconnected  tags=[prod,db]

Disconnected hosts are listed under skipped with their reason — they're visible to the operator but won't receive the call.

The preview is also available in Zaru's fleet launcher modal as the Selector Preview Panel (it counts hosts as you build the selector) and as the system-tier MCP tool aegis.edge.fleet.list.
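For scripted safety checks, the text preview can be split into the hosts that will receive the call and the hosts that will be skipped. The parser below is a sketch that assumes the exact column layout shown in the sample output above (marker, node ID, hostname, platform, state, tags); parse_preview is a hypothetical helper, and a real integration would more likely call the MCP tool than scrape text.

```python
from typing import Dict, List, Tuple


def parse_preview(text: str) -> Tuple[List[Dict], List[Dict]]:
    """Return (resolved, skipped) host entries from preview output."""
    resolved: List[Dict] = []
    skipped: List[Dict] = []
    for line in text.splitlines():
        parts = line.split()
        # Host rows have six whitespace-separated columns and a marker.
        if len(parts) != 6 or parts[0] not in ("✓", "⊗"):
            continue
        marker, node_id, host, platform, state, tags = parts
        if not (tags.startswith("tags=[") and tags.endswith("]")):
            continue
        entry = {
            "node_id": node_id,
            "host": host,
            "platform": platform,
            "state": state,
            "tags": tags[len("tags=["):-1].split(","),
        }
        (resolved if marker == "✓" else skipped).append(entry)
    return resolved, skipped


SAMPLE = """\
Resolved 8 nodes (skipped: 1)
  ✓ n-7a3b2f  web-east-1     linux/x86_64  Connected     tags=[prod,web]
  ✓ n-1c8d4e  web-east-2     linux/x86_64  Connected     tags=[prod,web]
  ...
  ⊗ n-9f2a31  db-mirror-1    linux/x86_64  Disconnected  tags=[prod,db]
"""

resolved, skipped = parse_preview(SAMPLE)
for entry in skipped:
    print("skipped:", entry["host"], entry["state"])
```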


Recipe 5: cancel a runaway run

If a fleet run is taking too long, or you realize it's misconfigured, cancel it by its fleet_command_id:

aegis edge fleet cancel a3f4...c8d2

The dispatcher broadcasts Cancel to every in-flight per-node command. Any wrapped tool that respects context cancellation halts; native external processes get a SIGTERM. Already-completed nodes are unaffected.

The same operation is exposed as the system-tier MCP tool aegis.edge.fleet.cancel and as a Cancel button in Zaru's live run view.
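If you write tools that wrap external processes, it is worth knowing what a SIGTERM looks like from the parent's side. The snippet below is a generic POSIX illustration using Python's standard subprocess module, not Aegis code: it starts a long-running child, terminates it, and inspects the exit status. terminate_child is a hypothetical name.

```python
import signal
import subprocess
import sys


def terminate_child(timeout: float = 5.0) -> int:
    """Start a sleeping child process, send it SIGTERM, return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    child.terminate()  # on POSIX this sends SIGTERM to the child
    return child.wait(timeout=timeout)


if __name__ == "__main__":
    status = terminate_child()
    # On POSIX, a negative return code -N means "killed by signal N".
    print("terminated by SIGTERM:", status == -signal.SIGTERM)
```

A process that needs to clean up (flush files, release locks) should install its own SIGTERM handler; without one, the default action ends the process immediately.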


Recipe 6: redeploy a binary in waves

You've built a new binary, copied it to a known location on every host, and want to swap it in:

# Phase 1 (canary): validate the new binary on a smoke-test host.
aegis edge fleet run \
  --target @n-smoke-test-host \
  --tool cmd.run --arg cmd="/opt/myapp/bin/new --version" \
  --deadline 5s

# Phase 2 (fleet): roll across the fleet, 3 at a time, halt on first failure.
aegis edge fleet run \
  --target group:myapp-fleet \
  --tool myapp.swap --arg version=2.4.0 \
  --mode rolling=3 \
  --on-error stop-after=1 \
  --deadline 30s

Two-phase rollouts (canary → fleet) become a habit, not a special case.
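One way to make the canary gate automatic is to wrap both commands in a small script that only starts the fleet phase if the canary succeeds. This sketch rests on an assumption the guide does not state: that the aegis CLI exits non-zero when a run fails or halts. Verify that before relying on it. two_phase_rollout and run_cli are hypothetical names.

```python
import subprocess
from typing import Callable, Sequence

CANARY = [
    "aegis", "edge", "fleet", "run",
    "--target", "@n-smoke-test-host",
    "--tool", "cmd.run", "--arg", "cmd=/opt/myapp/bin/new --version",
    "--deadline", "5s",
]
FLEET = [
    "aegis", "edge", "fleet", "run",
    "--target", "group:myapp-fleet",
    "--tool", "myapp.swap", "--arg", "version=2.4.0",
    "--mode", "rolling=3",
    "--on-error", "stop-after=1",
    "--deadline", "30s",
]


def run_cli(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def two_phase_rollout(run: Callable[[Sequence[str]], int] = run_cli) -> str:
    """Run the canary, then the fleet phase only if the canary exits 0.

    Assumption: a non-zero exit code signals a failed or halted run.
    """
    if run(CANARY) != 0:
        return "canary-failed"
    if run(FLEET) != 0:
        return "fleet-halted"
    return "complete"
```

The runner is injectable so the gating logic can be exercised with a fake before it is pointed at a real fleet.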


Reading the live run view in Zaru

When a fleet run starts in Zaru, you get a per-node grid:

┌─────────────────────────────────────────────────────────────┐
│  Fleet Run a3f4...c8d2                                       │
│  Tool: service.restart  Args: name=nginx                    │
│  Mode: rolling=5  On error: stop-after=1                    │
│  Status: running    [Cancel]                                │
├─────────────────────────────────────────────────────────────┤
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐     │
│  │ web-east-1    │ │ web-east-2    │ │ web-east-3    │     │
│  │ ✔ ok          │ │ ✔ ok          │ │ ✔ ok          │     │
│  │ exit=0  1.2s  │ │ exit=0  1.4s  │ │ exit=0  1.1s  │     │
│  └───────────────┘ └───────────────┘ └───────────────┘     │
│  ┌───────────────┐ ┌───────────────┐                       │
│  │ web-east-4    │ │ web-east-5    │                       │
│  │ ⏳ running    │ │ ⏳ running    │                       │
│  │ ┃ stdout...   │ │ ┃ stdout...   │                       │
│  └───────────────┘ └───────────────┘                       │
└─────────────────────────────────────────────────────────────┘

Each cell shows status, exit code, runtime, and a tail of stdout/stderr. Click a cell to expand its full output stream.


Inspecting run history

aegis edge fleet runs --output table

ID         STARTED                TOOL              TARGET            MODE       STATUS    ok/err/skipped
a3f4...c8  2026-04-28 14:32:11Z   service.restart   group:web-tier    rolling=5  halted    6/1/3
b8e1...d4  2026-04-28 11:08:43Z   cmd.run           tags=linux        parallel   complete  12/0/0

In Zaru, Vault → Edge Hosts → Fleet Runs lists every run with the same fields and links into the live (or archived) per-node view.
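For quick reporting, the table output can be parsed into records. The sketch below assumes the column layout shown above, with columns separated by two or more spaces (which keeps the single-space timestamp intact); parse_runs_table is a hypothetical helper.

```python
import re
from typing import Dict, List

COLUMN_GAP = re.compile(r"\s{2,}")  # columns are separated by 2+ spaces


def parse_runs_table(text: str) -> List[Dict]:
    """Parse the runs table into dicts keyed by the header names."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = COLUMN_GAP.split(lines[0].strip())
    runs = []
    for line in lines[1:]:
        cells = COLUMN_GAP.split(line.strip())
        if len(cells) != len(header):
            continue  # skip rows that do not fit the header layout
        row = dict(zip(header, cells))
        # Split the combined "ok/err/skipped" column into integer counts.
        ok, err, skipped = (int(n) for n in row.pop("ok/err/skipped").split("/"))
        row.update(ok=ok, err=err, skipped=skipped)
        runs.append(row)
    return runs


SAMPLE = """\
ID         STARTED                TOOL              TARGET            MODE       STATUS    ok/err/skipped
a3f4...c8  2026-04-28 14:32:11Z   service.restart   group:web-tier    rolling=5  halted    6/1/3
b8e1...d4  2026-04-28 11:08:43Z   cmd.run           tags=linux        parallel   complete  12/0/0
"""

for run in parse_runs_table(SAMPLE):
    if run["err"] > 0:
        print(run["ID"], run["TOOL"], run["STATUS"])
```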


Anti-patterns

❌ Don't                                                     ✅ Do
--target all --on-error continue for state-mutating tools   Use a more specific target; prefer --on-error fail-fast or --on-error stop-after=N.
Skip --require-min for safety-critical operations           Set a floor that matches your contract.
Run destructive fan-outs without preview                    Run aegis edge fleet preview first; check the resolved set.
Long deadlines on rolling deploys                           Use short per-target deadlines to force fast failures and tighter waves.
Reuse a single tag for unrelated meanings                   Pick tag axes; see tag conventions.
