Aegis Orchestrator
Architecture

Execution Engine

Deep dive into the ExecutionSupervisor outer loop, InnerLoopService, and the Aegis Dispatch Protocol.

The AEGIS execution engine is the subsystem that runs an agent against a task and iteratively refines the output until it passes validation. It consists of two nested loops:

  • Outer loop (ExecutionSupervisor): Manages the full execution lifecycle across up to max_iterations attempts.
  • Inner loop (InnerLoopService): Runs inside each iteration — drives the LLM conversation and intercepts tool calls until the model produces a final (non-tool-call) response.

Domain Model

/// Maximum swarm nesting depth; enforced at spawn time.
pub const MAX_RECURSIVE_DEPTH: u8 = 3;

pub struct Execution {
    pub id: ExecutionId,
    pub agent_id: AgentId,
    pub status: ExecutionStatus,
    pub iterations: Vec<Iteration>,
    pub max_iterations: u8,              // default: 10
    pub input: ExecutionInput,
    pub started_at: DateTime<Utc>,
    pub ended_at: Option<DateTime<Utc>>,
    pub error: Option<String>,
    pub hierarchy: ExecutionHierarchy,   // depth tracking; max depth = MAX_RECURSIVE_DEPTH
    pub container_uid: u32,              // default: 1000 — UID squashed onto all NFS file ops
    pub container_gid: u32,              // default: 1000 — GID squashed onto all NFS file ops
}

pub struct ExecutionHierarchy {
    pub parent_execution_id: Option<ExecutionId>,
    pub depth: u8,  // 0 = root; increments with each swarm child level
    pub path: Vec<ExecutionId>,  // breadcrumb from root to this execution
    pub swarm_id: Option<SwarmId>, // foreign key to the swarm coordination context
}
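Depth enforcement at spawn time can be sketched as follows. This is a minimal sketch: `ExecutionId` is simplified to a `u64`, the `swarm_id` field is omitted, and the `spawn_child` helper is hypothetical, not the orchestrator's real spawn API.

```rust
pub const MAX_RECURSIVE_DEPTH: u8 = 3;

type ExecutionId = u64; // simplified stand-in for the real ID type

pub struct ExecutionHierarchy {
    pub parent_execution_id: Option<ExecutionId>,
    pub depth: u8,              // 0 = root
    pub path: Vec<ExecutionId>, // breadcrumb from root
}

/// Derive a child's hierarchy from its parent, rejecting spawns
/// that would exceed MAX_RECURSIVE_DEPTH.
pub fn spawn_child(
    parent: &ExecutionHierarchy,
    parent_id: ExecutionId,
) -> Result<ExecutionHierarchy, String> {
    if parent.depth >= MAX_RECURSIVE_DEPTH {
        return Err(format!("max swarm depth {} exceeded", MAX_RECURSIVE_DEPTH));
    }
    let mut path = parent.path.clone();
    path.push(parent_id);
    Ok(ExecutionHierarchy {
        parent_execution_id: Some(parent_id),
        depth: parent.depth + 1,
        path,
    })
}
```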

pub struct Iteration {
    pub number: u8,                           // 1-based
    pub status: IterationStatus,
    pub action: String,
    pub output: Option<String>,
    pub validation_results: Option<ValidationResults>,
    pub error: Option<IterationError>,
    pub code_changes: Option<CodeDiff>,
    pub started_at: DateTime<Utc>,
    pub ended_at: Option<DateTime<Utc>>,
    pub llm_interactions: Vec<LlmInteraction>, // full message history for this iteration
}

ExecutionStatus State Machine

pending ──► running ──► completed
               │
               ▼
             failed

(any state) ──► cancelled
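A minimal sketch of the status enum and its transition rule (names match the diagram; the supervisor's actual guard logic may differ):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum ExecutionStatus {
    Pending,
    Running,
    Completed,
    Failed,
    Cancelled,
}

/// Returns true if `from -> to` is a legal transition.
/// Per the diagram, cancellation is reachable from any state.
fn can_transition(from: ExecutionStatus, to: ExecutionStatus) -> bool {
    use ExecutionStatus::*;
    match (from, to) {
        (Pending, Running) => true,
        (Running, Completed) | (Running, Failed) => true,
        (_, Cancelled) => true,
        _ => false,
    }
}
```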

Outer Loop: ExecutionSupervisor

The ExecutionSupervisor manages a single Execution aggregate and orchestrates the iteration lifecycle.

ExecutionSupervisor

├── 1. Resolve Agent manifest from AgentRepository
├── 2. Create Execution aggregate (status=pending)
├── 3. Provision volumes → start NFS gateways for each volume
├── 4. Pull and start container (Docker or Podman via bollard / ContainerRuntime)

├── ITERATION LOOP (max_iterations times):

    ├── 5. Call InnerLoopService.run(context)
    │         → blocks until LLM produces final response
    │         → returns IterationOutput

    ├── 6. Run all validators in manifest.validation
    │         → each produces ValidationScore (0.0–1.0) + Confidence

    └── 7. Evaluate aggregate score
         ├── Score ≥ threshold for all validators → SUCCESS
         │     └── Set Execution.status = completed
         │     └── Publish ExecutionCompleted event
         │
         ├── Score < threshold AND iterations_remaining > 0 → REFINE
         │     └── Inject error context into next iteration context
         │     └── Start next Iteration
         │
         └── Score < threshold AND iterations_remaining == 0 → FAIL
               └── Set Execution.status = failed
               └── Publish ExecutionFailed event

└── 8. Stop container, detach volumes, release locks
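The three-way decision in step 7 can be sketched as follows (hypothetical types; the real supervisor also publishes the corresponding events and records the per-validator confidence):

```rust
#[derive(Debug, PartialEq)]
enum IterationVerdict {
    Success,
    Refine,
    Fail,
}

struct ValidatorResult {
    score: f64,     // 0.0–1.0, as produced by each validator
    threshold: f64, // pass threshold from the agent manifest
}

/// Step 7 of the outer loop: every validator must meet its threshold;
/// otherwise refine while iterations remain, else fail.
fn evaluate(results: &[ValidatorResult], iterations_remaining: u8) -> IterationVerdict {
    let all_passed = results.iter().all(|r| r.score >= r.threshold);
    if all_passed {
        IterationVerdict::Success
    } else if iterations_remaining > 0 {
        IterationVerdict::Refine
    } else {
        IterationVerdict::Fail
    }
}
```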

Container Cleanup Guarantees

The supervisor provides defense-in-depth container cleanup to prevent orphaned containers:

  1. Explicit termination — On every normal exit path (success, failure, timeout, cancellation), the supervisor calls runtime.terminate() which force-removes the container.
  2. RAII guard (ContainerGuard) — Created immediately after runtime.spawn() succeeds. If the code panics or takes an unexpected error path before reaching the explicit terminate call, the guard's Drop implementation spawns a tokio task to force-terminate the container. The guard is defused before intentional termination or debug-retain to avoid double cleanup.
  3. Background orphan reaper — A daemon-level background task runs every 5 minutes (with an immediate first tick on startup). It cross-references all containers labeled aegis.managed=true against execution status in the database, and force-removes any container whose execution is missing, completed, failed, or cancelled.

The keep_container_on_failure flag (set in the agent manifest) bypasses all three layers for failed containers, preserving them for manual debugging.
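The RAII guard from layer 2 can be sketched with std-only types. The real guard's `Drop` spawns a tokio task to call the async runtime API; here termination is a plain closure so the defuse mechanics are visible:

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Force-terminates the container on drop unless defused first.
struct ContainerGuard<F: FnMut()> {
    terminate: F,
    defused: bool,
}

impl<F: FnMut()> ContainerGuard<F> {
    fn new(terminate: F) -> Self {
        Self { terminate, defused: false }
    }

    /// Called just before intentional termination (or debug-retain)
    /// so Drop does not double-clean.
    fn defuse(&mut self) {
        self.defused = true;
    }
}

impl<F: FnMut()> Drop for ContainerGuard<F> {
    fn drop(&mut self) {
        if !self.defused {
            (self.terminate)();
        }
    }
}
```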


FUSE Bind Mount Initialization

In rootless Podman deployments, the orchestrator uses the host-side FUSE daemon to provide workspace volumes instead of NFS. During container creation (step 4 of the outer loop), the orchestrator performs the following sequence:

  1. Calls FuseMountService.Mount(execution_id, volume_id, tenant_id) on the FUSE daemon via gRPC.
  2. The FUSE daemon creates a FUSE mountpoint at <mount_prefix>/<volume_id> and returns the host path.
  3. The orchestrator configures the container with a bind mount from the host FUSE path to the container's declared mount_path (typically /workspace).
  4. Starts the container -- the agent sees a standard POSIX filesystem at /workspace.
  5. On execution end (step 8), the orchestrator calls FuseMountService.Unmount(execution_id, volume_id) to tear down the mountpoint.

This replaces the NFS volume driver mount used in rootful Docker deployments. The agent container code is identical in both cases -- only the volume mount mechanism differs. Transport selection is determined by the presence of fuse_daemon_address in the node configuration.
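Transport selection can be sketched as a simple match on the node configuration (struct and variant names here are assumptions mirroring the prose):

```rust
#[derive(Debug, PartialEq)]
enum VolumeTransport {
    /// Rootless Podman: host-side FUSE daemon + bind mount
    FuseBindMount,
    /// Rootful Docker: NFS volume driver mount
    NfsVolumeDriver,
}

struct NodeConfig {
    fuse_daemon_address: Option<String>, // e.g. a gRPC endpoint
}

/// FUSE is used if and only if a daemon address is configured.
fn select_transport(cfg: &NodeConfig) -> VolumeTransport {
    match cfg.fuse_daemon_address {
        Some(_) => VolumeTransport::FuseBindMount,
        None => VolumeTransport::NfsVolumeDriver,
    }
}
```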


Inner Loop: InnerLoopService

The InnerLoopService drives the conversation loop for a single iteration. It communicates with bootstrap.py running inside the container via the Aegis Dispatch Protocol.

Communication Protocol

The inner loop uses a single HTTP endpoint on the orchestrator host:

POST /v1/dispatch-gateway

This endpoint handles both initial LLM generation requests and the results of in-container command dispatches. For a detailed breakdown of the message schema and loop mechanics, see the Aegis Dispatch Protocol.

Tool Schema Injection

At the start of every generate() call, InnerLoopService fetches the full list of registered tools from ToolInvocationService and maps each one to an OpenAI-format function schema. This schema array is passed to the LLM provider API alongside the conversation history so the model knows which tools are available.

// InnerLoopService::generate() — step 1
let available_tools = tool_invocation_service.get_available_tools().await?;
let tool_schemas: Vec<Value> = available_tools.iter().map(|t| json!({
    "type": "function",
    "function": {
        "name":        &t.name,
        "description": &t.description,
        "parameters":  &t.input_schema,
    }
})).collect();

Tool schemas surface through two layers. The node operator declares which tool servers exist in aegis-config.yaml — this is the ceiling of what any agent on that node can ever invoke. Each agent manifest then declares a spec.tools block that selects the subset of those node-level tools the agent is permitted to use and applies per-tool policy constraints (path allowlists, domain allowlists, subcommand_allowlist, rate limits, and so on).

Only the tools named in spec.tools are included in the schema array sent to the LLM; tools registered on the node but absent from the manifest are invisible to the agent. bootstrap.py itself never sees or manages tool schemas — schema injection and policy enforcement happen entirely in the orchestrator before any dispatch or generation response is returned.
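The manifest-level filtering described above can be sketched as follows (simplified: real schemas carry descriptions and JSON parameter schemas, and per-tool policy constraints are omitted):

```rust
struct ToolSchema {
    name: String,
    // description / parameters omitted for brevity
}

/// Keep only the node-registered tools that the agent manifest's
/// spec.tools block names; everything else stays invisible to the LLM.
fn visible_tools(node_tools: Vec<ToolSchema>, manifest_tools: &[&str]) -> Vec<ToolSchema> {
    node_tools
        .into_iter()
        .filter(|t| manifest_tools.contains(&t.name.as_str()))
        .collect()
}
```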


Inner Loop Steps

bootstrap.py  ──► POST /v1/dispatch-gateway {type:"generate", messages:[...]}

                         Orchestrator receives request
                         Fetches tool schemas from ToolInvocationService
                         Calls LLM provider API with full message history + tool schemas
                         LLM returns response (may contain tool calls)

            ┌───────────────────▼────────────────────────────────────────┐
            │  Tool calls in response?                                   │
            │                                                            │
            │  YES → For each tool call:                                 │
            │    0. If `execution.tool_validation` applies,              │
            │       run semantic judge before dispatch                   │
            │       (child execution, pre-execution, synchronous)        │
            │    1. Route only if the judge passes:                      │
            │       fs.* calls   → AegisFSAL (host, direct)              │
            │       cmd.run      → Dispatch Protocol (in-container)      │
            │       web.*, etc.  → SEAL External (host MCP server)       │
            │                                                            │
            │    cmd.run → return {type:"dispatch", dispatch_id, ...}    │
            │        bootstrap.py receives dispatch message              │
            │        runs subprocess                                     │
            │        POST /v1/dispatch-gateway                           │
            │             {type:"dispatch_result", ...}                  │
            │        loop continues with subprocess result in context    │
            │                                                            │
            │  NO → return {type:"final", content:"..."}                 │
            │       InnerLoopService marks iteration status              │
            └────────────────────────────────────────────────────────────┘

Dispatch Protocol

The Aegis Dispatch Protocol is the mechanism the orchestrator uses to trigger subprocess execution inside the agent container. It is designed to be runtime-agnostic, enabling AEGIS to switch between Docker, Podman, and Firecracker without modification to the agent's execution logic.

AgentRuntime Trait and ContainerRuntime

The AgentRuntime trait abstracts over all supported container backends. The concrete implementation, ContainerRuntime (previously named DockerRuntime), uses the bollard crate to communicate with any Docker-API-compatible daemon -- including both Docker Engine and Podman. The same trait will back future Firecracker support as a separate implementation.

For full technical details, including wire formats and the bootstrap.py implementation, see the Aegis Dispatch Protocol concept page.

Security and Subcommand Allowlist

Every cmd.run invocation is validated against the subcommand_allowlist declared in the agent manifest's spec.tools block before execution. This ensures that even within the container, agents can only execute a strictly defined set of commands and subcommands (e.g., allowing cargo build but not cargo publish).

Policy enforcement happens entirely on the orchestrator host; bootstrap.py receives only commands that have already passed all security checks.
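A minimal sketch of the allowlist check (illustrative only; the production validator also has to consider flags, paths, and shell metacharacters):

```rust
/// Returns true if the command's program plus first subcommand
/// appear in the manifest's subcommand_allowlist.
fn is_allowed(command: &str, allowlist: &[&str]) -> bool {
    let mut parts = command.split_whitespace();
    let prog = match parts.next() {
        Some(p) => p,
        None => return false, // empty command is never allowed
    };
    // Match on "prog sub" when a subcommand exists, else on "prog" alone.
    let key = match parts.next() {
        Some(sub) => format!("{prog} {sub}"),
        None => prog.to_string(),
    };
    allowlist.iter().any(|a| *a == key)
}
```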


Execution Context in Refinement

When a validator fails and a new iteration begins, the orchestrator injects additional context into the next iteration's message history:

{
  "role": "system",
  "content": "Iteration 1 failed validation.\n\nValidator: json_schema\nScore: 0.0 (threshold: 1.0)\nDetails: Required property 'output' is missing.\n\nPlease fix the issue and try again."
}

This context is injected after the user's original task message but before the new iteration begins, so the LLM sees both the original task and the specific failure reason.
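Building that system-message content can be sketched as a formatting helper (a hypothetical function mirroring the payload above):

```rust
/// Format the validation-failure context injected before the next iteration.
fn refinement_context(
    iteration: u8,
    validator: &str,
    score: f64,
    threshold: f64,
    details: &str,
) -> String {
    format!(
        "Iteration {iteration} failed validation.\n\n\
         Validator: {validator}\n\
         Score: {score:.1} (threshold: {threshold:.1})\n\
         Details: {details}\n\n\
         Please fix the issue and try again."
    )
}
```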