Aegis Orchestrator

Custom Runtime Agents (Advanced)

Advanced guide for building a custom container image and bootstrap script when manifest-only agents are not enough.

This guide covers the advanced path for agents that require a custom container image, non-standard dependencies, or a custom runtime script.

For most use cases, use the manifest-first default guide: Writing Your First Agent.


When to Use This Path

Use a custom runtime only when you need one or more of the following:

  • OS-level packages or binaries not available in the default runtime
  • Language runtimes or libraries outside standard AEGIS defaults
  • A specialized bootstrap loop for tightly controlled execution behavior

If you only need instruction, tools, security policies, and validation, stay manifest-only.


Two Paths: StandardRuntime vs CustomRuntime

AEGIS provides two runtime modes, specified in spec.runtime:

StandardRuntime (Manifest-Only Path)

Specify language and version; the orchestrator resolves the official Docker image:

spec:
  runtime:
    language: python
    version: "3.11"
    # Orchestrator resolves to: python:3.11-slim (Docker Hub)

Best for: Most agents. No image building. Automatic updates.

CustomRuntime (This Path)

Specify a fully-qualified Docker image reference instead:

spec:
  runtime:
    image: "ghcr.io/my-org/my-agent:v1.0"
    image_pull_policy: IfNotPresent  # Always | IfNotPresent | Never

Mutual Exclusion: image and language/version are mutually exclusive — the orchestrator infers the runtime type from which fields are present:

| image | language | version | Result |
|-------|----------|---------|--------|
| —     | ✅       | ✅      | StandardRuntime |
| —     | —        | —       | Error — must specify either image OR language+version |
| —     | ✅       | —       | Error — language requires version |
| —     | —        | ✅      | Error — version requires language |
| ✅    | —        | —       | CustomRuntime |
| ✅    | ✅       | —       | Error — cannot specify both image and language |
| ✅    | —        | ✅      | Error — cannot specify both image and version |
| ✅    | ✅       | ✅      | Error — cannot specify image + language + version |

Validation is performed at manifest deserialization time; invalid combinations are rejected before any container is started.

Image format: The image value must be fully-qualified and include a registry component (at least one /). Bare names like my-agent:latest are invalid; use myregistry.io/myorg/my-agent:latest.

You must build and push the image (see Step 1 below).
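The validation rules above can be sketched as a small predicate. This is an illustrative reconstruction of the deserialization-time check, not the orchestrator's actual code; the function name and error messages are hypothetical.

```python
def infer_runtime(runtime: dict) -> str:
    """Infer StandardRuntime vs CustomRuntime from which fields are present.

    Illustrative sketch of the mutual-exclusion rules in the table above;
    the orchestrator's real validator may differ in detail.
    """
    image = runtime.get("image")
    language = runtime.get("language")
    version = runtime.get("version")
    if image and (language or version):
        raise ValueError("cannot specify image together with language/version")
    if image:
        # Fully-qualified check: must include a registry component (at least one /).
        if "/" not in image:
            raise ValueError("image must be fully-qualified (registry/org/image:tag)")
        return "CustomRuntime"
    if language and version:
        return "StandardRuntime"
    if language:
        raise ValueError("language requires version")
    if version:
        raise ValueError("version requires language")
    raise ValueError("must specify either image or language+version")
```

For example, `infer_runtime({"image": "ghcr.io/my-org/my-agent:v1.0"})` selects the custom path, while a bare `my-agent:latest` is rejected.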


Project Structure

my-agent/
├── agent.yaml
├── bootstrap.py
├── Dockerfile
└── output_schema.json

Step 1: Build the Container Image

FROM python:3.11-slim

RUN pip install --no-cache-dir aegis-sdk

WORKDIR /agent
COPY bootstrap.py .
COPY output_schema.json .

CMD ["python", "/agent/bootstrap.py"]

Build and push the image:

docker build -t myregistry/my-agent:latest .
docker push myregistry/my-agent:latest

Step 2: Implement bootstrap.py

bootstrap.py is the in-container entrypoint. It implements the Aegis Dispatch Protocol — a bidirectional loop over POST /v1/dispatch-gateway that the orchestrator uses to drive the LLM conversation and dispatch in-container commands.

Do not import aegis-sdk inside a custom bootstrap script. The SDK is a control-plane client for deploying agents from outside the runtime. A custom bootstrap must be stdlib-only (Python) or equivalent, to avoid a pip-install dependency before the container is ready. Use the aegis.bootstrap module types from the SDK only as a local reference during development — do not ship the import.

The protocol has three message types:

| Message | Direction | Meaning |
|---------|-----------|---------|
| AgentMessage {type:"generate"} | bootstrap → orchestrator | Start / continue the inner loop |
| OrchestratorMessage {type:"dispatch"} | orchestrator → bootstrap | Run a subprocess inside the container |
| OrchestratorMessage {type:"final"} | orchestrator → bootstrap | Inner loop complete; print content and exit |

#!/usr/bin/env python3
"""Custom bootstrap — implements Aegis Dispatch Protocol.

This file runs inside the agent container. It is stdlib-only; no third-party
packages are imported. The orchestrator fully renders the prompt and passes it
as argv[1] before this script is executed.
"""
import json
import os
import subprocess
import sys
import time
import urllib.error
import urllib.request

ORCHESTRATOR_URL = os.environ.get("AEGIS_ORCHESTRATOR_URL", "http://host.docker.internal:8088")
EXECUTION_ID = os.environ["AEGIS_EXECUTION_ID"]
AGENT_ID = os.environ.get("AEGIS_AGENT_ID", "")
ITERATION = int(os.environ.get("AEGIS_ITERATION", "1"))
MODEL_ALIAS = os.environ["AEGIS_MODEL_ALIAS"]


def post_json(payload: dict, timeout: int = 0) -> dict:
    """POST to /v1/dispatch-gateway. timeout=0 disables socket timeout (long-poll)."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        f"{ORCHESTRATOR_URL}/v1/dispatch-gateway",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout or None) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_dispatch(msg: dict) -> dict:
    """Execute an action dispatched by the orchestrator and return the result."""
    dispatch_id = msg["dispatch_id"]
    action = msg.get("action")

    if action == "exec":
        command = [msg["command"]] + msg.get("args", [])
        env = os.environ.copy()
        env.update(msg.get("env_additions", {}))
        timeout_secs = msg.get("timeout_secs", 60)
        max_bytes = msg.get("max_output_bytes", 524288)
        started = time.monotonic()
        try:
            result = subprocess.run(
                command,
                cwd=msg.get("cwd", "/workspace"),
                env=env,
                capture_output=True,
                timeout=timeout_secs,
            )
            duration_ms = int((time.monotonic() - started) * 1000)
            stdout = result.stdout.decode("utf-8", errors="replace")
            stderr = result.stderr.decode("utf-8", errors="replace")
            # Truncate oversized output, keeping the tail of each stream.
            # (Slicing is by characters, which approximates the byte budget.)
            truncated = len((stdout + stderr).encode()) > max_bytes
            if truncated:
                half = max_bytes // 2
                stdout, stderr = stdout[-half:], stderr[-half:]
            return {
                "type": "dispatch_result",
                "execution_id": EXECUTION_ID,
                "dispatch_id": dispatch_id,
                "exit_code": result.returncode,
                "stdout": stdout,
                "stderr": stderr,
                "duration_ms": duration_ms,
                "truncated": truncated,
            }
        except subprocess.TimeoutExpired:
            return {
                "type": "dispatch_result",
                "execution_id": EXECUTION_ID,
                "dispatch_id": dispatch_id,
                "exit_code": -1,
                "stdout": "",
                "stderr": f"[AEGIS] Command timed out after {timeout_secs}s",
                "duration_ms": timeout_secs * 1000,
                "truncated": False,
            }
    # Unknown action — report gracefully so the orchestrator can inject a tool error.
    return {
        "type": "dispatch_result",
        "execution_id": EXECUTION_ID,
        "dispatch_id": dispatch_id,
        "exit_code": -1,
        "stdout": "",
        "stderr": f"unknown_action:{action}",
        "duration_ms": 0,
        "truncated": False,
    }


def main():
    prompt = sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read().strip()
    if not prompt:
        print("Error: no prompt provided", file=sys.stderr)
        sys.exit(1)

    # Send the initial generate request to start the inner loop.
    msg = post_json(
        {
            "type": "generate",
            "agent_id": AGENT_ID,
            "execution_id": EXECUTION_ID,
            "iteration_number": ITERATION,
            "model_alias": MODEL_ALIAS,
            "prompt": prompt,
            "messages": [],
        }
    )

    # Execute any dispatch commands until the orchestrator issues type="final".
    while msg.get("type") == "dispatch":
        msg = post_json(run_dispatch(msg))

    # Print the final LLM response to stdout; the orchestrator captures it.
    print(msg.get("content", ""))


if __name__ == "__main__":
    main()
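
The generate → dispatch* → final sequence in main() can be exercised without a running orchestrator by abstracting the HTTP transport behind a callable. The harness below is a self-contained, illustrative simulation — run_inner_loop, fake_orchestrator, and fake_handler are hypothetical names, and the canned messages stand in for real post_json() round-trips.

```python
def run_inner_loop(send, handle_dispatch):
    """Drive the generate -> dispatch* -> final sequence, as main() does,
    with the HTTP transport abstracted behind `send`."""
    msg = send({"type": "generate", "prompt": "demo", "messages": []})
    while msg.get("type") == "dispatch":
        msg = send(handle_dispatch(msg))
    return msg.get("content", "")


def fake_orchestrator():
    """Canned orchestrator: answers with one dispatch, then a final message."""
    script = iter([
        {"type": "dispatch", "dispatch_id": "d1", "action": "exec",
         "command": "echo", "args": ["hi"]},
        {"type": "final", "content": "done"},
    ])
    return lambda _payload: next(script)


def fake_handler(msg):
    """Stand-in for run_dispatch() that pretends the command succeeded."""
    return {"type": "dispatch_result", "dispatch_id": msg["dispatch_id"],
            "exit_code": 0, "stdout": "hi\n", "stderr": ""}
```

Calling run_inner_loop(fake_orchestrator(), fake_handler) walks the full loop and returns the final content, "done".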

Step 3: Add Output Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["solution_path", "output"],
  "properties": {
    "solution_path": {
      "type": "string",
      "pattern": "^/workspace/"
    },
    "output": {
      "type": "string",
      "minLength": 1
    }
  },
  "additionalProperties": false
}
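
Before deploying, you can spot-check a candidate result against the constraints of output_schema.json. The helper below is an illustrative, stdlib-only mirror of those constraints (a real validator such as a JSON Schema library would be more thorough); check_output is a hypothetical name.

```python
import re

REQUIRED = ("solution_path", "output")

def check_output(doc: dict) -> list:
    """Return a list of violations of the output_schema.json constraints above."""
    errors = []
    for key in REQUIRED:
        if key not in doc:
            errors.append("missing required field: " + key)
    if "solution_path" in doc and not re.match(r"^/workspace/", doc["solution_path"]):
        errors.append("solution_path must match ^/workspace/")
    if doc.get("output") == "":
        errors.append("output must have minLength 1")
    extra = set(doc) - set(REQUIRED)
    if extra:
        errors.append("additionalProperties not allowed: " + ", ".join(sorted(extra)))
    return errors
```

A conforming document like {"solution_path": "/workspace/result.json", "output": "ok"} yields an empty error list; a path outside /workspace or an extra key does not.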

Step 4: Configure agent.yaml

apiVersion: 100monkeys.ai/v1
kind: Agent
metadata:
  name: python-coder
  version: "1.0.0"
spec:
  # CustomRuntime: Specify image instead of language+version
  runtime:
    image: "ghcr.io/my-org/my-agent:latest"
    image_pull_policy: IfNotPresent  # Always | IfNotPresent | Never
    isolation: docker

  task:
    instruction: |
      Solve the provided coding task and write output to /workspace/result.json.

  execution:
    mode: iterative
    max_iterations: 10
    validation:
      system:
        must_succeed: true
      output:
        format: json
        schema:
          type: object
          required: ["solution_path", "output"]
          properties:
            solution_path:
              type: string
            output:
              type: string

  security:
    network:
      mode: allow
      allowlist:
        - pypi.org
    filesystem:
      read:
        - /workspace
        - /agent
      write:
        - /workspace
    resources:
      cpu: 1000
      memory: "1Gi"
      timeout: "300s"

  volumes:
    - name: workspace
      storage_class: ephemeral
      mount_path: /workspace
      access_mode: read-write
      ttl_hours: 1
      size_limit: "5Gi"

  tools:
    - name: filesystem
      server: "mcp:filesystem"
      config:
        allowed_paths: ["/workspace", "/agent"]
        access_mode: read-write

Security Policy Enforcement

All spec.security policies are enforced by the orchestrator regardless of what the container image contains; a custom image cannot override or bypass them.

This covers every field under spec.security: network mode and allowlist, filesystem read/write paths, and CPU/memory/timeout limits. Enforcement happens at the orchestrator's network and filesystem layers, independent of what runs inside the image.
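
To illustrate the allowlist semantics only (actual enforcement happens at the network layer, outside the container, and its exact matching rules are not specified here), a host check might look like the sketch below; treating subdomains as allowed is an assumption.

```python
from urllib.parse import urlparse

# Mirrors spec.security.network.allowlist from the manifest above.
ALLOWLIST = {"pypi.org"}

def is_host_allowed(url: str) -> bool:
    """True if the URL's host is an allowlist entry or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return host in ALLOWLIST or any(host.endswith("." + allowed) for allowed in ALLOWLIST)
```

With this allowlist, a request to https://pypi.org/simple/ passes while any other host is blocked.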


Bootstrap Handling

When you specify spec.runtime.image, the orchestrator injects its standard bootstrap script into your container, which manages the 100monkeys iteration loop.

Default behavior (Option A — Orchestrator injects bootstrap):

The orchestrator copies assets/bootstrap.py into the container at /usr/local/bin/aegis-bootstrap if that path is not already present, then executes it. Your image must have Python available.

spec:
  runtime:
    image: "ghcr.io/myorg/my-agent:latest"
  # Bootstrap injected automatically by orchestrator

Custom bootstrap (Option B — Bootstrap bundled in image):

Include your own bootstrap script in the image and declare its path in spec.advanced.bootstrap_path. The orchestrator detects the file is already present and skips injection, executing your script directly instead.

spec:
  runtime:
    image: "ghcr.io/myorg/node-agent:1.0"
  advanced:
    bootstrap_path: "/agent/bootstrap.js"  # Path inside the container

# Dockerfile
FROM node:20-alpine
RUN apk add --no-cache python3
COPY bootstrap.js /agent/bootstrap.js
CMD ["tail", "-f", "/dev/null"]

The custom bootstrap must implement the dispatch protocol to communicate with the orchestrator via AEGIS_ORCHESTRATOR_URL.
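
The Option A / Option B decision described above can be summarized as: execute the declared (or default) bootstrap path if it already exists in the image, otherwise inject the standard script at the default path. The sketch below is a hypothetical model of that decision — resolve_bootstrap and container_root are illustrative names, not orchestrator API.

```python
import os
from typing import Optional, Tuple

DEFAULT_BOOTSTRAP = "/usr/local/bin/aegis-bootstrap"

def resolve_bootstrap(container_root: str, bootstrap_path: Optional[str]) -> Tuple[str, bool]:
    """Return (path_to_execute, needs_injection) for a container filesystem
    rooted at container_root."""
    candidate = bootstrap_path or DEFAULT_BOOTSTRAP
    on_disk = os.path.join(container_root, candidate.lstrip("/"))
    if os.path.exists(on_disk):
        # Option B: bootstrap already bundled in the image -> skip injection.
        return candidate, False
    # Option A: inject the standard bootstrap at the default path, then run it.
    return DEFAULT_BOOTSTRAP, True
```

An image that ships /agent/bootstrap.js and declares it in spec.advanced.bootstrap_path resolves to that script with no injection; an image without it falls back to the injected default.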


Step 5: Deploy and Execute

aegis agent deploy ./my-agent/agent.yaml
aegis task execute python-coder --input '{"task":"Write a prime checker"}' --follow

Common Issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| Image pull failure | Registry auth/image tag issue | Verify the image exists and is fully-qualified (registry/org/image:tag); see Container Registry & Image Management for credential setup |
| Startup error in container | Missing package or bad entrypoint | Validate the Dockerfile and CMD |
| Tool call rejected | Tool not declared in manifest | Add the required tool to spec.tools |
| Timeout during run | Heavy workload or slow dependency | Increase the resource timeout or optimize the bootstrap flow |
