Custom Runtime Agents (Advanced)
Advanced guide for building a custom container image and bootstrap script when manifest-only agents are not enough.
This guide covers the advanced path for agents that require a custom container image, non-standard dependencies, or a custom runtime script.
For most use cases, use the manifest-first default guide: Writing Your First Agent.
When to Use This Path
Use a custom runtime only when you need one or more of the following:
- OS-level packages or binaries not available in the default runtime
- Language runtimes or libraries outside standard AEGIS defaults
- A specialized bootstrap loop for tightly controlled execution behavior
If you only need instruction, tools, security policies, and validation, stay manifest-only.
Two Paths: StandardRuntime vs CustomRuntime
AEGIS provides two runtime modes, specified in spec.runtime:
StandardRuntime (Manifest-Only Path)
Specify language and version; orchestrator determines the official Docker image:
spec:
  runtime:
    language: python
    version: "3.11"
    # Orchestrator resolves to: python:3.11-slim (Docker Hub)

Best for: Most agents. No image building. Automatic updates.
CustomRuntime (This Path)
Specify a fully-qualified Docker image reference instead:
spec:
  runtime:
    image: "ghcr.io/my-org/my-agent:v1.0"
    image_pull_policy: IfNotPresent  # Always | IfNotPresent | Never

Mutual Exclusion: image and language/version are mutually exclusive. The orchestrator infers the runtime type from which fields are present:
| image | language | version | Result |
|---|---|---|---|
| ❌ | ✅ | ✅ | StandardRuntime ✅ |
| ❌ | ❌ | ❌ | Error — must specify either image OR language+version |
| ❌ | ✅ | ❌ | Error — language requires version |
| ❌ | ❌ | ✅ | Error — version requires language |
| ✅ | ❌ | ❌ | CustomRuntime ✅ |
| ✅ | ✅ | ❌ | Error — cannot specify both image and language |
| ✅ | ❌ | ✅ | Error — cannot specify both image and version |
| ✅ | ✅ | ✅ | Error — cannot specify image + language + version |
Validation is performed at manifest deserialization time; invalid combinations are rejected before any container is started.
Image format: The image value must be fully-qualified and include a registry component (at least one /). Bare names like my-agent:latest are invalid; use myregistry.io/myorg/my-agent:latest.
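The truth table and the image format rule above can be condensed into one small function. This is a minimal sketch for checking a manifest locally before deploying, not the orchestrator's actual validator; the function name infer_runtime_type and the error messages are hypothetical.

```python
from typing import Optional


def infer_runtime_type(
    image: Optional[str] = None,
    language: Optional[str] = None,
    version: Optional[str] = None,
) -> str:
    """Return "CustomRuntime" or "StandardRuntime", or raise ValueError."""
    if image is not None:
        if language is not None or version is not None:
            raise ValueError("image and language/version are mutually exclusive")
        # Image format rule: require a registry component (at least one "/").
        if "/" not in image:
            raise ValueError(f"image {image!r} must include a registry component")
        return "CustomRuntime"
    if language is not None and version is not None:
        return "StandardRuntime"
    if language is not None:
        raise ValueError("language requires version")
    if version is not None:
        raise ValueError("version requires language")
    raise ValueError("must specify either image OR language+version")


print(infer_runtime_type(image="ghcr.io/my-org/my-agent:v1.0"))  # CustomRuntime
print(infer_runtime_type(language="python", version="3.11"))     # StandardRuntime
```

Calling it with image="my-agent:latest" raises ValueError, which mirrors the bare-name rejection described above.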
You must build and push the image (see Step 1 below).
Project Structure
my-agent/
├── agent.yaml
├── bootstrap.py
├── Dockerfile
└── output_schema.json
Step 1: Build the Container Image
FROM python:3.11-slim
RUN pip install --no-cache-dir aegis-sdk
WORKDIR /agent
COPY bootstrap.py .
COPY output_schema.json .
CMD ["python", "/agent/bootstrap.py"]

Build and push the image under the same reference that agent.yaml declares in spec.runtime.image:

docker build -t ghcr.io/my-org/my-agent:latest .
docker push ghcr.io/my-org/my-agent:latest
Step 2: Implement bootstrap.py
bootstrap.py is the in-container entrypoint. It implements the Aegis Dispatch Protocol — a bidirectional loop over POST /v1/dispatch-gateway that the orchestrator uses to drive the LLM conversation and dispatch in-container commands.
Do not import aegis-sdk inside a custom bootstrap script. The SDK is a control-plane client for deploying agents from outside the runtime. A custom bootstrap must be stdlib-only (Python) or equivalent, to avoid a pip-install dependency before the container is ready. Use the aegis.bootstrap module types from the SDK only as a local reference during development — do not ship the import.
The protocol has three message types:
| Message | Direction | Meaning |
|---|---|---|
| AgentMessage {type:"generate"} | bootstrap → orchestrator | Start / continue the inner loop |
| OrchestratorMessage {type:"dispatch"} | orchestrator → bootstrap | Run a subprocess inside the container |
| OrchestratorMessage {type:"final"} | orchestrator → bootstrap | Inner loop complete; print content and exit |
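The message flow in the table can be exercised without a running orchestrator. The sketch below replaces the HTTP call with a scripted stand-in so the generate, dispatch, final sequence is visible in isolation. The helpers drive_protocol, make_fake_orchestrator, and fake_run_dispatch are hypothetical names, and the payloads are trimmed to the few fields needed to show the control flow.

```python
from typing import Callable


def drive_protocol(
    send: Callable[[dict], dict],
    run_dispatch: Callable[[dict], dict],
    prompt: str,
) -> str:
    """Run the generate -> dispatch* -> final loop and return the final content."""
    msg = send({"type": "generate", "prompt": prompt})
    while msg.get("type") == "dispatch":
        # Each dispatch is answered with a result, which yields the next message.
        msg = send(run_dispatch(msg))
    return msg.get("content", "")


def make_fake_orchestrator():
    """Scripted stand-in: replies with one dispatch, then final."""
    received = []
    replies = [
        {"type": "dispatch", "dispatch_id": "d-1", "action": "exec",
         "command": "echo", "args": ["hi"]},
        {"type": "final", "content": "done"},
    ]

    def send(payload: dict) -> dict:
        received.append(payload)
        return replies[len(received) - 1]

    return send, received


def fake_run_dispatch(msg: dict) -> dict:
    """Pretend the command ran successfully; no subprocess is started."""
    return {"type": "dispatch_result", "dispatch_id": msg["dispatch_id"],
            "exit_code": 0, "stdout": "hi\n", "stderr": ""}


send, received = make_fake_orchestrator()
print(drive_protocol(send, fake_run_dispatch, "Write a prime checker"))  # done
print([m["type"] for m in received])  # ['generate', 'dispatch_result']
```

Keeping the loop separate from the transport makes it straightforward to unit-test a bootstrap before building the image.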
#!/usr/bin/env python3
"""Custom bootstrap: implements the Aegis Dispatch Protocol.

This file runs inside the agent container. It is stdlib-only; no third-party
packages are imported. The orchestrator fully renders the prompt and passes it
as argv[1] before this script is executed.
"""
import json
import os
import subprocess
import sys
import time
import urllib.error
import urllib.request

ORCHESTRATOR_URL = os.environ.get("AEGIS_ORCHESTRATOR_URL", "http://host.docker.internal:8088")
EXECUTION_ID = os.environ["AEGIS_EXECUTION_ID"]
AGENT_ID = os.environ.get("AEGIS_AGENT_ID", "")
ITERATION = int(os.environ.get("AEGIS_ITERATION", "1"))
MODEL_ALIAS = os.environ["AEGIS_MODEL_ALIAS"]


def post_json(payload: dict, timeout: int = 0) -> dict:
    """POST to /v1/dispatch-gateway. timeout=0 disables socket timeout (long-poll)."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        f"{ORCHESTRATOR_URL}/v1/dispatch-gateway",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout or None) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_dispatch(msg: dict) -> dict:
    """Execute an action dispatched by the orchestrator and return the result."""
    dispatch_id = msg["dispatch_id"]
    action = msg.get("action")

    if action == "exec":
        command = [msg["command"]] + msg.get("args", [])
        env = os.environ.copy()
        env.update(msg.get("env_additions", {}))
        timeout_secs = msg.get("timeout_secs", 60)
        max_bytes = msg.get("max_output_bytes", 524288)
        started = time.monotonic()
        try:
            result = subprocess.run(
                command,
                cwd=msg.get("cwd", "/workspace"),
                env=env,
                capture_output=True,
                timeout=timeout_secs,
            )
            duration_ms = int((time.monotonic() - started) * 1000)
            stdout = result.stdout.decode("utf-8", errors="replace")
            stderr = result.stderr.decode("utf-8", errors="replace")
            truncated = len((stdout + stderr).encode()) > max_bytes
            if truncated:
                # Keep the tail of each stream, splitting the budget evenly.
                half = max_bytes // 2
                stdout, stderr = stdout[-half:], stderr[-half:]
            return {
                "type": "dispatch_result",
                "execution_id": EXECUTION_ID,
                "dispatch_id": dispatch_id,
                "exit_code": result.returncode,
                "stdout": stdout,
                "stderr": stderr,
                "duration_ms": duration_ms,
                "truncated": truncated,
            }
        except subprocess.TimeoutExpired:
            return {
                "type": "dispatch_result",
                "execution_id": EXECUTION_ID,
                "dispatch_id": dispatch_id,
                "exit_code": -1,
                "stdout": "",
                "stderr": f"[AEGIS] Command timed out after {timeout_secs}s",
                "duration_ms": timeout_secs * 1000,
                "truncated": False,
            }

    # Unknown action: report gracefully so the orchestrator can inject a tool error.
    return {
        "type": "dispatch_result",
        "execution_id": EXECUTION_ID,
        "dispatch_id": dispatch_id,
        "exit_code": -1,
        "stdout": "",
        "stderr": f"unknown_action:{action}",
        "duration_ms": 0,
        "truncated": False,
    }


def main():
    prompt = sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read().strip()
    if not prompt:
        print("Error: no prompt provided", file=sys.stderr)
        sys.exit(1)

    # Send the initial generate request to start the inner loop.
    msg = post_json(
        {
            "type": "generate",
            "agent_id": AGENT_ID,
            "execution_id": EXECUTION_ID,
            "iteration_number": ITERATION,
            "model_alias": MODEL_ALIAS,
            "prompt": prompt,
            "messages": [],
        }
    )

    # Execute any dispatch commands until the orchestrator issues type="final".
    while msg.get("type") == "dispatch":
        msg = post_json(run_dispatch(msg))

    # Print the final LLM response to stdout; the orchestrator captures it.
    print(msg.get("content", ""))


if __name__ == "__main__":
    main()

Step 3: Add Output Schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["solution_path", "output"],
  "properties": {
    "solution_path": {
      "type": "string",
      "pattern": "^/workspace/"
    },
    "output": {
      "type": "string",
      "minLength": 1
    }
  },
  "additionalProperties": false
}

Step 4: Configure agent.yaml
apiVersion: 100monkeys.ai/v1
kind: Agent
metadata:
  name: python-coder
  version: "1.0.0"
spec:
  # CustomRuntime: Specify image instead of language+version
  runtime:
    image: "ghcr.io/my-org/my-agent:latest"
    image_pull_policy: IfNotPresent  # Always | IfNotPresent | Never
    isolation: docker
  task:
    instruction: |
      Solve the provided coding task and write output to /workspace/result.json.
  execution:
    mode: iterative
    max_iterations: 10
    validation:
      system:
        must_succeed: true
      output:
        format: json
        schema:
          type: object
          required: ["solution_path", "output"]
          properties:
            solution_path:
              type: string
            output:
              type: string
  security:
    network:
      mode: allow
      allowlist:
        - pypi.org
    filesystem:
      read:
        - /workspace
        - /agent
      write:
        - /workspace
    resources:
      cpu: 1000
      memory: "1Gi"
      timeout: "300s"
  volumes:
    - name: workspace
      storage_class: ephemeral
      mount_path: /workspace
      access_mode: read-write
      ttl_hours: 1
      size_limit: "5Gi"
  tools:
    - name: filesystem
      server: "mcp:filesystem"
      config:
        allowed_paths: ["/workspace", "/agent"]
        access_mode: read-write

Security Policy Enforcement
All spec.security policies — network allowlist, filesystem permissions, and resource limits — are enforced by the orchestrator regardless of what the container image contains. A custom image cannot override or bypass these constraints.
This applies to every field under spec.security: network mode and allowlist, filesystem read/write paths, and CPU/memory/timeout limits. The orchestrator layers enforce isolation at the network and filesystem level, independent of what runs inside the image.
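To make the filesystem policy concrete, the sketch below shows one plausible way a read or write path could be checked against the roots listed under spec.security.filesystem. This guide does not describe how the orchestrator actually matches paths, so the normalize-then-prefix approach and the helper name is_path_permitted are assumptions for illustration only.

```python
import posixpath


def is_path_permitted(path: str, allowed_roots: list[str]) -> bool:
    """True if `path`, once normalized, equals an allowed root or lies beneath one."""
    normalized = posixpath.normpath(path)
    for root in allowed_roots:
        root = posixpath.normpath(root)
        # Compare against "root/" so that /workspace-other does not match /workspace.
        if normalized == root or normalized.startswith(root.rstrip("/") + "/"):
            return True
    return False


write_roots = ["/workspace"]  # mirrors spec.security.filesystem.write in the manifest
print(is_path_permitted("/workspace/result.json", write_roots))    # True
print(is_path_permitted("/workspace/../etc/passwd", write_roots))  # False
print(is_path_permitted("/workspace-other/file", write_roots))     # False
```

Normalizing before comparing is what keeps a ".." segment from escaping an allowed root in this sketch.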
Bootstrap Handling
When you specify spec.runtime.image, the orchestrator by default injects its standard bootstrap script into your container; that script manages the 100monkeys iteration loop. You can instead bundle your own bootstrap in the image.
Default behavior (Option A: orchestrator injects bootstrap):
The orchestrator copies assets/bootstrap.py into the container at /usr/local/bin/aegis-bootstrap if that path is not already present, then executes it. Your image must have Python available.
spec:
  runtime:
    image: "ghcr.io/myorg/my-agent:latest"
    # Bootstrap injected automatically by orchestrator

Custom bootstrap (Option B: bootstrap bundled in image):
Include your own bootstrap script in the image and declare its path in spec.advanced.bootstrap_path. The orchestrator detects the file is already present and skips injection, executing your script directly instead.
spec:
  runtime:
    image: "ghcr.io/myorg/node-agent:1.0"
  advanced:
    bootstrap_path: "/agent/bootstrap.js"  # Path inside the container

# Dockerfile
FROM node:20-alpine
RUN apk add --no-cache python3
COPY bootstrap.js /agent/bootstrap.js
CMD ["tail", "-f", "/dev/null"]

The custom bootstrap must implement the dispatch protocol to communicate with the orchestrator via AEGIS_ORCHESTRATOR_URL.
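The choice between Option A and Option B can be summarized as a small decision function. This is a sketch of the behavior described in this section, not orchestrator source code. The name plan_bootstrap and the exists_in_image callback are hypothetical, and raising an error when a declared path is missing is an assumption, since this guide does not say what happens in that case.

```python
from typing import Callable

DEFAULT_BOOTSTRAP_PATH = "/usr/local/bin/aegis-bootstrap"


def plan_bootstrap(spec: dict, exists_in_image: Callable[[str], bool]) -> dict:
    """Decide which script runs and whether the default bootstrap must be injected."""
    declared = spec.get("advanced", {}).get("bootstrap_path")
    if declared:
        # Option B: the bundled script is executed directly, with no injection.
        if not exists_in_image(declared):
            # Assumption: a missing declared path is treated as an error here.
            raise FileNotFoundError(f"bootstrap_path {declared!r} not found in image")
        return {"path": declared, "inject": False}
    # Option A: inject the default bootstrap only if it is not already present.
    return {
        "path": DEFAULT_BOOTSTRAP_PATH,
        "inject": not exists_in_image(DEFAULT_BOOTSTRAP_PATH),
    }


image_files = {"/agent/bootstrap.js"}  # files assumed to exist in the image
option_b = {"advanced": {"bootstrap_path": "/agent/bootstrap.js"}}
print(plan_bootstrap(option_b, image_files.__contains__))
# {'path': '/agent/bootstrap.js', 'inject': False}
print(plan_bootstrap({}, image_files.__contains__))
# {'path': '/usr/local/bin/aegis-bootstrap', 'inject': True}
```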
Step 5: Deploy and Execute
aegis agent deploy ./my-agent/agent.yaml
aegis task execute python-coder --input '{"task":"Write a prime checker"}' --follow
Common Issues
| Symptom | Cause | Fix |
|---|---|---|
| Image pull failure | Registry auth/image tag issue | Verify image exists and is fully-qualified (registry/org/image:tag); see Container Registry & Image Management for credential setup |
| Startup error in container | Missing package or bad entrypoint | Validate Dockerfile and CMD |
| Tool call rejected | Tool not declared in manifest | Add required tool to spec.tools |
| Timeout during run | Heavy workload or slow dependency | Increase resource timeout or optimize bootstrap flow |
Related Docs
- Writing Your First Agent: Standard (manifest-only) agents
- Standard Runtime Registry: Supported language versions and their Docker images
- Agent Manifest Reference: Complete spec.runtime field reference
- Container Registry & Image Management: ImagePullPolicy, private registry credentials, and pull failure troubleshooting
- Deploying Agents