Production Hardening

Security checklist and hardening guide for production AEGIS deployments — TLS, secrets rotation, resource limits, network segmentation, and access control.

This page provides a checklist and guidance for hardening AEGIS platform deployments for production use.


Pre-Deployment Checklist

| Item | Action | Priority |
|---|---|---|
| Strong passwords | Replace all default passwords in .env (PostgreSQL, Grafana) | Critical |
| SEAL keys | Generate unique Ed25519 keypair (make generate-keys) | Critical |
| TLS termination | Deploy Caddy edge proxy with valid certificates | Critical |
| Image pinning | Set AEGIS_IMAGE_TAG to a pinned semver, not latest | High |
| Registry auth | Configure GHCR credentials (make registry-login) | High |
| Firewall rules | Restrict all ports except 80/443 to internal network | High |
| Secrets bootstrap | Initialize OpenBao (make bootstrap-secrets) | High |
| Log format | Set AEGIS_LOG_FORMAT=json for machine-parseable logs | Medium |
| Monitoring alerts | Verify Prometheus alert rules are active | Medium |
| Backup schedule | Configure automated backups for PostgreSQL and OpenBao | Medium |
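
Two of the checklist items (image pinning and log format) can be verified mechanically before deployment. The following is a minimal sketch, assuming .env contains plain KEY=value lines; the function names are hypothetical and not part of the AEGIS tooling.

```shell
#!/bin/sh
# Pre-flight sketch for two checklist items.
# Assumption: the env file uses plain KEY=value lines.

# Print the value of KEY from an env file (last assignment wins, quotes stripped).
env_value() {
  sed -n "s/^$1=//p" "$2" | tail -n 1 | tr -d "\"'"
}

# Succeed only if the tag is a plain MAJOR.MINOR.PATCH version such as 1.2.3.
is_pinned_semver() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# Report problems and return non-zero if any check fails.
preflight_env() {
  env_file=$1
  status=0
  tag=$(env_value AEGIS_IMAGE_TAG "$env_file")
  if ! is_pinned_semver "$tag"; then
    echo "FAIL: AEGIS_IMAGE_TAG='$tag' is not a pinned semver"
    status=1
  fi
  fmt=$(env_value AEGIS_LOG_FORMAT "$env_file")
  if [ "$fmt" != "json" ]; then
    echo "FAIL: AEGIS_LOG_FORMAT='$fmt' (expected json)"
    status=1
  fi
  return "$status"
}
```

Running `preflight_env .env` in a deployment pipeline would stop the rollout when either setting is missing or wrong. The remaining checklist items need manual or service-specific verification.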

TLS Everywhere

External TLS

Deploy the Caddy edge proxy for automatic TLS on all public-facing endpoints. Caddy handles certificate issuance and renewal via ACME.

Internal TLS

For high-security environments, enable TLS on internal pod communication:

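  • OpenBao: Set tls_disable = false in the listener stanza of openbao-config.hcl and provide the certificate and key paths (tls_cert_file, tls_key_file)
  • PostgreSQL: Set ssl = on in postgresql.conf and provide a server certificate and key (ssl_cert_file, ssl_key_file)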

Internal TLS is optional for single-node deployments where all pods share the same host. For multi-node clusters, TLS between nodes is strongly recommended.
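
The two settings above can be checked with a simple text scan before the pods start. This is a rough sketch: the function names are hypothetical, and the patterns only cover the common spellings of each setting, so treat a pass as a hint, not proof that TLS is working.

```shell
#!/bin/sh
# Succeeds if the OpenBao config file does not disable TLS.
# Matches uncommented 'tls_disable = true', '"true"', or '1'.
openbao_tls_enabled() {
  [ -r "$1" ] || return 2
  ! grep -Eq '^[[:space:]]*tls_disable[[:space:]]*=[[:space:]]*"?(true|1)"?' "$1"
}

# Succeeds if postgresql.conf contains an uncommented 'ssl = on'.
# Other truthy spellings accepted by PostgreSQL are not matched here.
postgres_ssl_enabled() {
  [ -r "$1" ] || return 2
  grep -Eq "^[[:space:]]*ssl[[:space:]]*=[[:space:]]*'?on'?" "$1"
}
```

For example, `openbao_tls_enabled openbao-config.hcl && postgres_ssl_enabled postgresql.conf` returns success only when both files look correct. The actual file locations depend on how the pods mount their configuration.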


Secrets Management

Credential Rotation

Rotate credentials regularly:

| Secret | Rotation Method | Recommended Interval |
|---|---|---|
| PostgreSQL passwords | Update .env, restart pod-database and dependents | 90 days |
| OpenBao AppRole secret_id | Re-run make bootstrap-secrets | 90 days |
| SEAL signing keys | Regenerate with make generate-keys, restart pod-core | Annually |
| GHCR token | Regenerate GitHub PAT, update .env | Annually |
| LLM API keys | Rotate via provider dashboard, update .env or OpenBao | Per provider policy |
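
The intervals in the table can be enforced with a small age check. The sketch below assumes you record the time of each rotation as a Unix timestamp somewhere of your choosing; the function name and that bookkeeping are hypothetical.

```shell
#!/bin/sh
# Succeeds if a secret last rotated at LAST_EPOCH is at least INTERVAL_DAYS old.
# Usage: rotation_due LAST_EPOCH INTERVAL_DAYS [NOW_EPOCH]
rotation_due() {
  last_epoch=$1
  interval_days=$2
  now_epoch=${3:-$(date +%s)}
  # 86400 seconds per day; integer division rounds partial days down.
  age_days=$(( (now_epoch - last_epoch) / 86400 ))
  [ "$age_days" -ge "$interval_days" ]
}
```

A scheduled job could call `rotation_due "$last_rotated" 90 && echo "PostgreSQL password is due for rotation"`, using 365 for the annually rotated secrets.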

Avoiding Plaintext Secrets

  • Never commit .env files to version control
  • Use env:VAR_NAME or secret:path credential prefixes in aegis-config.yaml
  • Store sensitive values in OpenBao and reference them via the secret: prefix
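
The first rule is easy to check automatically. As a minimal sketch, the hypothetical helper below looks for a literal .env entry in .gitignore; it does not understand wildcard patterns.

```shell
#!/bin/sh
# Succeeds if DIR/.gitignore has a literal entry for .env (with or without a leading slash).
env_file_ignored() {
  dir=${1:-.}
  [ -f "$dir/.gitignore" ] && grep -Eq '^/?\.env$' "$dir/.gitignore"
}

# Inside a git work tree, git itself gives the authoritative answer:
#   git check-ignore -q .env && echo ".env is ignored"
#   git ls-files --error-unmatch .env   # succeeds only if .env is tracked
```

If .env was committed in the past, adding it to .gitignore does not remove it from history, so rotate any credentials it contained.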

Resource Limits

Configure container resource limits in pod YAML definitions to prevent resource exhaustion:

resources:
  limits:
    memory: "4Gi"
    cpu: "2000m"
  requests:
    memory: "1Gi"
    cpu: "500m"

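Recommended minimum requests and limits per pod:

| Pod | CPU Request | Memory Request | CPU Limit | Memory Limit |
|---|---|---|---|---|
| pod-core | 1000m | 2Gi | 4000m | 8Gi |
| pod-database | 500m | 1Gi | 2000m | 4Gi |
| pod-temporal | 500m | 1Gi | 2000m | 4Gi |
| pod-observability | 500m | 2Gi | 2000m | 8Gi |
| pod-storage | 250m | 512Mi | 1000m | 2Gi |
| pod-secrets | 100m | 128Mi | 500m | 512Mi |
| pod-seal-gateway | 250m | 256Mi | 1000m | 1Gi |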
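
When sizing a host, it helps to total the per-pod figures. The sketch below sums memory quantities given in Mi or Gi; the helper names are hypothetical.

```shell
#!/bin/sh
# Convert a memory quantity in Mi or Gi to MiB.
to_mib() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Sum a list of memory quantities and print the total in MiB.
sum_mib() {
  total=0
  for quantity in "$@"; do
    total=$(( total + $(to_mib "$quantity") ))
  done
  echo "$total"
}

# Memory requests from the table, in row order.
sum_mib 2Gi 1Gi 1Gi 2Gi 512Mi 128Mi 256Mi   # prints 7040
```

The requests total 7040 MiB (about 6.9 GiB), and the same calculation over the limits column gives 28160 MiB (27.5 GiB). The CPU columns sum to 3100m requested and 12500m at the limit. A single-node host needs at least the requested totals plus headroom for the operating system and any agent containers.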

Network Segmentation

Firewall Rules

Only the Caddy edge proxy should be exposed publicly:

| Port | Protocol | Exposure | Purpose |
|---|---|---|---|
| 80 | TCP | Public | HTTP redirect to HTTPS |
| 443 | TCP | Public | HTTPS (Caddy) |
| All others | TCP | Internal only | Inter-pod communication |
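
One way to express the table above is with ufw, which is an assumption here; the guide does not prescribe a firewall tool. The hypothetical helper below only prints the commands so they can be reviewed before anyone runs them, and the default CIDR is a placeholder for your internal network.

```shell
#!/bin/sh
# Print (do not run) ufw commands that expose only ports 80 and 443 publicly.
# Usage: firewall_plan [INTERNAL_CIDR]
firewall_plan() {
  internal_cidr=${1:-10.0.0.0/8}   # placeholder; use your real internal range
  echo "ufw default deny incoming"
  echo "ufw default allow outgoing"
  echo "ufw allow from $internal_cidr"
  echo "ufw allow 80/tcp"
  echo "ufw allow 443/tcp"
  # Add a rule for your own admin access (for example SSH) before this step.
  echo "ufw enable"
}
```

Review the output of `firewall_plan 192.168.10.0/24` and run the commands as root once they match your network. Ports published by the container runtime may be handled by rules outside ufw, so confirm the result by probing the host from an external machine.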

Podman Network

All pods run on the aegis-network bridge. Agent containers spawned by the orchestrator also join this network so that they can reach NFS on port 2049.

For multi-node deployments, use the cluster protocol (port 50056) with mTLS between nodes. See Multi-Node Deployment.


Access Control

Grafana Access

By default, Grafana allows anonymous viewer access. For production:

  1. Disable anonymous access in the Grafana configuration (enabled = false under [auth.anonymous] in grafana.ini, or GF_AUTH_ANONYMOUS_ENABLED=false)
  2. Set up role-based access control for dashboards

OpenBao Access

  • The OpenBao UI should not be exposed publicly (remove the secrets.* Caddy route)
  • Use AppRole authentication exclusively; avoid root tokens in production
  • Enable an audit device so that every request to OpenBao is logged

Image Security

Scanning

Scan container images for vulnerabilities before deployment:

# Using Trivy
trivy image ghcr.io/100monkeys-ai/aegis-runtime:1.2.3
trivy image ghcr.io/100monkeys-ai/aegis-temporal-worker:1.2.3
trivy image ghcr.io/100monkeys-ai/aegis-seal-gateway:1.2.3
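
To make scanning a deployment gate, the commands above can be wrapped so that any HIGH or CRITICAL finding fails the run. This is a sketch, not part of the AEGIS tooling: scan_images and the TRIVY_BIN override are hypothetical names, while --severity and --exit-code are standard Trivy flags.

```shell
#!/bin/sh
# Scan each image; return non-zero if any scan reports HIGH or CRITICAL findings.
# TRIVY_BIN optionally overrides the trivy executable (handy for testing).
scan_images() {
  failed=0
  for image in "$@"; do
    if ! "${TRIVY_BIN:-trivy}" image --severity HIGH,CRITICAL --exit-code 1 "$image"; then
      echo "scan failed: $image" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Pass the same image references and tag you deploy, for example `scan_images ghcr.io/100monkeys-ai/aegis-runtime:1.2.3`. The loop scans every image before failing, so one run reports all affected images.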

Pinned Versions

Always use pinned semver tags in production, never :latest:

# In .env
AEGIS_IMAGE_TAG=1.2.3

Monitoring & Alerting

Verify that all Prometheus alert rules are active:

# List alerting rule names via the Prometheus API (recording rules excluded)
curl -s 'http://localhost:9090/api/v1/rules?type=alert' | jq -r '.data.groups[].rules[].name'

Configure alert routing in Prometheus Alertmanager for your notification channels (PagerDuty, Slack, email).

