Diagnostic commands, common failure patterns, service dependency chains, and log analysis for AEGIS platform deployments.

Troubleshooting

This guide covers diagnostic procedures for common issues in AEGIS platform deployments.

Diagnostic Commands

Quick Health Check

# Check all pod status
make status

# Validate all service health endpoints
make validate

# View overall system state
podman pod ps
podman ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

Log Inspection

# Stream logs for a specific pod
make logs POD=core
make logs POD=database
make logs POD=temporal

# Tail specific container logs
podman logs -f --tail 100 aegis-core-aegis-runtime
podman logs -f --tail 100 aegis-database-postgres

# Search logs for errors
podman logs aegis-core-aegis-runtime 2>&1 | grep -i error

Grafana Log Explorer

For structured log searching, use the Grafana Logs Explorer dashboard at http://localhost:3300:

1. Navigate to Explore, then select the Loki datasource.
2. Run the query: {container_name="aegis-runtime"} |= "error"
3. Filter by time range and log level.

Service Dependency Chain

When troubleshooting startup failures, check dependencies in this order:

1. pod-database (PostgreSQL)         <- Everything depends on this
   +-- 2. pod-secrets (OpenBao)      <- Core needs secrets
       +-- 3. pod-core               <- Needs DB, secrets
           |-- 4. pod-temporal       <- Worker needs core gRPC
           |-- 5. pod-storage        <- Core connects to filer
           +-- 6. pod-seal-gateway   <- Needs DB, Podman socket
7. pod-observability                  <- Independent, but needs targets running

If pod-core fails to start, check pod-database and pod-secrets first.
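
The ordering above can be checked mechanically. The following is a minimal sketch, not part of the AEGIS tooling: it reads pod names and statuses on stdin and reports the first pod in dependency order that is not running. The pod names in AEGIS_POD_ORDER are assumptions inferred from the container names used in this guide; adjust them to match what podman pod ps prints on your host.

```shell
# Pod names are assumptions; compare with the output of `podman pod ps`.
AEGIS_POD_ORDER="aegis-database aegis-secrets aegis-core aegis-temporal aegis-storage aegis-seal-gateway"

# Reads "<name> <status>" lines on stdin.
# Prints "<pod> <status>" for the first pod in dependency order that is not
# "Running" (or "<pod> missing" if it is absent) and returns 1.
# Prints nothing and returns 0 when every pod is running.
first_unhealthy_pod() {
  statuses=$(cat)
  for pod in $AEGIS_POD_ORDER; do
    status=$(printf '%s\n' "$statuses" | awk -v p="$pod" '$1 == p { print $2; exit }')
    if [ "$status" != "Running" ]; then
      printf '%s %s\n' "$pod" "${status:-missing}"
      return 1
    fi
  done
  return 0
}

# Usage:
#   podman pod ps --format '{{.Name}} {{.Status}}' | first_unhealthy_pod
```

Start troubleshooting at the pod it reports, since everything later in the chain depends on it.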

Common Failure Patterns

AEGIS Runtime Won't Start

Symptoms: pod-core container exits immediately or enters a restart loop.

# Check logs
make logs POD=core

# Common causes:
# 1. Database unreachable
podman exec aegis-database-postgres pg_isready -U aegis

# 2. Invalid aegis-config.yaml
# Look for YAML parse errors in logs

# 3. SEAL keys not generated
ls -la /path/to/seal-keys/
make generate-keys  # if missing

Database Connection Refused

Symptoms: connection refused errors in runtime logs.

# Check if database is running
podman pod ps | grep database

# Check PostgreSQL logs
podman logs aegis-database-postgres

# Test connectivity
podman exec aegis-database-postgres pg_isready -U aegis

# Common fix: ensure pod-database started before pod-core
make redeploy POD=database
sleep 10
make redeploy POD=core
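
A fixed sleep can be too short on a slow host and needlessly long on a fast one. As a hedged alternative, the sketch below polls until a readiness command succeeds. wait_for and WAIT_INTERVAL are hypothetical names introduced here for illustration; the pg_isready call is the same one used above.

```shell
# Retry a command until it succeeds or the attempt budget is used up.
# $1 = maximum attempts, remaining arguments = command to run.
# WAIT_INTERVAL (seconds between attempts) defaults to 1.
wait_for() {
  attempts=$1
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "${WAIT_INTERVAL:-1}"
  done
  return 1
}

# Usage (in place of the fixed sleep):
#   make redeploy POD=database
#   wait_for 30 podman exec aegis-database-postgres pg_isready -U aegis \
#     && make redeploy POD=core
```

pg_isready exits 0 only when the server is accepting connections, so pod-core is redeployed as soon as the database is actually ready.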

Temporal Connection Failed

Symptoms: Failed to connect to Temporal in runtime logs; workflows don't execute.

# Check Temporal health
podman exec aegis-temporal-temporal temporal operator cluster health

# Check Temporal logs
podman logs aegis-temporal-temporal

# Verify Temporal UI is accessible
curl -s http://localhost:8233 | head -1

Agent Containers Not Starting

Symptoms: Executions fail with container creation errors.

# Check Podman socket
ls -la /run/user/$(id -u)/podman/podman.sock

# Verify socket is active
systemctl --user status podman.socket

# Check if images are available
podman images | grep python

# Test manual container creation
podman run --rm python:3.11-slim python -c "print('ok')"

# Check AEGIS network exists
podman network ls | grep aegis

NFS Mount Failures

Symptoms: Agent containers fail to mount volumes; mount.nfs: Connection refused.

# Check NFS port is listening
ss -tlnp | grep 2049

# Verify from container perspective
podman run --rm --network aegis-network alpine ping -c1 host.containers.internal

# Check runtime logs for NFS errors
make logs POD=core | grep -i nfs
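
Note that grep 2049 also matches unrelated ports such as 20490. A stricter check, sketched below under the assumption that the input is the standard ss -tln column layout (local address in the fourth column), matches the port exactly. is_listening is a hypothetical helper name.

```shell
# Succeed if `ss -tln` output on stdin shows a listener on exactly port $1.
is_listening() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# Usage:
#   ss -tln | is_listening 2049 && echo "NFS port is listening"
```

The same helper can be pointed at other ports mentioned in this guide, for example 50053 for the FUSE daemon.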

OpenBao Sealed

Symptoms: Runtime fails to resolve secrets; sealed status from OpenBao.

# Check seal status
curl -s http://localhost:8200/v1/sys/health | jq .sealed

# If sealed, unseal with your keys
# See: Disaster Recovery guide
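
If jq is not installed, the HTTP status code of the health endpoint carries the same information. The mapping below is a sketch that assumes OpenBao keeps the Vault-compatible default status codes and that the deployment does not override them with query parameters. bao_state_from_code is a hypothetical helper name.

```shell
# Map the HTTP status of /v1/sys/health to a seal state.
# Codes are the Vault-compatible defaults (assumption: not overridden).
bao_state_from_code() {
  case "$1" in
    200) echo "unsealed-active" ;;
    429) echo "unsealed-standby" ;;
    501) echo "uninitialized" ;;
    503) echo "sealed" ;;
    000) echo "unreachable" ;;   # curl prints 000 when no response arrived
    *)   echo "unknown($1)" ;;
  esac
}

# Usage:
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8200/v1/sys/health)
#   bao_state_from_code "$code"
```

A result of unreachable points at pod-secrets itself rather than at the seal state.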

FUSE Daemon Issues

FUSE Daemon Not Running

Symptoms: Executions fail with FUSE daemon unreachable or connection refused on port 50053.

# Check if the FUSE daemon is running
aegis fuse-daemon status

# Check systemd service status (rootless)
systemctl --user status aegis-fuse-daemon

# Start or restart the daemon
aegis fuse-daemon start
# or
systemctl --user restart aegis-fuse-daemon

The FUSE daemon must be started before the orchestrator. If the orchestrator is already running, restart it after starting the FUSE daemon.

FUSE Mount Not Visible in Container

Symptoms: Agent container sees an empty /workspace directory.

# Verify the FUSE mountpoint exists on the host
ls -la /tmp/aegis-fuse/

# Check mount propagation
findmnt | grep aegis-fuse

# Verify the daemon is connected to the orchestrator
aegis fuse-daemon status --output json

This typically indicates a mount propagation issue. Ensure the FUSE mount directory uses bidirectional mount propagation so containers can see host-mounted FUSE filesystems.

Permission Denied on FUSE Operations

Symptoms: Agent gets EACCES or Permission denied when accessing /workspace files.

# Check the FUSE daemon logs
journalctl --user -u aegis-fuse-daemon --since "5 min ago"

# Verify the execution owns the volume
# Look for UnauthorizedVolumeAccess events in orchestrator logs
make logs POD=core | grep -i "unauthorized\|volume\|fuse"

Permission errors on FUSE mounts follow the same AegisFSAL authorization model as NFS. Check that the execution's manifest FilesystemPolicy allows the operation.

Stale FUSE Mounts

Symptoms: Mountpoints remain under /tmp/aegis-fuse/ after executions end; new executions fail with mount point busy.

# List stale mounts
mount | grep aegis-fuse

# Unmount a stale FUSE mount (add -z for a lazy unmount if the mountpoint is busy)
fusermount -u /tmp/aegis-fuse/<volume_id>

# Restart the FUSE daemon to clean up all mounts
aegis fuse-daemon stop
aegis fuse-daemon start

Stale mounts can occur if the orchestrator crashes without calling Unmount. Restarting the FUSE daemon unmounts all active mountpoints and starts fresh.
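
When restarting the daemon is not an option, stale mountpoints can be unmounted one at a time. The sketch below only lists candidates. It assumes every mount under /tmp/aegis-fuse/ belongs to AEGIS and that volume IDs contain no whitespace. fuse_mountpoints is a hypothetical helper name.

```shell
# Print mountpoints under a prefix from /proc/mounts-style input on stdin.
# Field 2 of each line is the mountpoint. The default prefix is the
# directory used in this guide.
fuse_mountpoints() {
  prefix=${1:-/tmp/aegis-fuse/}
  awk -v pre="$prefix" 'index($2, pre) == 1 { print $2 }'
}

# Usage (dry run: prints the unmount commands instead of running them):
#   fuse_mountpoints < /proc/mounts | while read -r mp; do
#     echo fusermount -u "$mp"
#   done
```

Review the printed commands and confirm that no execution is still using a volume before running them for real.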

Network Debugging

# Check pod network connectivity
podman exec aegis-core-aegis-runtime curl -s http://aegis-database:5432 || echo "expected - not HTTP"
podman exec aegis-core-aegis-runtime curl -s http://aegis-temporal:7233 || echo "expected - not HTTP"

# DNS resolution within pods
podman exec aegis-core-aegis-runtime getent hosts aegis-database

# Check network
podman network inspect aegis-network
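
Because the database and Temporal ports do not speak HTTP, curl fails in the healthy case too, so the exit code is what distinguishes an open port from a broken network path. The sketch below maps the documented curl exit codes to a short verdict. Reading 1, 52, and 56 as "port open" is an interpretation for these particular services, not a guarantee. classify_curl_exit is a hypothetical helper name.

```shell
# Interpret a curl exit code obtained by probing a non-HTTP port.
classify_curl_exit() {
  case "$1" in
    0)       echo "reachable: got an HTTP response" ;;
    6)       echo "dns-failure: host name did not resolve" ;;
    7)       echo "refused: nothing accepted the connection" ;;
    28)      echo "timeout: no answer within the time limit" ;;
    1|52|56) echo "reachable: port open, peer does not speak HTTP" ;;
    *)       echo "other: curl exit $1" ;;
  esac
}

# Usage (podman exec passes through the exit code of the command it ran):
#   podman exec aegis-core-aegis-runtime \
#     curl -s -o /dev/null --max-time 5 http://aegis-database:5432
#   classify_curl_exit $?
```

A dns-failure verdict suggests checking the network definition, while refused suggests the target pod is down or not yet listening.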

Resource Exhaustion

# Check disk usage
df -h
podman system df

# Check memory usage per container
podman stats --no-stream

# Clean up unused images and containers
podman system prune -f

# Check PostgreSQL connections
podman exec aegis-database-postgres psql -U aegis -c "SELECT count(*) FROM pg_stat_activity;"
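
A raw connection count is easier to judge against the server limit. The sketch below computes usage as a percentage of max_connections. conn_usage_pct, the q wrapper, and the 80 percent threshold are illustrative assumptions; note that pg_stat_activity also lists background processes, so the figure is approximate.

```shell
# Integer percentage of connections in use.
# $1 = current connection count, $2 = max_connections.
conn_usage_pct() {
  current=$1
  max=$2
  echo $(( current * 100 / max ))
}

# Usage:
#   q() { podman exec aegis-database-postgres psql -U aegis -tAc "$1"; }
#   pct=$(conn_usage_pct "$(q 'SELECT count(*) FROM pg_stat_activity;')" \
#                        "$(q 'SHOW max_connections;')")
#   [ "$pct" -ge 80 ] && echo "connections at ${pct}% of max_connections"
```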

Collecting Diagnostics for Support

When reporting issues, collect:

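# System info (starts a fresh diagnostics.txt)
uname -a > diagnostics.txt 2>&1
podman version >> diagnostics.txt 2>&1
podman info >> diagnostics.txt 2>&1

# Pod status
make status >> diagnostics.txt 2>&1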

# Recent logs (last 500 lines per service)
for pod in core database temporal secrets storage observability seal-gateway; do
  echo "=== $pod ===" >> diagnostics.txt
  make logs POD=$pod 2>&1 | tail -500 >> diagnostics.txt
done

# Health check results
make validate >> diagnostics.txt 2>&1
