Troubleshooting
Diagnostic commands, common failure patterns, service dependency chains, and log analysis for AEGIS platform deployments.
This guide covers diagnostic procedures for common issues in AEGIS platform deployments.
Diagnostic Commands
Quick Health Check
# Check all pod status
make status
# Validate all service health endpoints
make validate
# View overall system state
podman pod ps
podman ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
Log Inspection
# Stream logs for a specific pod
make logs POD=core
make logs POD=database
make logs POD=temporal
# Tail specific container logs
podman logs -f --tail 100 aegis-core-aegis-runtime
podman logs -f --tail 100 aegis-database-postgres
# Search logs for errors
podman logs aegis-core-aegis-runtime 2>&1 | grep -i error
Grafana Log Explorer
For structured log searching, use the Grafana Logs Explorer dashboard at http://localhost:3300:
- Navigate to Explore, then select the Loki datasource
- Query: {container_name="aegis-runtime"} |= "error"
- Filter by time range and log level
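The same query can also be run from a terminal. This is a minimal sketch, assuming Loki's HTTP API is reachable on its default port 3100; the LOKI_URL value and the helper names logql_query and loki_search are illustrative and not part of AEGIS.

```shell
# Hypothetical helper: build a LogQL line-filter query for one container.
# The container_name label is taken from the Grafana query in this section.
logql_query() {
  local container="$1" pattern="$2"
  printf '{container_name="%s"} |= "%s"\n' "$container" "$pattern"
}

# Assumption: Loki listens on its default port; adjust for your deployment.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Fetch up to 50 matching lines through Loki's query_range endpoint.
loki_search() {
  curl -sG "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$(logql_query "$1" "$2")" \
    --data-urlencode "limit=50"
}
```

For example, loki_search aegis-runtime error returns the matching streams as JSON, which can be piped through jq for reading.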
Service Dependency Chain
When troubleshooting startup failures, check dependencies in this order:
1. pod-database (PostgreSQL)          <- Everything depends on this
   +-- 2. pod-secrets (OpenBao)       <- Core needs secrets
       +-- 3. pod-core                <- Needs DB, secrets
           |-- 4. pod-temporal        <- Worker needs core gRPC
           |-- 5. pod-storage         <- Core connects to filer
           +-- 6. pod-seal-gateway    <- Needs DB, Podman socket
7. pod-observability                  <- Independent, but needs targets running
If pod-core fails to start, check pod-database and pod-secrets first.
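Walking the chain by hand can be scripted. The sketch below reads pod status lines and prints the first pod, in the order listed above, that is not reported as Running. The helper name first_blocked_pod is hypothetical, and pods are matched by substring because the exact pod names in a given deployment are an assumption.

```shell
# Hypothetical helper: read "<pod name> <status>" lines on stdin and print
# the first pod, in dependency order, that is not reported as Running.
# Returns 0 if such a pod was found, 1 if every pod in the chain is Running.
first_blocked_pod() {
  local status_lines pod
  status_lines="$(cat)"
  for pod in database secrets core temporal storage seal-gateway observability; do
    # A missing pod and a pod in any other state both count as blocked.
    if ! printf '%s\n' "$status_lines" | grep -- "$pod" | grep -q 'Running'; then
      echo "$pod"
      return 0
    fi
  done
  return 1
}
```

A possible invocation is podman pod ps --format '{{.Name}} {{.Status}}' | first_blocked_pod. The pod it prints is the one to investigate first, since everything later in the chain depends on it.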
Common Failure Patterns
AEGIS Runtime Won't Start
Symptoms: pod-core container exits immediately or enters a restart loop.
# Check logs
make logs POD=core
# Common causes:
# 1. Database unreachable
podman exec aegis-database-postgres pg_isready -U aegis
# 2. Invalid aegis-config.yaml
# Look for YAML parse errors in logs
# 3. SEAL keys not generated
ls -la /path/to/seal-keys/
make generate-keys # if missing
Database Connection Refused
Symptoms: connection refused errors in runtime logs.
# Check if database is running
podman pod ps | grep database
# Check PostgreSQL logs
podman logs aegis-database-postgres
# Test connectivity
podman exec aegis-database-postgres pg_isready -U aegis
# Common fix: ensure pod-database started before pod-core
make redeploy POD=database
sleep 10
make redeploy POD=core
Temporal Connection Failed
Symptoms: Failed to connect to Temporal in runtime logs; workflows don't execute.
# Check Temporal health
podman exec aegis-temporal-temporal temporal operator cluster health
# Check Temporal logs
podman logs aegis-temporal-temporal
# Verify Temporal UI is accessible
curl -s http://localhost:8233 | head -1
Agent Containers Not Starting
Symptoms: Executions fail with container creation errors.
# Check Podman socket
ls -la /run/user/$(id -u)/podman/podman.sock
# Verify socket is active
systemctl --user status podman.socket
# Check if images are available
podman images | grep python
# Test manual container creation
podman run --rm python:3.11-slim python -c "print('ok')"
# Check AEGIS network exists
podman network ls | grep aegis
NFS Mount Failures
Symptoms: Agent containers fail to mount volumes; mount.nfs: Connection refused.
# Check NFS port is listening
ss -tlnp | grep 2049
# Verify from container perspective
podman run --rm --network aegis-network alpine ping -c1 host.containers.internal
# Check runtime logs for NFS errors
make logs POD=core | grep -i nfs
OpenBao Sealed
Symptoms: Runtime fails to resolve secrets; sealed status from OpenBao.
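The seal check in this section reads the JSON body, but the health endpoint also reports state through its HTTP status code, which is convenient in scripts. This is a minimal sketch, assuming OpenBao keeps the default status codes of the Vault-compatible sys/health API (200 active, 429 standby, 501 not initialized, 503 sealed). The helper names describe_health and bao_status_code are illustrative.

```shell
# Hypothetical helper: translate a sys/health HTTP status code into words.
# Assumption: default status codes, with no query parameters overriding them.
describe_health() {
  case "$1" in
    200) echo "initialized, unsealed, active" ;;
    429) echo "unsealed, standby" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    000) echo "unreachable" ;;   # curl prints 000 when no response arrived
    *)   echo "unexpected status $1" ;;
  esac
}

# Print only the HTTP status code of the health endpoint, discarding the body.
bao_status_code() {
  curl -s -o /dev/null -w '%{http_code}' "${1:-http://localhost:8200}/v1/sys/health"
}
```

Calling describe_health "$(bao_status_code)" then gives a one-line answer before you inspect the full response.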
# Check seal status
curl -s http://localhost:8200/v1/sys/health | jq .sealed
# If sealed, unseal with your keys
# See: Disaster Recovery guide
FUSE Daemon Issues
FUSE Daemon Not Running
Symptoms: Executions fail with FUSE daemon unreachable or connection refused on port 50053.
# Check if the FUSE daemon is running
aegis fuse-daemon status
# Check systemd service status (rootless)
systemctl --user status aegis-fuse-daemon
# Start or restart the daemon
aegis fuse-daemon start
# or
systemctl --user restart aegis-fuse-daemon
The FUSE daemon must be started before the orchestrator. If the orchestrator is already running, restart it after starting the FUSE daemon.
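Because the start order matters, it can help to wait until the daemon is actually listening before restarting the orchestrator. The sketch below assumes the daemon accepts TCP connections on port 50053, as the symptom above suggests. The helper names port_listening and wait_until are hypothetical.

```shell
# Succeeds if something is listening on the given TCP port, judged from
# the Local Address:Port column of `ss -tln`.
port_listening() {
  ss -tln | awk -v port="$1" \
    '$1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# Hypothetical helper: run a command until it succeeds, making at most
# $1 attempts and sleeping $2 seconds between attempts.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while ! "$@"; do
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}
```

After aegis fuse-daemon start, running wait_until 15 1 port_listening 50053 blocks until the port is open or 15 attempts have failed. The same helper could replace a fixed sleep elsewhere, for example wait_until 30 1 podman exec aegis-database-postgres pg_isready -U aegis before redeploying pod-core.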
FUSE Mount Not Visible in Container
Symptoms: Agent container sees an empty /workspace directory.
# Verify the FUSE mountpoint exists on the host
ls -la /tmp/aegis-fuse/
# Check mount propagation
findmnt | grep aegis-fuse
# Verify the daemon is connected to the orchestrator
aegis fuse-daemon status --output json
This typically indicates a mount propagation issue. Ensure the FUSE mount directory uses bidirectional mount propagation so containers can see host-mounted FUSE filesystems.
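The propagation mode can be read directly from findmnt through its PROPAGATION column. This is a rough sketch; the helper names are hypothetical, and /tmp/aegis-fuse is the mount directory used in the commands above.

```shell
# Hypothetical helper: succeeds when a findmnt PROPAGATION value includes
# "shared", the mode that lets mount events pass between host and container.
is_shared() {
  case "$1" in
    *shared*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report the propagation mode of the filesystem that holds the FUSE directory.
check_fuse_propagation() {
  local dir="${1:-/tmp/aegis-fuse}" mode
  mode="$(findmnt -n -o PROPAGATION --target "$dir")"
  if is_shared "$mode"; then
    echo "$dir: propagation is $mode"
  else
    echo "$dir: propagation is ${mode:-unknown}, expected shared" >&2
    return 1
  fi
}
```

With plain Podman, bidirectional propagation corresponds to the rshared volume option on the bind mount. How AEGIS configures that mount is not covered in this guide, so treat the check as a starting point.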
Permission Denied on FUSE Operations
Symptoms: Agent gets EACCES or Permission denied when accessing /workspace files.
# Check the FUSE daemon logs
journalctl --user -u aegis-fuse-daemon --since "5 min ago"
# Verify the execution owns the volume
# Look for UnauthorizedVolumeAccess events in orchestrator logs
make logs POD=core | grep -i "unauthorized\|volume\|fuse"
Permission errors on FUSE mounts follow the same AegisFSAL authorization model as NFS. Check that the execution's manifest FilesystemPolicy allows the operation.
Stale FUSE Mounts
Symptoms: Mountpoints remain under /tmp/aegis-fuse/ after executions end; new executions fail with mount point busy.
# List stale mounts
mount | grep aegis-fuse
# Unmount a stale FUSE mount (add -z for a lazy unmount if the mountpoint is busy)
fusermount -u /tmp/aegis-fuse/<volume_id>
# Restart the FUSE daemon to clean up all mounts
aegis fuse-daemon stop
aegis fuse-daemon start
Stale mounts can occur if the orchestrator crashes without calling Unmount. Restarting the FUSE daemon unmounts all active mountpoints and starts fresh.
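When a daemon restart is not an option, the leftover mountpoints can be unmounted one at a time. This is a minimal sketch with hypothetical helper names. It treats every mount under /tmp/aegis-fuse/ as stale, so run it only when no executions are active.

```shell
# Hypothetical helper: read /proc/mounts-style lines on stdin and print
# every mountpoint (second field) that starts with the given prefix.
list_mounts_under() {
  awk -v prefix="$1" 'index($2, prefix) == 1 { print $2 }'
}

# Unmount each FUSE mountpoint found under /tmp/aegis-fuse/, reporting
# the ones that could not be unmounted instead of stopping at the first.
unmount_stale_fuse() {
  list_mounts_under /tmp/aegis-fuse/ < /proc/mounts |
    while read -r mountpoint; do
      fusermount -u "$mountpoint" || echo "could not unmount $mountpoint" >&2
    done
}
```

Running list_mounts_under /tmp/aegis-fuse/ < /proc/mounts on its own first shows what unmount_stale_fuse would touch.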
Network Debugging
# Check pod network connectivity
podman exec aegis-core-aegis-runtime curl -s http://aegis-database:5432 || echo "expected - not HTTP"
podman exec aegis-core-aegis-runtime curl -s http://aegis-temporal:7233 || echo "expected - not HTTP"
# DNS resolution within pods
podman exec aegis-core-aegis-runtime getent hosts aegis-database
# Check network
podman network inspect aegis-network
Resource Exhaustion
# Check disk usage
df -h
podman system df
# Check memory usage per container
podman stats --no-stream
# Clean up unused images and containers
podman system prune -f
# Check PostgreSQL connections
podman exec aegis-database-postgres psql -U aegis -c "SELECT count(*) FROM pg_stat_activity;"
Collecting Diagnostics for Support
When reporting issues, collect:
# System info
uname -a > diagnostics.txt 2>&1
podman version >> diagnostics.txt 2>&1
podman info >> diagnostics.txt 2>&1
# Pod status
make status >> diagnostics.txt 2>&1
# Recent logs (last 500 lines per service)
for pod in core database temporal secrets storage observability seal-gateway; do
echo "=== $pod ===" >> diagnostics.txt
make logs POD=$pod 2>&1 | tail -500 >> diagnostics.txt
done
# Health check results
make validate >> diagnostics.txt 2>&1
See Also
- Disaster Recovery — recovery from failures
- Observability — monitoring and log analysis
- Pod Architecture — container and port reference
- Infrastructure Requirements — dependency matrix