Troubleshooting
SEAL Gateway runbook — symptom-driven diagnosis and remediation for the failure modes operators actually hit in production.
This page is a runbook. Each section is keyed to a symptom an operator will actually see — a log line, an HTTP status, a UI error — and walks through the likely causes, the diagnostic commands that prove which cause it is, and the fix.
When a section says "read the audit event," it means querying
gateway_events directly. There is no REST endpoint for that today; see
Observability for
the SQL recipes.
Signature Verification Failed
Symptom. Invocation requests to /v1/invoke or /v1/seal/invoke
return 401 Unauthorized with a body mentioning signature verification.
Logs include lines like:
WARN signature verification failed: invalid signature for payload
Likely causes.
- The configured seal_jwt_public_key_pem does not match the issuer's private key.
- The payload was modified between signing and submission (proxy rewrites bodies, JSON keys reordered after canonicalization, character encoding drift).
- Clock skew large enough to fail the freshness window before the signature is checked.
- The wrong canonicalization algorithm on the client side.
Diagnose.
# Confirm the gateway is up before comparing keys
curl -s http://localhost:8089/health # gateway up
# Print the public key derivable from the signer's private key,
# from the signer's host:
openssl pkey -in /path/to/signer-private.pem -pubout
# Compare byte-for-byte with the gateway's seal_jwt_public_key_pem
# (PEM, including BEGIN/END lines).
Fix.
- If the public keys differ, redeploy the gateway with the correct PEM in SEAL_GATEWAY_SEAL_JWT_PUBLIC_KEY_PEM or spec.auth.seal_jwt_public_key_pem.
- If they match, capture the raw bytes the client signed and the raw bytes the gateway received and diff them. Body mutation by a reverse proxy is the most common culprit; disable response/request rewriting on the proxy.
- If clock skew is suspect, sync NTP on both the signer and gateway hosts.
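The byte-level comparison in the second fix can be sketched as follows. This is a minimal illustration: the two capture files are hypothetical, and how you obtain them depends on your client and proxy. It relies only on the POSIX cmp utility.

```shell
# Sketch: report whether two captured payloads are byte-identical.
# Both arguments are paths to hypothetical capture files: the bytes the
# client signed and the bytes the gateway received.
compare_payloads() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    # Non-silent cmp prints the offset of the first differing byte.
    cmp "$1" "$2" || true
    echo "different"
  fi
}
```

A result of "different" on payloads that are semantically the same JSON usually points at key reordering or re-encoding somewhere between the signer and the gateway.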
JTI Replay Rejected
Symptom. A request fails with 409 Conflict or 401 Unauthorized and
a message like replay detected: jti already seen.
Likely causes.
- A client retried a request without generating a fresh jti.
- The same envelope was sent through two paths and one of them lost the race (load balancer fan-out, dual proxies).
- A client library has a bug that reuses the jti field across calls.
Diagnose.
-- Is the jti currently in the replay window?
SELECT jti, expires_at FROM seen_jtis WHERE jti = '<the jti>';
Fix. Ensure the client generates a fresh UUIDv4 (or any collision-resistant token id) for every envelope, including retries. The jti is single-use within the freshness window: a retry is conceptually a new call and must carry a new id.
The replay window is 30 seconds. The background sweep purges expired entries every 30 seconds. If your client retries within that window with the same id, the request is correctly rejected.
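A retry loop that honours the single-use rule might look like the sketch below. The send_envelope function is a hypothetical placeholder for whatever signs and submits your envelope; it is not part of the gateway or any client library. The only point illustrated is that the jti is minted inside the loop, so no attempt ever reuses one.

```shell
# Sketch: mint a fresh jti for every attempt, retries included.
new_jti() {
  if [ -r /proc/sys/kernel/random/uuid ]; then
    cat /proc/sys/kernel/random/uuid   # Linux kernel source of random UUIDs
  else
    uuidgen | tr 'A-Z' 'a-z'           # fallback where /proc is unavailable
  fi
}

# send_envelope is a hypothetical placeholder: it receives the jti as its
# first argument and returns non-zero when the call should be retried.
invoke_with_retry() {
  attempts="$1"
  i=1
  while [ "$i" -le "$attempts" ]; do
    jti="$(new_jti)"                   # new id on each pass, never reused
    if send_envelope "$jti"; then
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}
```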
Tenant Boundary Violation
Symptom. A request fails with 403 Forbidden and a body mentioning
tenant mismatch or unauthorized tenant.
Likely causes.
- The SEAL token's tenant_id claim does not match the tenant of the resource being accessed (e.g. invoking a workflow registered to tenant A with a token issued for tenant B).
- A consumer (end-user) token attempted to delegate. Only service-account tokens may delegate to a different tenant.
- The resource is system-global (tenant_id NULL) and the caller is trying to mutate it without operator scope.
Diagnose.
-- Inspect the resource's tenant
SELECT id, name, tenant_id FROM workflows WHERE name = '<workflow name>';
# Decode the SEAL token's tenant claim (the bearer JWT inside the envelope)
echo '<security_token>' | cut -d. -f2 | base64 -d | jq .tenant_id
Fix. Issue the envelope from the correct tenant. If a service account needs to operate across tenants, give it the appropriate delegation grant in the issuer (Keycloak); the gateway will then accept its tenant claim. End-user tokens cannot bypass this.
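JWT segments are base64url-encoded without padding, so a plain base64 -d can fail on tokens whose payload contains - or _ or whose length is not a multiple of four. A more tolerant decoder, offered as a sketch and assuming a base64 that accepts -d, could look like this:

```shell
# Sketch: decode one dot-separated JWT segment (1 = header, 2 = payload).
# Translates the base64url alphabet back to standard base64 and restores
# the '=' padding that JWTs omit.
jwt_segment() {
  seg="$(printf '%s' "$1" | cut -d. -f"$2" | tr '_-' '/+')"
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}
```

The decoded payload can then be piped to jq .tenant_id as in the command above. Decoding does not verify the signature; it only shows what the token claims.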
Container CLI Not Found
Symptom. Startup logs show failed to resolve container CLI: no podman or docker on PATH, or a CLI tool invocation fails with
container CLI execution failed.
Likely causes.
- The gateway is running as a user that has no container runtime on $PATH.
- The gateway image is being run with the host Docker socket mounted but the user inside the container does not have access to it.
- A custom cli.container_cli path in config points at a binary that is not present in the image.
Diagnose. From inside the gateway container:
podman --version
docker --version
which podman docker
ls -l /var/run/docker.sock 2>/dev/null
The published gateway image bundles podman and fuse-overlayfs, so podman --version should always succeed.
Fix.
- For the published image, no action should be needed; if podman resolution is failing, you have a corrupted image. Repull it.
- For binary deployments, install podman (or docker) on the host and ensure the service user can execute it.
- If you are intentionally driving the host Docker daemon, mount /var/run/docker.sock into the gateway container and ensure the container user is in the docker group on the host.
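To check what the service user can actually resolve, a lookup along these lines may help. It is a sketch only: the podman-then-docker order is an assumption read off the wording of the startup error, not taken from the gateway source.

```shell
# Sketch: print the first container CLI found on PATH, trying podman first.
# The lookup order is an assumption, not confirmed gateway behaviour.
resolve_container_cli() {
  for candidate in podman docker; do
    if command -v "$candidate" >/dev/null 2>&1; then
      command -v "$candidate"
      return 0
    fi
  done
  echo "no podman or docker on PATH" >&2
  return 1
}
```

Run it as the same user, and with the same environment, that the gateway process uses; a login shell often has a different PATH from a service unit.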
OpenBao Credential Resolution Failed
Symptom. A CredentialExchangeFailed event lands in the audit table.
Workflows that depend on the credential return 502 Bad Gateway to the
caller.
Likely causes.
- SEAL_GATEWAY_OPENBAO_TOKEN is invalid or expired.
- openbao_addr is unreachable from the gateway pod (network policy, service name typo).
- The KV mount (openbao_kv_mount, default secret) does not contain the path the credential resolver is asking for.
- For dynamic credentials (SystemJit path), the OpenBao role does not exist or the token does not have the policy that allows reading it.
Diagnose.
SELECT created_at, payload->>'resolution_path' AS path, payload->>'reason' AS reason
FROM gateway_events
WHERE event_type = 'CredentialExchangeFailed'
ORDER BY created_at DESC
LIMIT 20;
# From inside the gateway container
curl -s -H "X-Vault-Token: $SEAL_GATEWAY_OPENBAO_TOKEN" \
"$SEAL_GATEWAY_OPENBAO_ADDR/v1/sys/health" | jq
# Read the secret the resolver is asking for
curl -s -H "X-Vault-Token: $SEAL_GATEWAY_OPENBAO_TOKEN" \
"$SEAL_GATEWAY_OPENBAO_ADDR/v1/secret/data/<path>" | jq
Fix. Apply the remedy for whichever sub-cause the diagnostics point at: rotate the token, fix the network path, create the missing secret, or attach the missing policy. The reason field on the audit event carries the exact OpenBao error string, so work from that.
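The read command above hardcodes the default mount name. If openbao_kv_mount is set to something else, the URL changes accordingly. The helper below is a sketch that assumes a KV version 2 mount, which is what the /data/ segment in the command above implies; the function name is illustrative.

```shell
# Sketch: build the KV v2 read URL for a given address, mount, and path.
kv_read_url() {
  addr="${1%/}"          # strip one trailing slash from the address
  mount="${2:-secret}"   # fall back to the default mount named on this page
  path="${3#/}"          # strip one leading slash from the secret path
  printf '%s/v1/%s/data/%s\n' "$addr" "$mount" "$path"
}
```

Pass the resulting URL to the same curl invocation, keeping the X-Vault-Token header.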
Keycloak Token Exchange Returned 401
Symptom. A HumanDelegated credential exchange fails. Logs include
token exchange returned 401 from the Keycloak endpoint.
Likely causes.
- The Keycloak client used for exchange (keycloak_client_id) is not permitted to perform token exchange. The urn:ietf:params:oauth:grant-type:token-exchange grant must be enabled on the client.
- The client secret (keycloak_client_secret) is wrong or rotated.
- The audience requested in the exchange does not match an audience the client is allowed to mint tokens for.
- The subject token (the user's session token) is expired by the time the exchange attempt happens.
Diagnose. Reproduce the call directly against Keycloak:
curl -s -X POST "$SEAL_GATEWAY_KEYCLOAK_TOKEN_EXCHANGE_URL" \
-d "grant_type=urn:ietf:params:oauth:grant-type:token-exchange" \
-d "client_id=$SEAL_GATEWAY_KEYCLOAK_CLIENT_ID" \
-d "client_secret=$SEAL_GATEWAY_KEYCLOAK_CLIENT_SECRET" \
-d "subject_token=<user access token>" \
-d "audience=<target audience>" | jq
The response body identifies which grant or audience is misconfigured.
Fix. In the Keycloak admin UI, on the gateway's exchange client:
- Enable Token Exchange under Capability config.
- Under Client scopes / Authorization, allow the target audience.
- Rotate and refresh the client secret if it does not match.
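To rule out the expired-subject-token cause before changing anything in Keycloak, compare the token's exp claim with the current time. The sketch below works on an already decoded payload and uses a deliberately crude sed pattern so that it has no dependencies; where jq is available, jq .exp is the better choice. Both function names are illustrative.

```shell
# Sketch: read the numeric "exp" claim from a decoded JWT payload on stdin.
exp_claim() {
  sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Seconds until expiry relative to "now" (defaults to the current time).
# A zero or negative result means the token has already expired.
seconds_left() {
  exp="$1"
  now="${2:-$(date +%s)}"
  echo $(( exp - now ))
}
```

A small positive margin can still fail in practice, since the exchange happens some time after the user's request arrives.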
Workflow Step Failed at Step N
Symptom. Caller sees 502 or 500 on /v1/invoke. Audit table shows
a WorkflowInvocationFailed event with failed_step set.
Likely causes.
- A Handlebars template referenced a variable that no earlier step extracted.
- A JSONPath extractor returned no match (the upstream response shape changed).
- The upstream HTTP call returned a non-2xx that the step's on_error: fail policy escalated.
- Network failure to the upstream API.
Diagnose. Read the failure event for the failing execution:
SELECT payload
FROM gateway_events
WHERE event_type = 'WorkflowInvocationFailed'
AND payload->>'execution_id' = '<exec id>'
ORDER BY created_at DESC LIMIT 1;
The reason field carries the exact error: template render failure, extractor miss, upstream HTTP status, or transport error. Cross-reference with the corresponding WorkflowStepExecuted events for the same execution_id to see where the data flow broke.
Fix.
- Template error: correct the variable name in the workflow's step body template.
- Extractor miss: update the JSONPath to match the new response shape, or add a default with on_error: continue if the field is genuinely optional.
- Upstream failure: investigate the upstream service. The step's recorded http_status tells you whether it is a 4xx (your request) or 5xx (their service).
Pool Timed Out
Symptom. Logs include lines like:
ERROR sqlx::pool: database pool acquire timed out — request path starved
Requests start returning 503 Service Unavailable or 500 Internal Server Error under load.
Likely causes.
- The database connection pool is too small for the offered load.
- A long-running query is holding a connection.
- The database itself is the bottleneck (CPU, I/O, or its own connection limit).
Diagnose.
-- Postgres: who is connected and what are they doing?
SELECT pid, usename, state, query_start, query
FROM pg_stat_activity
WHERE datname = 'gateway'
ORDER BY query_start;
# Watch gateway logs for sustained pool-timeout messages versus a one-off
journalctl -u aegis-seal-gateway -f | grep -i 'pool acquire'
Fix.
- Scale the database tier (CPU, RAM, connection limit).
- Add gateway replicas to spread the connection demand (Postgres only — SQLite cannot scale this way).
- If a specific query is the offender (large gateway_events scan, for instance), archive the audit table to keep its working set small.
Do not increase the per-connection acquire timeout to mask the symptom — the timeout is short by design so starvation surfaces fast.
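Telling sustained starvation from a one-off is easier with per-minute counts than by eye. The sketch below assumes each log line begins with an RFC 3339 timestamp such as 2024-05-01T12:00:03Z; that prefix is an assumption about your log format, so adjust the substr() call if yours differs.

```shell
# Sketch: count pool-timeout log lines per minute, reading logs on stdin.
# Assumes the first field of every line is an RFC 3339 timestamp.
pool_timeouts_per_minute() {
  awk '/pool acquire timed out/ { count[substr($1, 1, 16)]++ }
       END { for (minute in count) print minute, count[minute] }' | sort
}
```

With journald, journalctl -u aegis-seal-gateway -o cat prints only the message text without the journal's own prefix, which may be the simplest way to feed this function. Counts that stay elevated across many consecutive minutes suggest a sizing or query problem; a single spike suggests a transient stall.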
Operator JWT 401
Symptom. Calls to operator endpoints (/v1/specs, /v1/workflows,
/v1/cli-tools, /v1/security-contexts) return 401 Unauthorized even
with a token that worked yesterday.
Likely causes.
- The configured operator_jwt_issuer or operator_jwt_audience does not match the token's claims.
- The JWKS document at operator_jwks_uri is stale and missing a new signing key the IdP rotated to.
- The token's aegis_role claim (or whichever claim operator_role_claim names) does not contain operator or platform-admin.
Diagnose.
# Decode the token payload (do not paste the token anywhere)
echo '<token>' | cut -d. -f2 | base64 -d | jq
# Verify expected issuer / audience / role claim against the gateway config
# Force a JWKS refresh by restarting the gateway. The default cache TTL
# is 300s, so a hot rotation may be failing because of a stale cache.
systemctl restart aegis-seal-gateway # or kubectl rollout restart …
Fix.
- Realign the gateway's operator_jwt_issuer and operator_jwt_audience with what your IdP actually mints.
- Lower jwks_cache_ttl_secs if your IdP rotates aggressively.
- Add the operator role to the token's role claim in the IdP.
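To confirm the stale-JWKS cause specifically, compare the kid in the token header with the key ids the IdP currently publishes. The sketch below uses sed and grep rather than a JSON parser, so treat it as a quick check only; the function names are illustrative, and a kid containing regular-expression metacharacters would need escaping.

```shell
# Sketch: extract the "kid" from a JWT header (first dot-separated segment).
jwt_kid() {
  hdr="$(printf '%s' "$1" | cut -d. -f1 | tr '_-' '/+')"
  case $(( ${#hdr} % 4 )) in     # restore the padding that JWTs omit
    2) hdr="${hdr}==" ;;
    3) hdr="${hdr}=" ;;
  esac
  printf '%s' "$hdr" | base64 -d |
    sed -n 's/.*"kid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeeds when the JWKS document on stdin lists the given key id.
jwks_has_kid() {
  grep -q "\"kid\"[[:space:]]*:[[:space:]]*\"$1\""
}
```

If the IdP's live JWKS contains the token's kid but the gateway still rejects the token, a stale cache on the gateway side becomes the more likely explanation.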
Semantic Judge Unreachable
Symptom. CLI tools registered with require_semantic_judge: true
fail. The audit table contains CliToolSemanticRejected events with
rejection_reason mentioning network or timeout errors against the judge
endpoint.
The semantic judge fails closed. If SEAL_GATEWAY_SEMANTIC_JUDGE_URL
is unset, or the configured endpoint is unreachable, every CLI invocation
that requires the judge is rejected. There is no fallback to "permit on
unreachable" — that would defeat the policy.
Likely causes.
- cli.semantic_judge_url is unset but a registered tool requires it.
- The judge service is down or behind a network partition.
- The judge endpoint is returning non-2xx (auth misconfiguration on the judge side, model overloaded).
Diagnose.
curl -s "$SEAL_GATEWAY_SEMANTIC_JUDGE_URL" -o /dev/null -w "%{http_code}\n"
SELECT created_at, payload->>'tool_name' AS tool, payload->>'rejection_reason' AS reason
FROM gateway_events
WHERE event_type = 'CliToolSemanticRejected'
ORDER BY created_at DESC LIMIT 20;
Fix. Restore the judge service. If you are intentionally running without one, re-register the affected tools with require_semantic_judge: false, but understand that this removes intent-level policy enforcement and falls back to the per-tool subcommand allowlist alone.
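The fail-closed rule can be expressed as a small probe interpreter. This is an illustration of the policy as described on this page, not gateway code: it takes the configured URL and the status code printed by the curl probe, and treats anything other than a configured URL with a 2xx response as a rejection. curl prints 000 when it receives no HTTP response at all.

```shell
# Sketch of the fail-closed rule: only a configured URL that answers 2xx
# counts as reachable; an unset URL, curl's 000, or any non-2xx rejects.
judge_gate() {
  url="$1"
  http_code="$2"
  if [ -z "$url" ]; then
    echo "reject: judge URL unset"
    return 1
  fi
  case "$http_code" in
    2[0-9][0-9]) echo "reachable"; return 0 ;;
    *)           echo "reject: judge returned $http_code"; return 1 ;;
  esac
}
```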
Log Filter Recipes
A working set of RUST_LOG strings for common debugging sessions. Pick
one, restart the gateway, reproduce the failure, then revert.
| Goal | RUST_LOG |
|---|---|
| Trace SEAL envelope verification | info,aegis_seal_gateway::infrastructure::auth=trace |
| Trace workflow engine end-to-end | info,aegis_seal_gateway::application::workflow=trace |
| Trace CLI invocation pipeline | info,aegis_seal_gateway::application::cli=trace |
| Trace credential resolver | info,aegis_seal_gateway::application::credential=trace |
| Trace JTI replay sweep | info,aegis_seal_gateway::infrastructure::persistence=debug |
| Trace HTTP client to upstream APIs | info,aegis_seal_gateway::infrastructure::http_client=trace,reqwest=debug |
| Quiet noisy infrastructure crates | info,sqlx=warn,h2=warn,hyper=warn,tonic=warn |
| Errors only | error |
| Maximum noise (last resort) | trace |
| Selective per-module trace | info,aegis_seal_gateway::application::workflow::engine=trace |
| Database-only debugging | warn,aegis_seal_gateway::infrastructure::persistence=debug,sqlx=info |
When to File an Issue
If the diagnostics above do not lead to a fix, open an issue at
github.com/100monkeys-ai/aegis-seal-gateway/issues with:
- Gateway version. From the image tag or aegis-seal-gateway --version.
- Storage backend. SQLite vs Postgres, including Postgres major version.
- Container CLI. Output of podman --version (or docker --version) from inside the gateway container.
- Reproduction. The exact curl (with secrets redacted) that fails, plus the exact response.
- Logs. A copy with RUST_LOG=info,aegis_seal_gateway=debug covering the failed request from receipt to error response.
- Audit events. The gateway_events rows for the failing execution_id, JSON-formatted.
- Configuration. The effective config (file plus env), with secrets redacted.
Attaching all of that turns most reports into a same-day fix. Attaching only "it does not work" does not.
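Redaction is easy to get wrong by hand. A filter like the one below can take a first pass over logs, config dumps, and curl commands before you attach them. The patterns are illustrative and not exhaustive (lowercase config keys, for instance, are not covered), so read the output yourself before posting.

```shell
# Sketch: mask common secret shapes on stdin before sharing the text.
# Covers bearer tokens, the X-Vault-Token header, and upper-case
# NAME=value pairs whose name contains SECRET, TOKEN, or PASSWORD.
redact_secrets() {
  sed -E \
    -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g' \
    -e 's#(X-Vault-Token: )[^" ]+#\1[REDACTED]#g' \
    -e 's#([A-Z0-9_]*(SECRET|TOKEN|PASSWORD)[A-Z0-9_]*=)[^ ]+#\1[REDACTED]#g'
}
```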