Metrics Reference

Complete catalog of Prometheus metrics exposed by the AEGIS orchestrator, including names, types, labels, and descriptions.

AEGIS exposes operational metrics in the Prometheus text format. This page lists all canonical metrics provided by the orchestrator and its subsystems.

All metric names are prefixed with aegis_.
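
As a minimal sketch of how the shared prefix can be used, the scrape job below keeps only `aegis_`-prefixed series from the orchestrator target. The job name, the target address `aegis-orchestrator:9090`, and the `/metrics` path are assumptions for illustration; this page does not state where the endpoint is served.

```yaml
scrape_configs:
  - job_name: aegis-orchestrator              # hypothetical job name
    metrics_path: /metrics                    # assumed path (the Prometheus default)
    static_configs:
      - targets: ["aegis-orchestrator:9090"]  # hypothetical host:port
    metric_relabel_configs:
      # Keep only scraped series whose name starts with the aegis_ prefix.
      - source_labels: [__name__]
        regex: "aegis_.*"
        action: keep
```

Synthetic series that Prometheus generates itself, such as `up`, are not subject to `metric_relabel_configs` and remain available for target health checks.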


Agent Lifecycle (BC-1)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_agents_total` | Gauge | `status` | Number of agents by status (deployed, paused, archived). |

Execution Context (BC-2)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_executions_total` | Counter | `status` | Cumulative executions by terminal status (completed, failed, cancelled). |
| `aegis_executions_active` | Gauge | none | Number of executions currently in the running state. |
| `aegis_execution_duration_seconds` | Histogram | `status` | Wall-clock duration from start to terminal event. |
| `aegis_iteration_count` | Histogram | `final_status` | Number of iterations performed per execution. |
| `aegis_iterations_total` | Counter | `status` | Cumulative iterations by outcome (success, failed, refining). |
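
The counter and histogram above are typically consumed through rates and quantiles rather than raw values. The recording rules below are a sketch of that pattern, not rules shipped with AEGIS: the group name and the `aegis:` rule names are hypothetical, and the `_bucket` suffix assumes the histogram is exposed as a classic Prometheus histogram.

```yaml
groups:
  - name: aegis_execution_recording   # hypothetical group name
    rules:
      # Fraction of executions in the last 5m that ended with status="failed".
      - record: aegis:executions_failed:ratio_rate5m
        expr: |
          sum(rate(aegis_executions_total{status="failed"}[5m]))
          /
          sum(rate(aegis_executions_total[5m]))
      # 99th-percentile execution duration across all terminal statuses.
      - record: aegis:execution_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(aegis_execution_duration_seconds_bucket[5m]))
          )
```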

Workflow Orchestration (BC-3)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_workflow_executions_total` | Counter | `status` | Cumulative workflow executions by terminal status. |
| `aegis_workflow_executions_active` | Gauge | none | Number of workflow executions currently in progress. |
| `aegis_workflow_state_transitions_total` | Counter | `from_kind`, `to_kind` | Cumulative state transitions between state types. |

Security & SEAL (BC-4)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_seal_attestations_total` | Counter | `result` | Cumulative attestation attempts (success, failure). |
| `aegis_seal_sessions_active` | Gauge | none | Number of active SEAL sessions. |
| `aegis_seal_policy_violations_total` | Counter | `violation_type` | Cumulative policy violations by type. |
| `aegis_seal_token_refreshes_total` | Counter | `result` | Cumulative SecurityToken refresh attempts. |
| `aegis_seal_signature_failures_total` | Counter | none | Cumulative SealEnvelope signature verification failures. |

Swarm Coordination (BC-6)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_swarms_active` | Gauge | none | Number of swarms currently active. |
| `aegis_swarm_child_spawns_total` | Counter | `result` | Cumulative child spawn attempts. |
| `aegis_swarm_cascade_cancellations_total` | Counter | none | Cumulative swarms dissolved via cascade cancellation. |
| `aegis_swarm_lock_contentions_total` | Counter | none | Cumulative lock requests that were blocked. |
| `aegis_swarm_lock_expirations_total` | Counter | none | Cumulative resource locks that expired via TTL. |

Cluster Coordination (BC-16)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_cluster_peers` | Gauge | `status` | Number of cluster peers by status (active, draining, unhealthy). |
| `aegis_cluster_heartbeat_age_seconds` | Histogram | none | Age of the most recent heartbeat from each worker at evaluation time. |
| `aegis_cluster_health_status` | Gauge | none | Overall cluster health (1 = healthy, 0 = degraded). |
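
Because `aegis_cluster_health_status` is a 0/1 gauge and `aegis_cluster_peers` is broken down by status, both map directly onto simple threshold alerts. The rules below are an illustrative sketch: the alert names, `for` durations, and severities are assumptions, not values recommended by this page.

```yaml
groups:
  - name: aegis_cluster_alerts   # hypothetical group name
    rules:
      # Fires when the overall health gauge reports degraded (0) for 5 minutes.
      - alert: AegisClusterDegraded
        expr: aegis_cluster_health_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "AEGIS cluster health is degraded on {{ $labels.instance }}"

      # Fires when at least one peer has been in the unhealthy status for 5 minutes.
      - alert: AegisUnhealthyPeers
        expr: aegis_cluster_peers{status="unhealthy"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} cluster peer(s) reported as unhealthy"
```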

Storage Gateway & NFS (BC-7)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_nfs_operations_total` | Counter | `operation`, `result` | Cumulative NFS FSAL operations (lookup, read, write, etc.). |
| `aegis_nfs_operation_duration_seconds` | Histogram | `operation` | Latency of NFS operations. |
| `aegis_nfs_bytes_read_total` | Counter | none | Cumulative bytes read via NFS. |
| `aegis_nfs_bytes_written_total` | Counter | none | Cumulative bytes written via NFS. |
| `aegis_nfs_path_traversal_blocked_total` | Counter | none | Blocked directory traversal attempts. |
| `aegis_nfs_policy_violations_total` | Counter | none | Filesystem policy violations. |
| `aegis_volumes_active` | Gauge | `storage_class` | Number of active volumes (ephemeral, persistent). |
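
For storage dashboards, the byte counters convert to throughput with `rate()`, and the duration histogram yields per-operation latency percentiles. The following recording rules sketch this under the assumption that the histogram is a classic Prometheus histogram with `_bucket` series; the group and rule names are hypothetical.

```yaml
groups:
  - name: aegis_nfs_recording   # hypothetical group name
    rules:
      # Read and write throughput in bytes per second, averaged over 5m.
      - record: aegis:nfs_bytes_read:rate5m
        expr: sum(rate(aegis_nfs_bytes_read_total[5m]))
      - record: aegis:nfs_bytes_written:rate5m
        expr: sum(rate(aegis_nfs_bytes_written_total[5m]))
      # 95th-percentile latency, kept separate for each NFS operation.
      - record: aegis:nfs_operation_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, operation) (rate(aegis_nfs_operation_duration_seconds_bucket[5m]))
          )
```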

Stimulus-Response (BC-8)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_stimuli_ingested_total` | Counter | `source`, `routing_mode` | Cumulative stimuli ingested by source and mode. |
| `aegis_stimuli_rejected_total` | Counter | `reason` | Cumulative stimuli rejected by the gateway. |
| `aegis_stimulus_routing_duration_seconds` | Histogram | `routing_mode` | Time taken to route a stimulus to a workflow. |

Event Bus

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_event_bus_published_total` | Counter | `event_type` | Total events published to the in-process bus. |
| `aegis_event_bus_dropped_total` | Counter | `subscriber` | Events dropped due to full subscriber buffers. |
| `aegis_event_bus_subscriber_lag` | Gauge | `subscriber` | Current number of unprocessed events per subscriber. |

HTTP API

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_http_requests_total` | Counter | `method`, `path_template`, `status_code` | Total HTTP requests handled. |
| `aegis_http_request_duration_seconds` | Histogram | `method`, `path_template` | HTTP request latency. |
| `aegis_http_requests_in_flight` | Gauge | none | Number of HTTP requests currently being processed. |
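
A common use of these labels is a per-route error ratio and latency percentile. The sketch below assumes that `status_code` carries the numeric HTTP status as a string (for example `"503"`), which this page does not confirm, and that the duration histogram exposes classic `_bucket` series. The group and rule names are hypothetical.

```yaml
groups:
  - name: aegis_http_recording   # hypothetical group name
    rules:
      # Share of requests per route that returned a 5xx status over the last 5m.
      - record: aegis:http_requests_5xx:ratio_rate5m
        expr: |
          sum by (path_template) (rate(aegis_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (path_template) (rate(aegis_http_requests_total[5m]))
      # 99th-percentile request latency per route.
      - record: aegis:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, path_template) (rate(aegis_http_request_duration_seconds_bucket[5m]))
          )
```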

gRPC API

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_grpc_requests_total` | Counter | `method`, `code` | Total gRPC calls by method and status code. |
| `aegis_grpc_request_duration_seconds` | Histogram | `method` | gRPC method latency. |

System

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `aegis_node_info` | Gauge | `node_id`, `name`, `region`, `version` | Static node identification (value is always 1). |
| `aegis_node_uptime_seconds` | Gauge | none | Seconds since the daemon started. |
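
Since `aegis_node_info` always has the value 1, multiplying another series by it attaches the node's identifying labels without changing the value. The alert below sketches that join to flag recent restarts. The alert name, the 10-minute threshold, and the assumption that both series share the scrape-assigned `instance` label are illustrative.

```yaml
groups:
  - name: aegis_system_alerts   # hypothetical group name
    rules:
      - alert: AegisNodeRecentlyRestarted
        # Uptime below 10 minutes, enriched with node_id and version
        # copied from the always-1 info metric on the same instance.
        expr: |
          (aegis_node_uptime_seconds < 600)
          * on (instance) group_left (node_id, version)
          aegis_node_info
        labels:
          severity: info
        annotations:
          summary: "Node {{ $labels.node_id }} (version {{ $labels.version }}) restarted {{ $value }}s ago"
```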

Grafana Dashboard Plan

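To provide visibility into the AEGIS platform, we recommend the following Grafana dashboards. Most panels are built from the metrics listed above; the resource utilization panel and the Cortex & Knowledge panels rely on metrics that are not part of this catalog.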

| Dashboard | Key Panels |
| --- | --- |
| AEGIS Overview | Active executions gauge, execution throughput (rate), P99 duration heatmap, active SEAL sessions, and event bus drop rate. |
| Security & Policy | SEAL attestation success/failure ratio, policy violations by type (e.g., ToolExplicitlyDenied), and NFS path traversal attempts. |
| Storage (NFS) | I/O operations/sec by type, latency percentiles, throughput (MB/s), and active volume count by storage class. |
| Execution Deep Dive | Iteration count distribution, terminal status breakdown, and resource utilization (CPU/Mem) per execution node. |
| Cortex & Knowledge | Pattern indexed/pruned rate, search latency, and embedding cache hit/miss ratio. |
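
As one way to back the Security & Policy panels, the queries can be precomputed as recording rules so that the dashboard reads cheap, pre-aggregated series. This is a sketch only: the group and rule names are hypothetical, and the 5m window is an arbitrary choice.

```yaml
groups:
  - name: aegis_security_dashboard   # hypothetical group name
    rules:
      # Share of attestation attempts that succeeded over the last 5m.
      - record: aegis:seal_attestations_success:ratio_rate5m
        expr: |
          sum(rate(aegis_seal_attestations_total{result="success"}[5m]))
          /
          sum(rate(aegis_seal_attestations_total[5m]))
      # Policy violations per second, one series per violation type.
      - record: aegis:seal_policy_violations:rate5m
        expr: sum by (violation_type) (rate(aegis_seal_policy_violations_total[5m]))
      # Blocked NFS path traversal attempts per second.
      - record: aegis:nfs_path_traversal_blocked:rate5m
        expr: sum(rate(aegis_nfs_path_traversal_blocked_total[5m]))
```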

Prometheus Alerting Rules

The following alerting rules provide baseline protection for critical AEGIS operational signals.

groups:
  - name: aegis_alerts
    rules:
      # Alert on high rate of security policy violations
      - alert: HighSealPolicyViolations
        expr: rate(aegis_seal_policy_violations_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of SEAL policy violations on {{ $labels.instance }}"
          description: "AEGIS blocked more than 0.1 violations/sec over the last 5m. This may indicate a misconfigured agent or a prompt injection attempt."

      # Alert on dropped events (indicates subscriber lag or buffer exhaustion)
      - alert: EventBusDroppedEvents
        expr: increase(aegis_event_bus_dropped_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Domain events are being dropped"
          description: "The internal event bus for {{ $labels.subscriber }} is full. This usually indicates that the subscriber is too slow to process incoming events."

      # Alert on attestation failures
      - alert: SealAttestationFailure
        expr: rate(aegis_seal_attestations_total{result="failure"}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elevated SEAL attestation failures"
          description: "Attestation is failing at a rate of {{ $value }}/sec. Check node clock skew or expired NodeSecurityTokens."
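
Rules like these can be checked before deployment with Prometheus rule unit tests. The test file below is a sketch for the `EventBusDroppedEvents` alert. It assumes the rule group above is saved as `aegis_alerts.yml` (a hypothetical file name) and uses a made-up subscriber called `audit`.

```yaml
# aegis_alerts_test.yml (hypothetical file name)
rule_files:
  - aegis_alerts.yml          # assumed to contain the aegis_alerts group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Counter stays at 0 for the first four samples, then grows by 5 per minute.
      - series: 'aegis_event_bus_dropped_total{subscriber="audit"}'
        values: '0+0x3 5+5x7'
    alert_rule_test:
      # Before any events are dropped, the alert must not fire.
      - eval_time: 2m
        alertname: EventBusDroppedEvents
        exp_alerts: []
      # Several minutes after drops begin, past the 1m "for" window.
      - eval_time: 10m
        alertname: EventBusDroppedEvents
        exp_alerts:
          - exp_labels:
              severity: critical
              subscriber: audit
            exp_annotations:
              summary: "Domain events are being dropped"
              description: "The internal event bus for audit is full. This usually indicates that the subscriber is too slow to process incoming events."
```

The test can be run with `promtool test rules aegis_alerts_test.yml`. The expected annotations must match the rendered rule text exactly, so any change to the alert's wording needs a matching change here.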