🔭 Astrolabe Observatory

Cluster Health Monitoring Dashboard

Loading... Refresh: ~2s
Overview
Agent Status & Data Flow
Metrics Explorer debug
Incident creation debug
🛸 Total Pods
-
🚀 Running Pods
-
⭐ Active Agents
-
⚡ Active Incidents
-

🤖 Astrolabe Agents

Loading agents...

🌌 Infrastructure Services

Loading infrastructure...

🖥️ Cluster Nodes

Loading nodes...

💾 Storage Status

Loading storage...

🚨 Recent Incidents

Loading incidents...

📊 Data pipeline by cluster

Select a cluster and time period to see: metrics collected, counts at each stage and per agent (in/out), what is persisted in the DB, and an OK/fail verdict showing where the pipeline is broken. Errors from agents during the period are listed below.

Select a cluster and click Run pipeline check.
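As a rough illustration, the OK/fail verdict can be derived by scanning the per-stage counts in order and flagging the first stage where data stops flowing. The stage names and the zero-count rule below are assumptions for illustration, not the Observatory's actual check:

```python
# Hypothetical sketch of the per-stage pipeline verdict described above.
# Stage names loosely mirror the pipeline (Collector -> Voyager -> NATS ->
# Sync -> DB); they are assumptions, not the Observatory API.

def check_pipeline(stage_counts):
    """Return (status, first_broken_stage) for an ordered dict of stage -> count."""
    previous = None
    for stage, count in stage_counts.items():
        if count == 0 and (previous is None or previous > 0):
            return ("FAIL", stage)  # data stopped flowing at this stage
        previous = count
    return ("OK", None)

print(check_pipeline({"collected": 120, "voyager": 120, "synced": 0, "persisted": 0}))
# -> ('FAIL', 'synced')
```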

🤖 Agent status (health, errors, logs)

Live health comes from the cluster; the last error and recent activity come from agent heartbeats (optional cluster filter below).

Switch to this tab or click Refresh to load.
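A minimal sketch of deriving agent health from heartbeat age, as the tab description implies. The 90-second staleness threshold is an assumption, not a documented value:

```python
from datetime import datetime, timedelta, timezone

# Hedged sketch: an agent is "healthy" if its last heartbeat is recent,
# "stale" otherwise. The threshold below is an assumed example value.
STALE_AFTER = timedelta(seconds=90)

def agent_health(last_heartbeat, now=None):
    now = now or datetime.now(timezone.utc)
    return "healthy" if now - last_heartbeat <= STALE_AFTER else "stale"

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(agent_health(datetime(2024, 1, 1, 11, 59, tzinfo=timezone.utc), now))  # healthy
print(agent_health(datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc), now))  # stale
```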

📐 Metrics Explorer debug

Per-cluster service counts (metrics_aggregated_ts vs metrics_raw) and, for a selected cluster, service count by namespace. Use this when not all services appear in Metrics Explorer.

Select options and click Run debug.
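The core comparison this tab performs can be sketched as a set difference: which services were ingested into metrics_raw but never made it into metrics_aggregated_ts. Table names are taken from the text above; the query layer itself is omitted and hypothetical:

```python
# Sketch of the metrics_aggregated_ts vs metrics_raw comparison described
# above. Inputs are assumed to be distinct service lists already fetched
# from the two tables for one cluster.

def missing_from_aggregated(raw_services, aggregated_services):
    """Services present in metrics_raw but absent from metrics_aggregated_ts."""
    return sorted(set(raw_services) - set(aggregated_services))

raw = ["api", "checkout", "payments", "web"]
agg = ["api", "web"]
print(missing_from_aggregated(raw, agg))  # ['checkout', 'payments']
```

An empty result here with a low raw count points at ingestion rather than aggregation, which is the distinction the checklist below walks through.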

When counts are low: is it the collector?

  • Compare the two tables above. If metrics_raw also has a low distinct service count for that cluster → the gap is ingestion (data never reached the platform). The pipeline is: Collector (in cluster) → Voyager → NATS → Sync → metrics_raw. So you must check the collector (and then Voyager/Sync).
  • If metrics_raw has many services but metrics_aggregated_ts has fewer → the drop is at aggregation (Analytics / RED feeder), not the collector.
  • Collector-side checks (when metrics_raw is low):
    • On the cluster where services are missing, check collector logs for Collector discovery summary: namespaces=N distinct_service_ids=M pod_count=…. If N or M is low, the collector is only discovering a subset (RBAC or Metrics API).
    • RBAC: Collector must be able to list pods in all namespaces and read Metrics API (metrics.k8s.io/v1beta1). Run kubectl get pods -A from the collector’s ServiceAccount; if it fails or is limited, fix Role/ClusterRole.
    • Metrics API: If metrics-server (or EKS equivalent) doesn’t expose metrics for some namespaces or returns errors, those pods are skipped. Check collector logs for ApiException/errors.
    • Labels: service_id comes from pod label app or app.kubernetes.io/name or pod name. Missing labels can change how many “services” you see.
    • If collector logs show high distinct_service_ids but the DB has few → the drop is after the collector (Voyager, Sync, or cluster_id resolution). Check Sync logs for “Stored metric” with that cluster_id and “Skipping … cluster_id could not be resolved”.
  • Full step-by-step: docs/metrics/DEBUG_MISSING_SERVICES_METRICS_EXPLORER.md (sections 0–2 and 5).
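The service_id labeling rule mentioned in the checklist above can be sketched as a label lookup with a pod-name fallback. The precedence order shown here is an assumption based on the description ("pod label app or app.kubernetes.io/name or pod name"), not the collector's actual code:

```python
# Hedged sketch of service_id derivation. Assumed precedence:
# label "app", then "app.kubernetes.io/name", then the pod name.

def service_id(pod_name, labels):
    return labels.get("app") or labels.get("app.kubernetes.io/name") or pod_name

print(service_id("web-5c9d", {"app": "web"}))                     # web
print(service_id("api-7f2a", {"app.kubernetes.io/name": "api"}))  # api
print(service_id("job-xyz-123", {}))                              # job-xyz-123
```

This is why missing labels change the distinct_service_ids count: unlabeled pods each fall back to their unique pod name instead of grouping under one service.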

🧪 Incident creation debug

The walkthrough matches scripts/debug-incident-creation.py: publish a Voyager-style message on NATS (astrolabe.events.<event_type>) so Astra sees the same path as HTTPS collectors. Then verify agent_incidents and the synced incidents.

Expected flow (debug checklist)
  1. Cluster — Pick an enrolled cluster (same UUID as in clusters / collector config).
  2. Signal — slo (SLO breach, immediate) and k8s (matrix-style) are the most reliable one-shot types; perf may be episode-gated.
  3. NATS publish — Observatory publishes to astrolabe.events.<type> (not bare anomaly.detected), matching Voyager.
  4. Astra — Subscribes to bare + astrolabe.events.*; processes payload (confidence, impact, dedup).
  5. DB — New rows in agent_incidents; REST sync → incidents for triage UI.
  6. From your laptop — If Voyager HTTP fails with a DNS error (e.g. Windows error 11001), use this tab from inside the cluster or port-forward Observatory.
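The publish step in the checklist above can be sketched as subject and payload construction followed by a NATS publish. The payload fields are assumptions for illustration; only the astrolabe.events.<event_type> subject shape comes from the text:

```python
import json

# Sketch of the publish step, loosely mirroring what the text says
# scripts/debug-incident-creation.py does. Payload fields are assumed.

def build_event(cluster_id, event_type, payload):
    subject = f"astrolabe.events.{event_type}"
    message = {"cluster_id": cluster_id, "event_type": event_type, **payload}
    return subject, json.dumps(message).encode()

subject, data = build_event("11111111-2222-3333-4444-555555555555", "slo",
                            {"service": "checkout", "breach": "latency_p99"})
print(subject)  # astrolabe.events.slo

# Actually publishing requires nats-py and a reachable NATS server:
#   import asyncio, nats
#   async def main():
#       nc = await nats.connect("nats://localhost:4222")
#       await nc.publish(subject, data)
#       await nc.drain()
#   asyncio.run(main())
```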

After publishing, run the pipeline check on the Agent Status & Data Flow tab for the same cluster, or check Astra logs.

Publish a signal to see API steps and JSON.