🔭 Astrolabe Observatory

Cluster Health Monitoring Dashboard

Loading... Refresh: ~2s
Overview
Agent Status & Data Flow
Metrics Explorer debug
Incident creation debug
🛸 Total Pods
-
🚀 Running Pods
-
⭐ Active Agents
-
⚡ Active Incidents
-

🤖 Astrolabe Agents

Loading agents...

🌌 Infrastructure Services

Loading infrastructure...

🖥️ Cluster Nodes

Loading nodes...

💾 Storage Status

Loading storage...

🚨 Recent Incidents

Loading incidents...

📊 Data pipeline by cluster

Select a cluster and time period to see: metrics collected, counts at each stage and per agent (in/out), what is persisted in the DB, and an OK/fail verdict showing where the pipeline is broken. Errors from agents during the period are listed below.

Select a cluster and click Run pipeline check.
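As a rough illustration, the OK/fail verdict can be derived by scanning the per-stage counts in order and flagging the first stage where data stops flowing. The stage names and the zero-count rule below are assumptions for illustration, not the Observatory's actual check:

```python
# Hypothetical sketch of the per-stage pipeline verdict described above.
# Stage names loosely mirror the pipeline (Collector -> Voyager -> NATS ->
# Sync -> DB); they are assumptions, not the Observatory API.

def check_pipeline(stage_counts):
    """Return (status, first_broken_stage) for an ordered dict of stage -> count."""
    previous = None
    for stage, count in stage_counts.items():
        if count == 0 and (previous is None or previous > 0):
            return ("FAIL", stage)  # data stopped flowing at this stage
        previous = count
    return ("OK", None)

print(check_pipeline({"collected": 120, "voyager": 120, "synced": 0, "persisted": 0}))
# -> ('FAIL', 'synced')
```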

🤖 Agent status (health, errors, logs)

Live health comes from the cluster; the last error and recent activity come from agent heartbeats (optional cluster filter below).

Switch to this tab or click Refresh to load.
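A minimal sketch of deriving agent health from heartbeat age, as the tab description implies. The 90-second staleness threshold is an assumption, not a documented value:

```python
from datetime import datetime, timedelta, timezone

# Hedged sketch: an agent is "healthy" if its last heartbeat is recent,
# "stale" otherwise. The threshold below is an assumed example value.
STALE_AFTER = timedelta(seconds=90)

def agent_health(last_heartbeat, now=None):
    now = now or datetime.now(timezone.utc)
    return "healthy" if now - last_heartbeat <= STALE_AFTER else "stale"

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(agent_health(datetime(2024, 1, 1, 11, 59, tzinfo=timezone.utc), now))  # healthy
print(agent_health(datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc), now))  # stale
```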

📐 Metrics Explorer debug

Per-cluster service counts (metrics_aggregated_ts vs metrics_raw) and, for a selected cluster, service count by namespace. Use this when not all services appear in Metrics Explorer.

Select options and click Run debug.
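The core comparison this tab performs can be sketched as a set difference: which services were ingested into metrics_raw but never made it into metrics_aggregated_ts. Table names are taken from the text above; the query layer itself is omitted and hypothetical:

```python
# Sketch of the metrics_aggregated_ts vs metrics_raw comparison described
# above. Inputs are assumed to be distinct service lists already fetched
# from the two tables for one cluster.

def missing_from_aggregated(raw_services, aggregated_services):
    """Services present in metrics_raw but absent from metrics_aggregated_ts."""
    return sorted(set(raw_services) - set(aggregated_services))

raw = ["api", "checkout", "payments", "web"]
agg = ["api", "web"]
print(missing_from_aggregated(raw, agg))  # ['checkout', 'payments']
```

An empty result here with a low raw count points at ingestion rather than aggregation, which is the distinction the checklist below walks through.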

When counts are low: is it the collector?

  • Compare the two tables above. If metrics_raw also has a low distinct service count for that cluster → the gap is ingestion (data never reached the platform). The pipeline is: Collector (in cluster) → Voyager → NATS → Sync → metrics_raw. So you must check the collector (and then Voyager/Sync).
  • If metrics_raw has many services but metrics_aggregated_ts has fewer → the drop is at aggregation (Analytics / RED feeder), not the collector.
  • Collector-side checks (when metrics_raw is low):
    • On the cluster where services are missing, check collector logs for Collector discovery summary: namespaces=N distinct_service_ids=M pod_count=…. If N or M is low, the collector is only discovering a subset (RBAC or Metrics API).
    • RBAC: Collector must be able to list pods in all namespaces and read Metrics API (metrics.k8s.io/v1beta1). Run kubectl get pods -A from the collector’s ServiceAccount; if it fails or is limited, fix Role/ClusterRole.
    • Metrics API: If metrics-server (or EKS equivalent) doesn’t expose metrics for some namespaces or returns errors, those pods are skipped. Check collector logs for ApiException/errors.
    • Labels: service_id comes from pod label app or app.kubernetes.io/name or pod name. Missing labels can change how many “services” you see.
    • If collector logs show high distinct_service_ids but the DB has few → the drop is after the collector (Voyager, Sync, or cluster_id resolution). Check Sync logs for “Stored metric” with that cluster_id and “Skipping … cluster_id could not be resolved”.
  • Full step-by-step: docs/metrics/DEBUG_MISSING_SERVICES_METRICS_EXPLORER.md (sections 0–2 and 5).
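The service_id labeling rule mentioned in the checklist above can be sketched as a label lookup with a pod-name fallback. The precedence order shown here is an assumption based on the description ("pod label app or app.kubernetes.io/name or pod name"), not the collector's actual code:

```python
# Hedged sketch of service_id derivation. Assumed precedence:
# label "app", then "app.kubernetes.io/name", then the pod name.

def service_id(pod_name, labels):
    return labels.get("app") or labels.get("app.kubernetes.io/name") or pod_name

print(service_id("web-5c9d", {"app": "web"}))                     # web
print(service_id("api-7f2a", {"app.kubernetes.io/name": "api"}))  # api
print(service_id("job-xyz-123", {}))                              # job-xyz-123
```

This is why missing labels change the distinct_service_ids count: unlabeled pods each fall back to their unique pod name instead of grouping under one service.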

🧪 Incident creation debug

The walkthrough matches scripts/debug-incident-creation.py: publish a Voyager-style message on NATS (astrolabe.events.<event_type>) so Astra sees the same path as HTTPS collectors. Then verify agent_incidents and the synced incidents.

Expected flow (debug checklist)
  1. Cluster — Pick an enrolled cluster (same UUID as in clusters / collector config).
  2. Signal — slo (SLO breach, immediate) and k8s (matrix-style) are the most reliable one-shot types; perf may be episode-gated.
  3. NATS publish — Observatory publishes to astrolabe.events.<type> (not bare anomaly.detected), matching Voyager.
  4. Astra — Subscribes to bare + astrolabe.events.*; processes payload (confidence, impact, dedup).
  5. DB — New rows in agent_incidents; REST sync → incidents for triage UI.
  6. From your laptop — If Voyager HTTP fails with a DNS error (e.g. Windows error 11001), use this tab from inside the cluster or port-forward Observatory.
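The publish step in the checklist above can be sketched as subject and payload construction followed by a NATS publish. The payload fields are assumptions for illustration; only the astrolabe.events.<event_type> subject shape comes from the text:

```python
import json

# Sketch of the publish step, loosely mirroring what the text says
# scripts/debug-incident-creation.py does. Payload fields are assumed.

def build_event(cluster_id, event_type, payload):
    subject = f"astrolabe.events.{event_type}"
    message = {"cluster_id": cluster_id, "event_type": event_type, **payload}
    return subject, json.dumps(message).encode()

subject, data = build_event("11111111-2222-3333-4444-555555555555", "slo",
                            {"service": "checkout", "breach": "latency_p99"})
print(subject)  # astrolabe.events.slo

# Actually publishing requires nats-py and a reachable NATS server:
#   import asyncio, nats
#   async def main():
#       nc = await nats.connect("nats://localhost:4222")
#       await nc.publish(subject, data)
#       await nc.drain()
#   asyncio.run(main())
```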

After publishing, run the pipeline check on the Agent Status & Data Flow tab for the same cluster, or check Astra logs.

Publish a signal to see API steps and JSON.