Spec 06 — Consolidate health-check workflows

Purpose

There are two health-check workflows that probe overlapping but different services on different schedules:

  • .github/workflows/health-check.yml: every 6 h. Checks OpenClaw, Traccar, 3 runners, n8n. Auto-opens AND auto-closes a server-down issue. This is the one to keep.
  • .github/workflows/infra-health.yml: weekly. Checks IONOS ping, 1 runner, Supabase, n8n. Sends a Discord summary. Opens an issue on failure but never closes it. Largely redundant.

The duplication causes:

  • Confusion about which one to trust
  • Double-paging on failure (both fire)
  • Drift in checked services (Supabase is checked by infra-health only; OpenClaw by health-check only)

After spec 05 (Blackbox) lands, most of health-check.yml becomes redundant too, but spec 05 doesn’t cover GH runner status (that’s an API check, not an HTTP probe).
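
To illustrate why runner status needs an API call rather than an HTTP probe, here is a minimal Python sketch. It assumes the runners are registered at repository level and uses GitHub's "list self-hosted runners for a repository" REST endpoint. The function names, runner names, and sample payload are hypothetical; the current health-check.yml may do this differently.

```python
import json
import urllib.request


def fetch_runners(owner: str, repo: str, token: str) -> dict:
    """Fetch the repository's self-hosted runners from the GitHub REST API.

    Assumption: the token is allowed to read runner data for this repository.
    """
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/actions/runners",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def unhealthy_runners(payload: dict, expected: list[str]) -> list[str]:
    """Return expected runner names that are missing or not reporting "online"."""
    status_by_name = {r["name"]: r["status"] for r in payload.get("runners", [])}
    return [name for name in expected if status_by_name.get(name) != "online"]


# Illustrative payload in the shape of the API response; the names are made up.
sample = {
    "total_count": 2,
    "runners": [
        {"id": 1, "name": "runner-a", "status": "online", "busy": False},
        {"id": 2, "name": "runner-b", "status": "offline", "busy": False},
    ],
}
print(unhealthy_runners(sample, ["runner-a", "runner-b", "runner-c"]))
# prints ['runner-b', 'runner-c']
```

A runner that is not registered at all (runner-c here) is reported the same way as one that is offline, so a deregistered runner cannot pass silently.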


Rulebook

  1. One owner per kind of check. health-check.yml keeps the runner and API checks; HTTP probes migrate to Blackbox (spec 05).
  2. Discord summary only on transitions. Don’t spam Discord weekly when everything is fine; only on UP→DOWN and DOWN→UP.

Implementation plan

  1. Move the Supabase check from infra-health.yml into health-check.yml (it’s an API check that Blackbox cannot probe without being given secrets).
  2. Delete .github/workflows/infra-health.yml.
  3. Remove the OpenClaw, Traccar, and n8n HTTP probes from health-check.yml once spec 05 ships. (Until then, keep them as a belt-and-braces fallback.)
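  4. Add a summary-on-transition step: derive the previous state from the server-down issue (whether one is open, and its last comment timestamp), compare it with the current run’s result, and send to Discord only if the state changed.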
  5. Document in docs/runbook.md: “Health is observed by (a) health-check.yml for API/runner state, (b) Prometheus blackbox for HTTP, (c) Prometheus node/container rules for VPS health.”
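
The transition logic in step 4 can be sketched as a small pure function. This is a simplification under a stated assumption: it treats "a server-down issue is currently open" as the stored previous state and does not use the comment timestamp. The names Action and decide, and the message texts, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """What a single workflow run should do after its checks finish."""
    open_issue: bool = False
    close_issue: bool = False
    discord_message: Optional[str] = None


def decide(issue_is_open: bool, failed_checks: list[str]) -> Action:
    """Map (previous state, current result) to an action.

    Assumption: an open server-down issue means the previous state was DOWN.
    """
    was_down = issue_is_open
    is_down = bool(failed_checks)
    if is_down and not was_down:
        # UP -> DOWN: open the issue and notify once.
        names = ", ".join(sorted(failed_checks))
        return Action(open_issue=True, discord_message=f"DOWN: {names}")
    if was_down and not is_down:
        # DOWN -> UP: close the issue and notify once.
        return Action(close_issue=True, discord_message="UP: all checks passing")
    # No state change: stay silent.
    return Action()


print(decide(False, ["Supabase"]).discord_message)  # prints DOWN: Supabase
print(decide(True, ["Supabase"]).discord_message)   # prints None
```

Because the state lives in the issue itself, nothing needs to be persisted between runs; the trade-off is that manually closing the issue during an outage would trigger a second DOWN alert on the next run.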

Acceptance criteria

  • infra-health.yml deleted
  • health-check.yml includes Supabase check
  • Discord receives one message per transition, not per run
  • Manually breaking a service produces exactly one Discord alert + one GH issue (not two of each)
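
As a rough way to reason about the "one message per transition" criterion, the sketch below counts how many Discord messages a series of runs would produce when notifications fire only on state changes. The function name and the run history are hypothetical, and the initial state is assumed to be UP.

```python
def count_notifications(run_results: list[bool], initially_up: bool = True) -> int:
    """Count messages sent when notifying only on state changes.

    Each entry in run_results is True if every check passed on that run.
    """
    sent = 0
    previous = initially_up
    for current in run_results:
        if current != previous:
            sent += 1
        previous = current
    return sent


# Service breaks on run 3, stays broken for two more runs, recovers on run 6.
runs = [True, True, False, False, False, True, True]
print(count_notifications(runs))  # prints 2 (one DOWN alert, one UP alert)
```

With per-run notification the same history would produce three DOWN alerts before recovery, which is the behaviour the criterion rules out.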

Cost impact

0 €. Slight reduction in GH Actions minutes (~5 min/month).

Back-out plan

Restore infra-health.yml from git history.

Risks / open questions

None. This is pure cleanup.