Spec 06 — Consolidate health-check workflows
Purpose
There are two health-check workflows that probe overlapping but different services on different schedules:
- .github/workflows/health-check.yml: every 6 h. Checks OpenClaw, Traccar, 3 runners, n8n. Auto-opens AND auto-closes a server-down issue. The good one.
- .github/workflows/infra-health.yml: weekly. Checks IONOS ping, 1 runner, Supabase, n8n. Sends a Discord summary. Only opens an issue on failure; never closes. Largely redundant.
The duplication causes:
- Confusion about which one to trust
- Double-paging on failure (both fire)
- Drift in checked services (Supabase is checked by infra-health only; OpenClaw by health-check only)
After spec 05 (Blackbox) lands, most of health-check.yml is redundant too — but spec 05 doesn’t cover GH runner status (that’s an API check, not HTTP).
Rulebook
- One workflow for runner + API health, one for HTTP health. health-check.yml keeps the runner/API checks; HTTP probes migrate to Blackbox (spec 05).
- Discord summary only on transitions. Don’t spam Discord weekly when everything is fine; only on UP→DOWN and DOWN→UP.
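The transition rule can be sketched as a small pure function. This is only an illustration, not the workflow's actual code: the names State, transition_message and notifications are hypothetical, and the sketch assumes each run reduces to a single overall UP or DOWN state.

```python
from enum import Enum
from typing import Iterable, List, Optional


class State(Enum):
    UP = "UP"
    DOWN = "DOWN"


def transition_message(previous: State, current: State) -> Optional[str]:
    """Return a message only when the state changed, else None."""
    if previous == current:
        return None  # steady state: stay silent
    return f"{previous.value}→{current.value}"


def notifications(runs: Iterable[State], initial: State = State.UP) -> List[str]:
    """Fold a sequence of run results into the messages that would be sent."""
    sent: List[str] = []
    previous = initial
    for current in runs:
        message = transition_message(previous, current)
        if message is not None:
            sent.append(message)
        previous = current  # carry the state forward to the next run
    return sent


if __name__ == "__main__":
    history = [State.UP, State.DOWN, State.DOWN, State.UP, State.UP]
    print(notifications(history))  # ['UP→DOWN', 'DOWN→UP']
```

Five runs with one outage produce two messages, one per transition, rather than five.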
Implementation plan
- Move the Supabase check from infra-health.yml into health-check.yml (it’s an API check, not HTTP-probe-able from blackbox without secrets).
- Delete .github/workflows/infra-health.yml.
- Remove OpenClaw, Traccar, n8n HTTP probes from health-check.yml once spec 05 ships. (Until then, keep them as belt-and-braces.)
- Add a summary-on-transition step: query the open server-down issue’s last comment timestamp; only send Discord if state changed.
- Document in docs/runbook.md: “Health is observed by (a) health-check.yml for API/runner state, (b) Prometheus blackbox for HTTP, (c) Prometheus node/container rules for VPS health.”
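One possible shape for the summary-on-transition step is sketched below. It simplifies the plan in one respect: instead of reading the issue’s last comment timestamp, it treats the mere presence of an open server-down issue as the previous state, which works only because health-check.yml both opens and closes that issue. Treating server-down as a label, the function names, and the sample data are all assumptions made for illustration.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def previous_state(open_issues: List[Dict[str, Any]]) -> str:
    """Infer the last known state from the open server-down issues.

    `open_issues` is the decoded JSON from GitHub's "list repository issues"
    endpoint, queried with state=open and labels=server-down (the label name
    is an assumption). That endpoint also returns pull requests, which carry
    a "pull_request" key and are ignored here.
    """
    real_issues = [i for i in open_issues if "pull_request" not in i]
    return "DOWN" if real_issues else "UP"


def discord_request(
    webhook_url: str, previous: str, current: str
) -> Optional[urllib.request.Request]:
    """Build (but do not send) the webhook POST, or None if nothing changed."""
    if previous == current:
        return None
    body = json.dumps(
        {"content": f"Health state changed: {previous}→{current}"},
        ensure_ascii=False,
    )
    return urllib.request.Request(
        webhook_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    issues = [{"number": 12, "title": "server-down"}]  # made-up sample data
    req = discord_request(
        "https://example.invalid/webhook", previous_state(issues), "UP"
    )
    if req is not None:
        print(req.get_method(), req.data.decode("utf-8"))
```

Passing the returned request to urllib.request.urlopen would send it. The current state is assumed to come from the run’s own probe results.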
Acceptance criteria
- infra-health.yml deleted
- health-check.yml includes Supabase check
- Discord receives one message per transition, not per run
- Manually breaking a service produces exactly one Discord alert + one GH issue (not two of each)
Cost impact
0 €. Slight reduction in GH Actions minutes (~5 min/month).
Back-out plan
Restore infra-health.yml from git history.
Risks / open questions
None. This is pure cleanup.