Alert-Response Runbook (Admin Reactions)
Status: PROPOSED — pending approval. This is the standing procedure for how infra alerts/reports are worked. It is the connective index over the existing per-service ops docs (“workbooks”).
The loop
For every infra report (Alertmanager alert, health-check, [infra-alert] issue):
- Detect — read the alert/report; identify the distinct underlying condition
and collapse only genuine duplicates (same alert and identity labels — see
Duplicate-issue hygiene; N tickets that differ by an
identity label like
credentialare N conditions, not one). - Diagnose — confirm it is real before acting. Probe the target directly; a non-2xx from a blackbox probe is often a false positive (auth, wrong path).
- Auto-solve — if the fix is safe and reversible (restart a crashed container, correct a probe, clear a transient), do it and verify the alert clears.
- Escalate — if not safely auto-solvable, hand the human the workbook for
that condition (the “Admin reaction” column below). Use the
human-actionlabel. - No workbook? Write one (or add the reaction here), and request approval via PR before it becomes an executable procedure.
“Safe & reversible” = no data loss, no secret exposure, no prod auth change, easy rollback. Credential rotation, DB changes, and anything on a production gateway are not auto-solve — they escalate.
Alert → admin-reaction matrix
| Alert | First check (is it real?) | Auto-solve (if real & safe) | Escalate to human when | Workbook |
|---|---|---|---|---|
EndpointDown (behind basic_auth: prometheus, alertmanager) | curl -o /dev/null -w '%{http_code}' — 401 = UP, false positive | Fix probe: send basic_auth or accept 401, OR probe an unauth health path. PR to monitoring/prometheus. | Probe returns 000/5xx after fix | monitoring-stack-operations.md |
EndpointDown (waha2 root) | /api/server/status green ⇒ service UP | Repoint probe to /api/server/status | Status path also down | waha-operations.md |
EndpointDown (vercel /api/health → 404) | App root loads ⇒ app UP, route missing | Repoint probe to real health route or app root | App root also down | cloud-services-operations.md |
EndpointDown (000, e.g. n8n.vps-h1) | Truly unreachable — DNS/Traefik/container | Restart container/Traefik on vps-h1 if crashed | Host/network down, needs console | hostinger-runbook.md, n8n-operations.md |
WAHAContainerDown | Confirm via docker ps on vps-h1 + /api/server/status | docker compose up -d waha if actually stopped | Restart loops / data issue | waha-incident-router.md |
CredentialRotationCritical / …Overdue | curl :9230/metrics — p24_credential_days_overdue=999 ⇒ never rotated | auto_rotate=true creds: trigger credential-rotation.yml | auto_rotate=false creds → always human | password-rotation-procedures.md |
HighCPU | Identify process via node_exporter / top | Kill runaway non-critical proc | Sustained load needs capacity/decision | hostinger-runbook.md, vps-i1-operations.md |
PromtailNotSendingLogs | Check promtail container + targets on vps-h1 | Restart promtail | Config/positions corruption | hostinger-runbook.md |
BackupExporterDown (suppressed) | Check backup-exporter container | Restart exporter | Backups actually failing | monitoring-exporters-operations.md |
Duplicate-issue hygiene
alertmanager-escalation.yml files a GitHub [infra-alert] issue every 15 min
while a condition persists.
Dedup by the alert’s identity labels, not just (alertname, instance). Some
alerts carry an extra label that distinguishes genuinely different conditions
sharing one instance — most notably CredentialRotationCritical, where every
overdue secret fires from the same instance=credential-exporter:9230 but a
distinct credential. The 16 such issues seen on 2026-06-08 were not 16
duplicates of one condition — they were 16 different overdue credentials, of which
only 4 were true (same-credential) duplicates. Collapsing by (alertname, instance) alone would have erased 12 real conditions.
Procedure: group by the identity key (alertname, instance, credential) — i.e.
the same labels the escalation script now hashes into its Dedup key. Within each
group keep the lowest-numbered issue, close the rest referencing it, and only close
the survivor when the alert clears in Alertmanager.
Note (#333, fixed): the escalation script used to dedupe on the Alertmanager fingerprint (a hash of the full label set), so a flip of a volatile label (
auto_rotate,criticality,severity) changed the fingerprint and spawned a duplicate. It now dedupes on a stable(alertname, instance, credential)key, so volatile-label flips no longer create duplicates.
Escalation & approval mechanics
- Escalate = comment the relevant workbook section on the issue + apply
human-action. P1/critical also goes to Discord (already wired in Alertmanager). - New procedure = PR adding/extending a workbook; merge to
dev= approved. - Track every worked condition in
dev_r_incidents(DevOps tooling table).
First run — live triage 2026-06-08
Verified against Alertmanager :9093 + direct probes from vps-i1.
False positives (monitoring misconfig — fix probes, then close issues)
EndpointDownprometheus.vps-i1, alertmanager.vps-i1 — both 401 (basic_auth). Services UP.EndpointDownwaha2.vps-h1 (root) — 401;/api/server/statusgreen.WAHAContainerDown— WAHA responding; likely stale container metric.EndpointDownvercel/api/health×2 — 404, app up, route missing.EndpointDownclaude-proxy/health — 501; grafana-image-renderer:8081/ — no/route. Verify correct probe paths.
Real — need action
EndpointDownn8n.vps-h1 (+/healthz) — 000, fully unreachable. Investigate on vps-h1 (Traefik/container/DNS).CredentialRotationCritical×16 — never-rotated secrets.auto_rotate=true(MYSQL_PASSWORD, CLOUDFLARE_TOKEN, WASABI ×2, SMTP, HSTGR_N8N_API_KEY, SUPABASE_GRAFANA_PASSWORD, OPENAI_MONITORING_TOKEN, EMAIL_SENDER_API_KEY) → can be auto-rotated.auto_rotate=false(ATRAX_AUTH_STRING, HSTGR_N8N_MCP_TOKEN, TRELLO_KEYS, WASABI dupes) → human.HighCPUvps-h1,PromtailNotSendingLogsvps-h1 — diagnose on vps-h1.
Recommended next actions
- Fix the blackbox probes (basic_auth / correct paths) → clears ~7 false alerts. PR to
monitoring/. - Dedupe the 25
[infra-alert]issues → keep one per condition. - Investigate n8n.vps-h1 outage (real).
- Decide credential rotation: auto-rotate the 9
auto_rotate=true; schedule the rest with you.