Alert-Response Runbook (Admin Reactions)

Status: PROPOSED — pending approval. This is the standing procedure for how infra alerts/reports are worked. It is the connective index over the existing per-service ops docs (“workbooks”).

The loop

For every infra report (Alertmanager alert, health-check, [infra-alert] issue):

  1. Detect — read the alert/report; identify the distinct underlying condition and collapse only genuine duplicates (same alert and identity labels — see Duplicate-issue hygiene; N tickets that differ by an identity label like credential are N conditions, not one).
  2. Diagnose — confirm it is real before acting. Probe the target directly; a non-2xx from a blackbox probe is often a false positive (auth, wrong path).
  3. Auto-solve — if the fix is safe and reversible (restart a crashed container, correct a probe, clear a transient), do it and verify the alert clears.
  4. Escalate — if not safely auto-solvable, hand the human the workbook for that condition (the “Admin reaction” column below). Use the human-action label.
  5. No workbook? Write one (or add the reaction here), and request approval via PR before it becomes an executable procedure.

“Safe & reversible” = no data loss, no secret exposure, no prod auth change, easy rollback. Credential rotation, DB changes, and anything on a production gateway are not auto-solve — they escalate.

Alert → admin-reaction matrix

AlertFirst check (is it real?)Auto-solve (if real & safe)Escalate to human whenWorkbook
EndpointDown (behind basic_auth: prometheus, alertmanager)curl -o /dev/null -w '%{http_code}'401 = UP, false positiveFix probe: send basic_auth or accept 401, OR probe an unauth health path. PR to monitoring/prometheus.Probe returns 000/5xx after fixmonitoring-stack-operations.md
EndpointDown (waha2 root)/api/server/status green ⇒ service UPRepoint probe to /api/server/statusStatus path also downwaha-operations.md
EndpointDown (vercel /api/health → 404)App root loads ⇒ app UP, route missingRepoint probe to real health route or app rootApp root also downcloud-services-operations.md
EndpointDown (000, e.g. n8n.vps-h1)Truly unreachable — DNS/Traefik/containerRestart container/Traefik on vps-h1 if crashedHost/network down, needs consolehostinger-runbook.md, n8n-operations.md
WAHAContainerDownConfirm via docker ps on vps-h1 + /api/server/statusdocker compose up -d waha if actually stoppedRestart loops / data issuewaha-incident-router.md
CredentialRotationCritical / …Overduecurl :9230/metricsp24_credential_days_overdue=999 ⇒ never rotatedauto_rotate=true creds: trigger credential-rotation.ymlauto_rotate=false creds → always humanpassword-rotation-procedures.md
HighCPUIdentify process via node_exporter / topKill runaway non-critical procSustained load needs capacity/decisionhostinger-runbook.md, vps-i1-operations.md
PromtailNotSendingLogsCheck promtail container + targets on vps-h1Restart promtailConfig/positions corruptionhostinger-runbook.md
BackupExporterDown (suppressed)Check backup-exporter containerRestart exporterBackups actually failingmonitoring-exporters-operations.md

Duplicate-issue hygiene

alertmanager-escalation.yml files a GitHub [infra-alert] issue every 15 min while a condition persists.

Dedup by the alert’s identity labels, not just (alertname, instance). Some alerts carry an extra label that distinguishes genuinely different conditions sharing one instance — most notably CredentialRotationCritical, where every overdue secret fires from the same instance=credential-exporter:9230 but a distinct credential. The 16 such issues seen on 2026-06-08 were not 16 duplicates of one condition — they were 16 different overdue credentials, of which only 4 were true (same-credential) duplicates. Collapsing by (alertname, instance) alone would have erased 12 real conditions.

Procedure: group by the identity key (alertname, instance, credential) — i.e. the same labels the escalation script now hashes into its Dedup key. Within each group keep the lowest-numbered issue, close the rest referencing it, and only close the survivor when the alert clears in Alertmanager.

Note (#333, fixed): the escalation script used to dedupe on the Alertmanager fingerprint (a hash of the full label set), so a flip of a volatile label (auto_rotate, criticality, severity) changed the fingerprint and spawned a duplicate. It now dedupes on a stable (alertname, instance, credential) key, so volatile-label flips no longer create duplicates.

Escalation & approval mechanics

  • Escalate = comment the relevant workbook section on the issue + apply human-action. P1/critical also goes to Discord (already wired in Alertmanager).
  • New procedure = PR adding/extending a workbook; merge to dev = approved.
  • Track every worked condition in dev_r_incidents (DevOps tooling table).

First run — live triage 2026-06-08

Verified against Alertmanager :9093 + direct probes from vps-i1.

False positives (monitoring misconfig — fix probes, then close issues)

  • EndpointDown prometheus.vps-i1, alertmanager.vps-i1 — both 401 (basic_auth). Services UP.
  • EndpointDown waha2.vps-h1 (root) — 401; /api/server/status green.
  • WAHAContainerDown — WAHA responding; likely stale container metric.
  • EndpointDown vercel /api/health ×2 — 404, app up, route missing.
  • EndpointDown claude-proxy/health501; grafana-image-renderer:8081/ — no / route. Verify correct probe paths.

Real — need action

  • EndpointDown n8n.vps-h1 (+/healthz) — 000, fully unreachable. Investigate on vps-h1 (Traefik/container/DNS).
  • CredentialRotationCritical ×16 — never-rotated secrets. auto_rotate=true (MYSQL_PASSWORD, CLOUDFLARE_TOKEN, WASABI ×2, SMTP, HSTGR_N8N_API_KEY, SUPABASE_GRAFANA_PASSWORD, OPENAI_MONITORING_TOKEN, EMAIL_SENDER_API_KEY) → can be auto-rotated. auto_rotate=false (ATRAX_AUTH_STRING, HSTGR_N8N_MCP_TOKEN, TRELLO_KEYS, WASABI dupes) → human.
  • HighCPU vps-h1, PromtailNotSendingLogs vps-h1 — diagnose on vps-h1.
  1. Fix the blackbox probes (basic_auth / correct paths) → clears ~7 false alerts. PR to monitoring/.
  2. Dedupe the 25 [infra-alert] issues → keep one per condition.
  3. Investigate n8n.vps-h1 outage (real).
  4. Decide credential rotation: auto-rotate the 9 auto_rotate=true; schedule the rest with you.