Alert-Response Runbook (Admin Reactions)

Status: PROPOSED — pending approval. This is the standing procedure for how infra alerts/reports are worked. It is the connective index over the existing per-service ops docs (“workbooks”).

The loop

For every infra report (Alertmanager alert, health-check, [infra-alert] issue):

Detect — read the alert/report; identify the distinct underlying condition and collapse only genuine duplicates (same alert and identity labels — see Duplicate-issue hygiene; N tickets that differ by an identity label like credential are N conditions, not one).
Diagnose — confirm it is real before acting. Probe the target directly; a non-2xx from a blackbox probe is often a false positive (auth, wrong path).
Auto-solve — if the fix is safe and reversible (restart a crashed container, correct a probe, clear a transient), do it and verify the alert clears.
Escalate — if not safely auto-solvable, hand the human the workbook for that condition (the “Admin reaction” column below). Use the human-action label.
No workbook? Write one (or add the reaction here), and request approval via PR before it becomes an executable procedure.

“Safe & reversible” = no data loss, no secret exposure, no prod auth change, easy rollback. Credential rotation, DB changes, and anything on a production gateway are not auto-solve — they escalate.

Alert → admin-reaction matrix

Alert	First check (is it real?)	Auto-solve (if real & safe)	Escalate to human when	Workbook
`EndpointDown` (behind basic_auth: prometheus, alertmanager)	`curl -o /dev/null -w '%{http_code}'` — 401 = UP, false positive	Fix probe: send basic_auth or accept 401, OR probe an unauth health path. PR to `monitoring/prometheus`.	Probe returns `000`/`5xx` after fix	monitoring-stack-operations.md
`EndpointDown` (waha2 root)	`/api/server/status` green ⇒ service UP	Repoint probe to `/api/server/status`	Status path also down	waha-operations.md
`EndpointDown` (vercel `/api/health` → 404)	App root loads ⇒ app UP, route missing	Repoint probe to real health route or app root	App root also down	cloud-services-operations.md
`EndpointDown` (`000`, e.g. n8n.vps-h1)	Truly unreachable — DNS/Traefik/container	Restart container/Traefik on vps-h1 if crashed	Host/network down, needs console	hostinger-runbook.md, n8n-operations.md
`WAHAContainerDown`	Confirm via `docker ps` on vps-h1 + `/api/server/status`	`docker compose up -d waha` if actually stopped	Restart loops / data issue	waha-incident-router.md
`CredentialRotationCritical` / `…Overdue`	`curl :9230/metrics` — `p24_credential_days_overdue=999` ⇒ never rotated	`auto_rotate=true` creds: trigger `credential-rotation.yml`	`auto_rotate=false` creds → always human	password-rotation-procedures.md
`HighCPU`	Identify process via node_exporter / `top`	Kill runaway non-critical proc	Sustained load needs capacity/decision	hostinger-runbook.md, vps-i1-operations.md
`PromtailNotSendingLogs`	Check promtail container + targets on vps-h1	Restart promtail	Config/positions corruption	hostinger-runbook.md
`BackupExporterDown` (suppressed)	Check backup-exporter container	Restart exporter	Backups actually failing	monitoring-exporters-operations.md

Duplicate-issue hygiene

alertmanager-escalation.yml files a GitHub [infra-alert] issue every 15 min while a condition persists.

Dedup by the alert’s identity labels, not just (alertname, instance). Some alerts carry an extra label that distinguishes genuinely different conditions sharing one instance — most notably CredentialRotationCritical, where every overdue secret fires from the same instance=credential-exporter:9230 but a distinct credential. The 16 such issues seen on 2026-06-08 were not 16 duplicates of one condition — they were 16 different overdue credentials, of which only 4 were true (same-credential) duplicates. Collapsing by (alertname, instance) alone would have erased 12 real conditions.

Procedure: group by the identity key (alertname, instance, credential) — i.e. the same labels the escalation script now hashes into its Dedup key. Within each group keep the lowest-numbered issue, close the rest referencing it, and only close the survivor when the alert clears in Alertmanager.

Note (#333, fixed): the escalation script used to dedupe on the Alertmanager fingerprint (a hash of the full label set), so a flip of a volatile label (auto_rotate, criticality, severity) changed the fingerprint and spawned a duplicate. It now dedupes on a stable (alertname, instance, credential) key, so volatile-label flips no longer create duplicates.

Escalation & approval mechanics

Escalate = comment the relevant workbook section on the issue + apply human-action. P1/critical also goes to Discord (already wired in Alertmanager).
New procedure = PR adding/extending a workbook; merge to dev = approved.
Track every worked condition in dev_r_incidents (DevOps tooling table).

First run — live triage 2026-06-08

Verified against Alertmanager :9093 + direct probes from vps-i1.

False positives (monitoring misconfig — fix probes, then close issues)

EndpointDown prometheus.vps-i1, alertmanager.vps-i1 — both 401 (basic_auth). Services UP.
EndpointDown waha2.vps-h1 (root) — 401; /api/server/status green.
WAHAContainerDown — WAHA responding; likely stale container metric.
EndpointDown vercel /api/health ×2 — 404, app up, route missing.
EndpointDown claude-proxy/health — 501; grafana-image-renderer:8081/ — no / route. Verify correct probe paths.

Real — need action

EndpointDown n8n.vps-h1 (+/healthz) — 000, fully unreachable. Investigate on vps-h1 (Traefik/container/DNS).
CredentialRotationCritical ×16 — never-rotated secrets. auto_rotate=true (MYSQL_PASSWORD, CLOUDFLARE_TOKEN, WASABI ×2, SMTP, HSTGR_N8N_API_KEY, SUPABASE_GRAFANA_PASSWORD, OPENAI_MONITORING_TOKEN, EMAIL_SENDER_API_KEY) → can be auto-rotated. auto_rotate=false (ATRAX_AUTH_STRING, HSTGR_N8N_MCP_TOKEN, TRELLO_KEYS, WASABI dupes) → human.
HighCPU vps-h1, PromtailNotSendingLogs vps-h1 — diagnose on vps-h1.

Recommended next actions

Fix the blackbox probes (basic_auth / correct paths) → clears ~7 false alerts. PR to monitoring/.
Dedupe the 25 [infra-alert] issues → keep one per condition.
Investigate n8n.vps-h1 outage (real).
Decide credential rotation: auto-rotate the 9 auto_rotate=true; schedule the rest with you.

p24-infra Docs

Explorer

alert-response-runbook