08 — Incident Response
Runbooks and escalation procedures for responding to alerts, service outages, and infrastructure incidents. Alert source: Alertmanager (vps-i1) → email to radieu@gmail.com.
Incident severity: P1 Critical → immediate page; P2 Warning → daily digest (group_wait 4h).
Key Documents
| Document | Description |
|---|---|
| alert-response-runbook.md | Per-alert response procedures — what each Prometheus alert means and how to fix it |
| runbook.md | General infrastructure runbook — common procedures |
| hostinger-runbook.md | vps-h1 / WAHA-specific runbook |
| waha-incident-router.md | WAHA → Cloudflare Worker → Supabase incident thread router |
| nightly-checks-triage.md | Nightly automated checks and triage output |
Incident Classification
| Severity | Response time | Examples |
|---|---|---|
| P1 Critical | Immediate | MongoDB PRIMARY down, bms-1 disk full, WAHA gateway down, Supabase unreachable |
| P2 Warning | Within 4h | Backup stale, exporter scrape failing, high queue depth |
| P3 Info | Next business day | Cert expiry >30d, credential age warning |
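During triage it can help to list only the active alerts for one severity level. The sketch below queries the Alertmanager v2 API with a label matcher. It assumes alerts carry a `severity` label with the values `critical`, `warning`, and `info`; neither the label name nor the values are confirmed by this document, so adjust `severity_matcher` to the real alert rules. The function names and the `AM_URL` variable are illustrative.

```shell
# Sketch: list active alerts for one P-level via the Alertmanager v2 API.
# ASSUMPTION: alerts carry a "severity" label with values critical/warning/info.
AM_URL="${AM_URL:-http://localhost:9093}"

# Translate a P-level from the classification table into a label matcher.
severity_matcher() {
  case "$1" in
    P1) echo 'severity="critical"' ;;
    P2) echo 'severity="warning"' ;;
    P3) echo 'severity="info"' ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Fetch active alerts matching the given P-level as JSON.
# -G sends the --data-urlencode pairs as a URL-encoded query string.
list_alerts() {
  local matcher
  matcher="$(severity_matcher "$1")" || return 1
  curl -s -G "$AM_URL/api/v2/alerts" \
    --data-urlencode "filter=$matcher" \
    --data-urlencode "active=true"
}
```

Run on vps-i1, `list_alerts P1 | jq -r '.[].labels.alertname'` would print one alert name per line for the assumed critical severity.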
WhatsApp Incident Channel
Fleet incidents from drivers arrive via WhatsApp → WAHA gateway (vps-h1) → waha.infra.zintegrowana.online Cloudflare Worker → Supabase incidents table. See waha-incident-router.md for message routing logic.
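When driver messages stop arriving, a quick reachability probe of the Worker hop can narrow down where the chain breaks. This is a minimal sketch: it assumes the Worker answers plain HTTPS requests on the root path, which this document does not state, and the function names are hypothetical. It only distinguishes "no HTTP response", "server-side error", and "some other HTTP response"; it does not verify that messages reach Supabase.

```shell
# Sketch: probe the Cloudflare Worker hop and classify the HTTP result.
# ASSUMPTION: the Worker responds on "/"; the served paths are not documented here.
WORKER_URL="${WORKER_URL:-https://waha.infra.zintegrowana.online/}"

# Map a curl %{http_code} value to a coarse state.
# curl reports 000 when no HTTP response was received at all.
classify_http_code() {
  case "$1" in
    000)     echo "unreachable" ;;
    5??)     echo "erroring" ;;
    [1-4]??) echo "reachable" ;;
    *)       echo "unknown" ;;
  esac
}

# Request the Worker URL with a 10 s timeout and print the classified state.
probe_worker() {
  local code
  code="$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$WORKER_URL")"
  echo "worker: $(classify_http_code "$code") (HTTP $code)"
}
```

A 4xx answer is treated as reachable on purpose: it shows the Worker is up even if the probed path is not one it serves.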
Common Response Procedures
Monitoring stack down (vps-i1)
ssh root@217.154.82.162 "cd /opt/p24-infra/monitoring && docker compose ps && docker compose up -d"
Alert not firing (check Alertmanager)
ssh root@217.154.82.162 "curl -s http://localhost:9093/api/v2/alerts | jq ."
Prometheus target scrape failing
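Before reloading, it can help to see which targets Prometheus currently considers unhealthy and why. The sketch below reads the `/api/v1/targets` endpoint and prints job, scrape URL, and last error for every active target whose health is not `up`. It assumes it runs on vps-i1 with Prometheus on `localhost:9090` and `jq` installed; the function names and `PROM_URL` are illustrative.

```shell
# Sketch: summarise scrape targets that Prometheus does not report as "up".
# ASSUMPTION: run on vps-i1, Prometheus listening on localhost:9090, jq available.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Read a /api/v1/targets response on stdin; print job, scrape URL and last
# error (tab-separated) for every active target whose health is not "up".
filter_unhealthy_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | [.labels.job, .scrapeUrl, .lastError] | @tsv'
}

# Fetch the active targets and pass them through the filter above.
unhealthy_targets() {
  curl -s "$PROM_URL/api/v1/targets?state=active" | filter_unhealthy_targets
}
```

Empty output from `unhealthy_targets` means every active target is healthy, in which case the scrape failure is more likely a missing or dropped target than a failing one.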
ssh root@217.154.82.162 "curl -s -X POST http://localhost:9090/-/reload" # hot-reload config (POST required; Prometheus must run with --web.enable-lifecycle)
Improvement Proposals
| Proposal | Description |
|---|---|
| 13-hostinger-runbook.md | Enhance Hostinger runbook coverage |