08 — Incident Response

Runbooks and escalation procedures for responding to alerts, service outages, and infrastructure incidents. Alert source: Alertmanager (vps-i1) → email to radieu@gmail.com.

Severity routing: P1 Critical → immediate page; P2 Warning → batched digest (Alertmanager group_wait 4h, in line with the 4h response window below).

Key Documents

| Document | Description |
|---|---|
| alert-response-runbook.md | Per-alert response procedures: what each Prometheus alert means and how to fix it |
| runbook.md | General infrastructure runbook: common procedures |
| hostinger-runbook.md | vps-h1 / WAHA-specific runbook |
| waha-incident-router.md | WAHA → Cloudflare Worker → Supabase incident thread router |
| nightly-checks-triage.md | Nightly automated checks and triage output |

Incident Classification

| Severity | Response time | Examples |
|---|---|---|
| P1 Critical | Immediate | MongoDB PRIMARY down, bms-1 disk full, WAHA gateway down, Supabase unreachable |
| P2 Warning | Within 4h | Backup stale, exporter scrape failing, high queue depth |
| P3 Info | Next business day | Cert expiry >30d, credential age warning |
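
As a rough illustration of the classification above, the mapping from a severity label to a priority and response window could be scripted as follows. The label values `critical`, `warning`, and `info` and the function name `response_window` are assumptions made for this sketch; the actual label scheme is defined by the alert rules.

```shell
#!/bin/sh
# Hypothetical helper: map a severity label to the priority and response
# window from the classification table. The label values are assumed.
response_window() {
  case "$1" in
    critical) echo "P1 immediate" ;;
    warning)  echo "P2 within 4h" ;;
    info)     echo "P3 next business day" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

An unknown label returns a non-zero status so that a typo in a rule file is not silently treated as low priority.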

WhatsApp Incident Channel

Fleet incidents from drivers arrive via WhatsApp → WAHA gateway (vps-h1) → waha.infra.zintegrowana.online Cloudflare Worker → Supabase incidents table. See waha-incident-router.md for message routing logic.

Common Response Procedures

Monitoring stack down (vps-i1)

ssh root@217.154.82.162 "cd /opt/p24-infra/monitoring && docker compose ps && docker compose up -d"
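
After restarting the stack, it may help to confirm that both services actually answer. The sketch below probes the standard `/-/healthy` endpoints of Prometheus and Alertmanager on the ports used elsewhere in this section; it assumes it is run on vps-i1 itself, and `check_monitoring` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: confirm Prometheus (9090) and Alertmanager (9093) answer their
# health endpoints. Assumes both ports are reachable on localhost.
check_monitoring() {
  rc=0
  for url in http://localhost:9090/-/healthy http://localhost:9093/-/healthy; do
    # -f: non-zero exit on HTTP errors; --max-time: do not hang on a stuck service
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      rc=1
    fi
  done
  return "$rc"
}
```

A non-zero exit status means at least one of the two services did not respond, in which case the container logs are the next place to look.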

Alert not firing (check Alertmanager)

ssh root@217.154.82.162 "curl -s http://localhost:9093/api/v2/alerts | jq ."
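
If Alertmanager returns an empty list, the alert may never have left Prometheus. A possible follow-up, sketched here under assumptions, is to summarise Prometheus' own view from its `/api/v1/alerts` endpoint. The helper name `prom_alert_states` is hypothetical; it reads the JSON response on stdin and needs `jq` locally.

```shell
#!/bin/sh
# Sketch: print "alertname<TAB>state" for each alert Prometheus knows about.
# state is "pending" (the rule's for: duration has not elapsed) or "firing".
# Reads the JSON body of /api/v1/alerts on stdin.
prom_alert_states() {
  jq -r '.data.alerts[] | "\(.labels.alertname)\t\(.state)"'
}
```

One way to use it: `ssh root@217.154.82.162 "curl -s http://localhost:9090/api/v1/alerts" | prom_alert_states`. An alert shown as firing here but absent from Alertmanager points at delivery between the two; no output at all suggests the rule expression is not matching.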

Prometheus target scrape failing

ssh root@217.154.82.162 "curl -s -X POST http://localhost:9090/-/reload"  # hot-reload config (POST required; Prometheus must run with --web.enable-lifecycle)
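
A reload only re-reads the configuration, so it will not help if the exporter itself is unreachable. To see which targets are failing and why, the `/api/v1/targets` response can be filtered as in the sketch below. The helper name `failing_targets` is hypothetical; it reads the JSON response on stdin and needs `jq` locally.

```shell
#!/bin/sh
# Sketch: list scrape targets whose health is not "up", one per line as
# "job<TAB>scrapeUrl<TAB>lastError". Reads the JSON body of
# /api/v1/targets on stdin.
failing_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | "\(.labels.job)\t\(.scrapeUrl)\t\(.lastError)"'
}
```

One way to use it: `ssh root@217.154.82.162 "curl -s http://localhost:9090/api/v1/targets" | failing_targets`. The `lastError` column usually distinguishes a refused connection from a timeout or a malformed metrics page.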

Improvement Proposals

| Proposal | Description |
|---|---|
| 13-hostinger-runbook.md | Enhance Hostinger runbook coverage |

Cross-references

  • README — alert rules and monitoring stack
  • README — operational runbooks per service
  • README — network-layer incidents (DNS, TLS)