08 — Incident Response

Runbooks and escalation procedures for responding to alerts, service outages, and infrastructure incidents. Alert source: Alertmanager (vps-i1) → email to radieu@gmail.com.

Incident severity: P1 Critical → immediate page; P2 Warning → daily digest (group_wait 4h).

Key Documents

Document	Description
alert-response-runbook.md	Per-alert response procedures — what each Prometheus alert means and how to fix it
runbook.md	General infrastructure runbook — common procedures
hostinger-runbook.md	vps-h1 / WAHA-specific runbook
waha-incident-router.md	WAHA → Cloudflare Worker → Supabase incident thread router
nightly-checks-triage.md	Nightly automated checks and triage output

Incident Classification

Severity	Response time	Examples
P1 Critical	Immediate	MongoDB PRIMARY down, bms-1 disk full, WAHA gateway down, Supabase unreachable
P2 Warning	Within 4h	Backup stale, exporter scrape failing, high queue depth
P3 Info	Next business day	Cert expiry >30d, credential age warning

WhatsApp Incident Channel

Fleet incidents from drivers arrive via WhatsApp → WAHA gateway (vps-h1) → waha.infra.zintegrowana.online Cloudflare Worker → Supabase incidents table. See waha-incident-router.md for message routing logic.

Common Response Procedures

Monitoring stack down (vps-i1)

ssh root@217.154.82.162 "cd /opt/p24-infra/monitoring && docker compose ps && docker compose up -d"
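
After restarting the stack it can help to confirm that the services actually came back, not just that the containers exist. The sketch below polls a health endpoint until it answers or the attempts run out. wait_healthy is a hypothetical helper, not part of the repository; it assumes Prometheus and Alertmanager listen on the ports used elsewhere on this page (9090 and 9093) and expose their standard /-/healthy endpoints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a URL until it returns a non-error HTTP status.
# Arguments: URL, number of attempts (default 10), delay in seconds (default 3).
wait_healthy() {
  local url=$1 tries=${2:-10} delay=${3:-3} i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx; --max-time bounds each probe.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy: $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "unhealthy after $tries attempts: $url" >&2
  return 1
}

# Example, run on vps-i1 after "docker compose up -d":
#   wait_healthy http://localhost:9090/-/healthy && wait_healthy http://localhost:9093/-/healthy
```

A non-zero exit status means the service never answered within the allotted attempts, which is the point to look at the container logs.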

Alert not firing (check Alertmanager)

ssh root@217.154.82.162 "curl -s http://localhost:9093/api/v2/alerts | jq ."
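
The raw JSON from the alerts endpoint is verbose. As a rough sketch, the filter below reduces it to one line per alert (name, severity, state). list_alerts is a hypothetical helper; it assumes alerts carry a severity label, which this page does not confirm, and prints "-" when the label is absent. The state comes from the status.state field of the Alertmanager v2 API.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Alertmanager /api/v2/alerts JSON on stdin and
# print tab-separated "alertname, severity, state" lines.
list_alerts() {
  # "severity" is an assumed label name; "//" substitutes "-" when it is missing.
  jq -r '.[] | [.labels.alertname, (.labels.severity // "-"), .status.state] | @tsv'
}

# Example (jq runs locally, curl runs on vps-i1):
#   ssh root@217.154.82.162 "curl -s http://localhost:9093/api/v2/alerts" | list_alerts
```

Empty output means Alertmanager currently holds no alerts, so the next place to look is whether the rule is firing in Prometheus at all.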

Prometheus target scrape failing

ssh root@217.154.82.162 "curl -s -X POST http://localhost:9090/-/reload"  # hot-reload config; needs POST and Prometheus started with --web.enable-lifecycle
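
A reload only picks up configuration changes, so it is usually worth checking first which targets are failing and why. The sketch below filters the output of the Prometheus /api/v1/targets endpoint down to targets whose health is not "up". down_targets is a hypothetical helper, not something shipped in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus /api/v1/targets JSON on stdin and
# print tab-separated "job, scrape URL, last error" for every unhealthy target.
down_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | [.labels.job, .scrapeUrl, .lastError]
         | @tsv'
}

# Example (jq runs locally, curl runs on vps-i1):
#   ssh root@217.154.82.162 "curl -s http://localhost:9090/api/v1/targets" | down_targets
```

The last-error column typically distinguishes an unreachable exporter (connection refused, timeout) from a misconfigured one (HTTP error status), which decides whether the fix belongs on the target host or in the scrape config.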

Improvement Proposals

Proposal	Description
13-hostinger-runbook.md	Enhance Hostinger runbook coverage

Cross-references

README — alert rules and monitoring stack
README — operational runbooks per service
README — network-layer incidents (DNS, TLS)
