Disaster Recovery Runbook

Project: p24-infra / Ecotrans fleet platform
Last updated: 2026-06-18
Status: Initial draft. DR has never been formally tested; see §6.

1. RTO / RPO Targets

Component	RTO (max downtime)	RPO (max data loss)	Notes
MongoDB rs0 (Pinbox24 prod)	Unknown	4+ months	No automated backup — local dumps only and months old
bms-1 (Pinbox24 app)	Unknown	4+ months	No automated backup — app stateless, but DB is not
vps-i1 monitoring stack	1 hour	24 hours	Prometheus data in Thanos → Wasabi S3 (2h blocks)
n8n automation (bms-4)	2 hours	7 days (workflow configs)	n8n workflows stored in PostgreSQL on bms-4; manual backup weekly
Supabase (managed)	N/A (SLA)	Point-in-time (managed)	Supabase handles backups; escalate to support
Claude agents (vps-i1, vps-h1, bms-4)	30 min	N/A (stateless)	Re-auth may be needed; see §4.8
vps-h1 WAHA gateway	Best-effort	N/A (stateless)	Messages may be lost while down; no queue
Cloudflare (DNS/Workers)	N/A (SLA)	N/A	Fully managed

Honesty note: RTO/RPO for MongoDB and bms-1 are “unknown” because no recovery has ever been rehearsed. The 4+ month RPO for MongoDB reflects the age of the last known manual dump. Implementing automated backups is the single highest-priority DR gap.

2. DR Incident Classification

Level	Definition	Example	Response
P1 — Data Loss	Irreversible data destruction or corruption	MongoDB disk failure, dropped collection	Immediate all-hands; escalate within 5 min
P2 — Service Down	Production service unavailable; revenue impact	bms-1 host crash, Pinbox24 unreachable	Respond within 15 min; page radieu@gmail.com
P3 — Degraded	Partial failure; some functions unavailable	Monitoring stack down, WAHA offline	Respond within 2 hours; Discord alert sufficient

First response for any incident:

Check Discord #infra-alerts channel for automated alerts
SSH to affected server and run docker ps / systemctl status / df -h
Classify the incident (P1/P2/P3)
Follow the relevant scenario below
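
The `df -h` step of the first response can be partly scripted. The following is a minimal sketch, not an existing tool: the helper name `full_filesystems` and the 90% threshold are assumptions, since this runbook does not define a disk-usage limit. It reads `df -P` output on stdin and prints every mount point at or above the threshold.

```shell
# Hypothetical helper: print mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; the 90% default is an assumption, not a runbook value.
full_filesystems() {
  threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign from the Capacity column
    if (use + 0 >= t) print $6, use "%"
  }'
}

# Usage on the affected host:
#   df -P | full_filesystems 90
```

Mount points containing spaces would be truncated by this sketch, because it takes only the sixth whitespace-separated field.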

3. Backup Status (as of 2026-06-18)

Component	Backup method	Location	Frequency	Age of last backup
MongoDB rs0 (prod)	Manual dump	`/root/w-2026-` on bms-1	NONE	4+ months
MongoDB rs0 (replication)	bms-3 SECONDARY	bms-3 live data	Continuous	Current (live replica)
Prometheus metrics	Thanos sidecar → Wasabi S3	`p24-infra/thanos/`	2h blocks	Current
n8n workflows (bms-4)	Manual / none	bms-4 PostgreSQL only	NONE automated	Unknown
Supabase (et-operational-platform)	Managed by Supabase	Supabase internal	Daily PITR	Current
bms-1 container volumes	NONE	—	NONE	Never
bms-1 PostgreSQL (host-native)	NONE	—	NONE	Never
bms-1 Redis (host-native)	NONE	—	NONE	Never
vps-i1 Grafana config	Git repo	`radieu/p24-infra`	On commit	Current
Claude agent credentials	`/home/claude-runner/.claude/`	VPS local	NONE	N/A (rotates)

Components with no backup are marked NONE above. MongoDB rs0 and the bms-1 container volumes are the most critical unmitigated data-loss risks.
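
The "Age of last backup" column could be checked automatically instead of by hand. The sketch below assumes GNU coreutils (`stat -c`), which is an assumption about the hosts; the helper names and the dump path in the usage comment are hypothetical placeholders.

```shell
# Hypothetical helper: age of a file in whole days, using GNU stat (Linux hosts assumed).
backup_age_days() {
  now=$(date +%s)
  mtime=$(stat -c %Y "$1") || return 1
  echo $(( (now - mtime) / 86400 ))
}

# Hypothetical check: succeeds if the file is younger than max_days, fails otherwise.
backup_is_fresh() {
  age=$(backup_age_days "$1") || return 1
  [ "$age" -lt "$2" ]
}

# Usage (the dump path is a placeholder; substitute the real file):
#   backup_is_fresh /path/to/latest-dump.archive.gz 1 || echo "MongoDB dump is stale"
```

A check like this could feed the existing Discord alerting, so that a stale dump is noticed within a day rather than after months.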

4. Failure Scenarios and Recovery Steps

4.1 bms-2 (MongoDB PRIMARY) failure

Symptoms: Pinbox24 API errors, MongoDB write failures, rs.status() shows bms-2 DOWN.

Recovery:

Do nothing immediately — rs0 has automatic failover
bms-3 (SECONDARY) + bms-4 (ARBITER vote) have quorum (2/3 votes)
bms-3 will be elected PRIMARY within ~10-30 seconds
Verify: ssh ubuntu@51.68.155.224 "mongosh --eval 'rs.status()'" — bms-3 should show PRIMARY
Update application connection strings if they hardcode bms-2’s IP (they should use a replica set URI)
Diagnose and restore bms-2 — on return, it will sync as SECONDARY automatically
Once bms-2 is healthy, optionally transfer PRIMARY back: rs.stepDown(60) on bms-3
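
For the connection-string step, a replica set URI lists several members and names the set, so the driver finds the new PRIMARY on its own after failover. The sketch below only illustrates the URI shape: the helper `rs_uri` and the `example.internal` hostnames are hypothetical, and the real bms-2 and bms-3 addresses must be substituted.

```shell
# Hypothetical helper: build a replica set URI from a set name and member addresses.
# Listing more than one member lets the driver discover the current PRIMARY after failover.
rs_uri() {
  set_name="$1"; shift
  hosts=$(printf '%s,' "$@")
  echo "mongodb://${hosts%,}/?replicaSet=${set_name}"
}

# Placeholder addresses: replace with the real bms-2 and bms-3 host:port pairs.
rs_uri rs0 bms-2.example.internal:27017 bms-3.example.internal:27017
# prints mongodb://bms-2.example.internal:27017,bms-3.example.internal:27017/?replicaSet=rs0
```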

Risk while bms-2 is down: only 1 full-data member (bms-3) + 1 arbiter. No read replica.
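
The vote counting used in this scenario and in 4.2 and 4.3 follows one rule: electing a PRIMARY needs a strict majority of all voting members, reachable or not. A small sketch of that arithmetic (the function names are illustrative only):

```shell
# Votes needed to elect a PRIMARY: a strict majority of all voting members.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds if the reachable voters can still elect a PRIMARY.
can_elect() {
  total="$1"; reachable="$2"
  [ "$reachable" -ge "$(majority "$total")" ]
}

majority 3                                     # prints 2
can_elect 3 2 && echo "failover possible"      # one member down, e.g. bms-2
can_elect 3 1 || echo "no PRIMARY, no writes"  # two members down
```

Having enough votes is necessary but not sufficient: at least one reachable voter must also be a data-bearing member, because an arbiter can vote but can never become PRIMARY.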

4.2 bms-3 (MongoDB SECONDARY) failure

Symptoms: rs.status() shows bms-3 DOWN; bms-2 still PRIMARY; Pinbox24 continues working.

Recovery:

rs0 remains fully operational — bms-2 (PRIMARY, 1 vote) + bms-4 (arbiter, 1 vote) = quorum
No application impact
Diagnose bms-3 and restore service: ssh ubuntu@51.68.155.224 "sudo systemctl restart mongod"
bms-3 will resync from bms-2 automatically on restart
Verify resync lag: rs.printSecondaryReplicationInfo() on primary

Risk while bms-3 is down: if bms-2 also fails, only the bms-4 arbiter remains. An arbiter holds no data, so rs0 becomes completely unavailable for both reads and writes until a data-bearing member returns.

4.3 bms-4 (ARBITER) failure

Symptoms: rs.status() shows bms-4 DOWN; bms-2 PRIMARY; bms-3 SECONDARY; both still have 1 vote each.

Recovery:

rs0 remains fully operational — bms-2 + bms-3 have 2 votes (quorum maintained)
No immediate action required for Pinbox24
Restore bms-4: ssh root@54.36.123.110 "sudo systemctl restart mongod"
If bms-4 cannot reach rs0, re-add it as arbiter from the PRIMARY:
```
mongosh --eval "rs.addArb('54.36.123.110:27017')"
```

Risk while bms-4 is down: if bms-2 also fails simultaneously, bms-3 has only 1 vote — cannot reach quorum (needs 2). rs0 goes read-only.

4.4 vps-i1 (IONOS, monitoring stack) failure

Symptoms: Grafana/Prometheus/Alertmanager unreachable; no more metric alerts.

Recovery — host still alive (Docker issue):

ssh root@217.154.82.162
cd /opt/p24-infra/monitoring
docker compose ps
docker compose up -d

Recovery — host unreachable:

Check IONOS control panel for host status
If host needs reboot: use IONOS panel → server restart

If host is destroyed: provision replacement with Ansible:

ansible-playbook ansible/playbooks/provision-new-vps.yml -i ansible/inventory/

Prometheus TSDB data may be lost locally (up to 15 days), but historical data (Thanos blocks) is safe in Wasabi S3 at p24-infra/thanos/
Restore .env.bak from Infisical p24-monitoring project

Business impact: monitoring is internal tooling — no direct user impact, but alerts stop firing.
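
After the stack is restarted, each service can be probed through its built-in health endpoint (`/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana). The sketch below assumes the upstream default ports 9090, 9093, and 3000; the compose file for this stack may publish different ports, and the helper names are hypothetical.

```shell
# Hypothetical helper: print the health endpoints for the three monitoring services.
# Ports are the upstream defaults (9090, 9093, 3000); the real stack may map them differently.
health_urls() {
  host="$1"
  echo "http://${host}:9090/-/healthy"   # Prometheus
  echo "http://${host}:9093/-/healthy"   # Alertmanager
  echo "http://${host}:3000/api/health"  # Grafana
}

# Probe each endpoint; curl -f turns HTTP error responses into a non-zero exit status.
check_stack() {
  health_urls "$1" | while read -r url; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}

# Usage, run on vps-i1 after `docker compose up -d`:
#   check_stack localhost
```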

4.5 vps-h1 (Hostinger, WAHA gateway) failure

Symptoms: WhatsApp messages not received, WAHA unreachable, fleet incident notifications stop.

Recovery:

ssh root@72.60.32.61
cd /root
docker compose ps
docker compose up -d waha

If host unreachable: check Hostinger control panel. vps-h1 is a PROTECTED host — no new services should be added during recovery. Role is fixed: WAHA gateway + monitoring agents.

Manual incident handling: while WAHA is offline, incidents must be tracked manually. Check the Supabase incidents table for open items: supabase.co → Table Editor → incidents.

Business impact: WhatsApp-based fleet incident reports are lost while WAHA is down. No queue — messages sent during downtime are not recoverable.

4.6 bms-1 (Pinbox24 production) failure

Symptoms: api.w4.pinbox24.com, w4.pinbox24.com, api.w3.pinbox24.com unreachable.

Recovery — host alive:

ssh root@94.23.26.113
docker ps   # check which containers are stopped
docker start nginx-proxy v42-prod s3-v42-prod mailgun-v42-prod
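
Rather than reading `docker ps` output by eye, the stopped containers can be derived by comparing the running names against the four named in the recovery command above. This is a sketch under that assumption; the helper `missing_containers` is hypothetical, and the list must be extended if other containers are also required.

```shell
# Container names taken from the `docker start` command in this scenario.
EXPECTED_CONTAINERS="nginx-proxy v42-prod s3-v42-prod mailgun-v42-prod"

# Hypothetical helper: read running container names on stdin, print expected ones that are absent.
missing_containers() {
  running=$(cat)
  for name in $EXPECTED_CONTAINERS; do
    # -x matches whole lines only, so "v42-prod" does not match "s3-v42-prod"
    printf '%s\n' "$running" | grep -qxF -- "$name" || echo "$name"
  done
}

# Usage on bms-1 (xargs -r is a GNU extension that skips the command on empty input):
#   docker ps --format '{{.Names}}' | missing_containers | xargs -r docker start
```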

Recovery — host destroyed (no automated failover):

This is a catastrophic scenario — no DR plan exists for bms-1 total loss.
The server was provisioned manually (no IaC/Ansible).
MongoDB data may be partially recovered from bms-3 (rs0 SECONDARY if rs0 is intact) or from manual dumps in /root on bms-1 (if disk is recoverable).
Container images for v3.x and untagged containers (v41-prod, v32-prod-socket, v32-prod-reso) may be unrecoverable if not exported to a registry before failure.
Estimated recovery time: days to weeks, depending on what is recoverable.

Mitigation actions required before this scenario becomes recoverable:

Export untagged images to Wasabi or a registry (see bms-1 open tasks)
Implement automated MongoDB dumps to Wasabi
Create Ansible playbook for bms-1 provisioning
Document all container launch parameters
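
The automated MongoDB dump could look roughly like the sketch below. Everything environment-specific is an assumption: the connection URI, the bucket name, the Wasabi endpoint (it differs by region), and the use of the AWS CLI as the S3 client. None of these come from the current setup, and credentials are expected to be configured separately.

```shell
# All names below are assumptions: adjust the URI, bucket, and endpoint to the real environment.
MONGO_URI="${MONGO_URI:-mongodb://localhost:27017/?replicaSet=rs0}"
S3_BUCKET="${S3_BUCKET:-example-backup-bucket}"          # hypothetical bucket
S3_ENDPOINT="${S3_ENDPOINT:-https://s3.wasabisys.com}"   # Wasabi endpoint varies by region

# Timestamped object name in UTC, of the form rs0-YYYYMMDDTHHMMSSZ.archive.gz
dump_name() {
  echo "rs0-$(date -u +%Y%m%dT%H%M%SZ).archive.gz"
}

# Dump to a compressed archive, upload it, then remove the local copy.
backup_rs0() {
  file="/tmp/$(dump_name)"
  mongodump --uri="$MONGO_URI" --gzip --archive="$file" || return 1
  aws s3 cp "$file" "s3://${S3_BUCKET}/mongodb/$(basename "$file")" \
    --endpoint-url "$S3_ENDPOINT" || return 1
  rm -f "$file"
}

# Usage from cron (daily at 02:00), assuming this file is saved as
# /usr/local/bin/backup-rs0.sh with a final line that calls backup_rs0:
#   0 2 * * * /usr/local/bin/backup-rs0.sh
```

A backup only counts once a restore has been rehearsed, so this pairs naturally with the "MongoDB dump restore to staging" test in §6.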

4.7 Supabase (managed service) failure

Symptoms: et-operational-platform database unavailable, Supabase dashboard unreachable.

Recovery:

Check Supabase status page: https://status.supabase.com
If regional outage: wait for Supabase recovery (they manage their own DR)
If project-specific issue: contact Supabase support via dashboard
Point-in-time recovery is available via Supabase dashboard if data corruption occurs

Business impact: fleet management web app (et-operational-platform) is fully unavailable. Monitoring Grafana dashboards that query Supabase directly will fail.

4.8 Claude agent auth expiry

Symptoms: Discord alert “Claude auth expired on AI-Dev-XXX” or agent GitHub Actions runs fail with auth errors.

Recovery:

# On local Windows machine:
python d:\tmp\reauth-hstgr.py

This refreshes the OAuth tokens for the affected agent. Credentials are stored at /home/claude-runner/.claude/.credentials.json on each VPS.

5. Escalation and Contact

Channel	When to use
Discord `#infra-alerts`	All P1/P2/P3 — automated alerts land here first
`radieu@gmail.com`	P1 data loss or P2 extended outage (>15 min)
OVH control panel	bms-1, bms-2, bms-3, bms-4 host-level issues
IONOS control panel	vps-i1 host-level issues
Hostinger control panel	vps-h1 host-level issues
Supabase dashboard	Supabase-managed service issues

6. DR Test Schedule

Status: Never formally tested as of 2026-06-18.

This is a known gap. The following DR tests should be scheduled:

Test	Frequency	Last tested
MongoDB PRIMARY failover (bms-2 stepDown)	Quarterly	Never
vps-i1 monitoring stack full restart	Quarterly	Never
WAHA gateway restart	Monthly	Never
MongoDB dump restore to staging	Quarterly	Never
Full bms-1 recovery simulation	Annually	Never

Recommended first DR test: MongoDB PRIMARY stepDown (safest: automatic failover, no data risk). Run mongosh --eval "rs.stepDown(60)" on bms-2, verify that bms-3 is promoted to PRIMARY, then return PRIMARY to bms-2 by running rs.stepDown(60) on bms-3.
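
The verification step of this drill can be scripted with a small retry loop. The sketch below is an assumption about how the check might be automated, not an existing script: `wait_for` and `is_primary` are hypothetical helpers, and the host arguments in the outline are placeholders for the real addresses.

```shell
# Retry a command until it succeeds: wait_for <tries> <delay_seconds> <command...>
wait_for() {
  tries="$1"; delay="$2"; shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$(( tries - 1 ))
    sleep "$delay"
  done
  return 1
}

# Succeeds if the given host reports itself as the writable PRIMARY.
is_primary() {
  [ "$(mongosh --quiet --host "$1" --eval 'db.hello().isWritablePrimary')" = "true" ]
}

# Drill outline (host arguments are placeholders for the real addresses):
#   mongosh --host <bms-2> --eval 'rs.stepDown(60)'
#   wait_for 30 2 is_primary <bms-3> && echo "bms-3 promoted"
```

Recording how long `wait_for` takes to succeed gives a first measured failover time for the RTO table in §1.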
