Disaster Recovery Runbook
Project: p24-infra / Ecotrans fleet platform
Last updated: 2026-06-18
Status: Initial draft. DR has never been formally tested; see §6.
1. RTO / RPO Targets
| Component | RTO (max downtime) | RPO (max data loss) | Notes |
|---|---|---|---|
| MongoDB rs0 (Pinbox24 prod) | Unknown | 4+ months | No automated backup — local dumps only and months old |
| bms-1 (Pinbox24 app) | Unknown | 4+ months | No automated backup — app stateless, but DB is not |
| vps-i1 monitoring stack | 1 hour | 24 hours | Prometheus data in Thanos → Wasabi S3 (2h blocks) |
| n8n automation (bms-4) | 2 hours | 7 days (workflow configs) | n8n workflows stored in PostgreSQL on bms-4; manual backup weekly |
| Supabase (managed) | N/A (SLA) | Point-in-time (managed) | Supabase handles backups; escalate to support |
| Claude agents (vps-i1, vps-h1, bms-4) | 30 min | N/A (stateless) | Re-auth may be needed; see §4.8 |
| vps-h1 WAHA gateway | Best-effort | N/A (stateless) | Messages may be lost while down; no queue |
| Cloudflare (DNS/Workers) | N/A (SLA) | N/A | Fully managed |
Honesty note: RTO/RPO for MongoDB and bms-1 are “unknown” because no recovery has ever been rehearsed. The 4+ month RPO for MongoDB reflects the age of the last known manual dump. Implementing automated backups is the single highest-priority DR gap.
2. DR Incident Classification
| Level | Definition | Example | Response |
|---|---|---|---|
| P1 — Data Loss | Irreversible data destruction or corruption | MongoDB disk failure, dropped collection | Immediate all-hands; escalate within 5 min |
| P2 — Service Down | Production service unavailable; revenue impact | bms-1 host crash, Pinbox24 unreachable | Respond within 15 min; page radieu@gmail.com |
| P3 — Degraded | Partial failure; some functions unavailable | Monitoring stack down, WAHA offline | Respond within 2 hours; Discord alert sufficient |
First response for any incident:
- Check the Discord `#infra-alerts` channel for automated alerts
- SSH to the affected server and run `docker ps` / `systemctl status` / `df -h`
- Classify the incident (P1/P2/P3)
- Follow the relevant scenario below
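The first-response checks can be wrapped in one function so that a missing tool on a given host does not stop the triage. This is a minimal sketch, not an existing script in the repo; the function name `first_response_checks` and the `--no-pager` flag are additions for illustration.

```shell
# Run the three first-response checks and keep going even if one of them
# is missing or fails on this host.
first_response_checks() {
    for check in "docker ps" "systemctl status --no-pager" "df -h"; do
        echo "== $check =="
        # Unquoted on purpose so the string splits into command + arguments.
        $check 2>&1 || echo "(check failed or not available: $check)"
    done
}
```

After SSHing to the affected server, source the file and call `first_response_checks`, then classify the incident from the output.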
3. Backup Status (as of 2026-06-18)
| Component | Backup method | Location | Frequency | Age of last backup |
|---|---|---|---|---|
| MongoDB rs0 (prod) | Manual dump | /root/w*-2026-* on bms-1 | NONE | 4+ months |
| MongoDB rs0 (replication) | bms-3 SECONDARY | bms-3 live data | Continuous | Current (live replica) |
| Prometheus metrics | Thanos sidecar → Wasabi S3 | p24-infra/thanos/ | 2h blocks | Current |
| n8n workflows (bms-4) | Manual / none | bms-4 PostgreSQL only | NONE automated | Unknown |
| Supabase (et-operational-platform) | Managed by Supabase | Supabase internal | Daily PITR | Current |
| bms-1 container volumes | NONE | — | NONE | Never |
| bms-1 PostgreSQL (host-native) | NONE | — | NONE | Never |
| bms-1 Redis (host-native) | NONE | — | NONE | Never |
| vps-i1 Grafana config | Git repo | radieu/p24-infra | On commit | Current |
| Claude agent credentials | /home/claude-runner/.claude/ | VPS local | NONE | N/A (rotates) |
Components with no backup are marked NONE above. MongoDB rs0 and bms-1 container volumes represent the most critical unmitigated data-loss risk.
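The "Age of last backup" column can be checked mechanically rather than by eye. The sketch below assumes GNU `stat` (typical on Linux hosts); the function names `newest_backup_age_days` and `backup_is_fresh` are hypothetical helpers, not part of any existing tooling.

```shell
# Print the age in whole days of the newest file among the given paths.
# Assumes GNU stat (stat -c %Y). Prints "none" and fails if no file exists.
newest_backup_age_days() {
    newest=0
    for f in "$@"; do
        [ -e "$f" ] || continue
        mtime=$(stat -c %Y "$f") || continue
        [ "$mtime" -gt "$newest" ] && newest=$mtime
    done
    if [ "$newest" -eq 0 ]; then
        echo "none"
        return 1
    fi
    now=$(date +%s)
    echo $(( (now - newest) / 86400 ))
}

# Succeeds only when a backup exists and is at most MAX_DAYS old.
backup_is_fresh() {
    max_days=$1
    shift
    age=$(newest_backup_age_days "$@") || return 1
    [ "$age" -le "$max_days" ]
}
```

For example, on bms-1: `backup_is_fresh 1 /root/w*-2026-* || echo "MongoDB dump is missing or older than 1 day"`.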
4. Failure Scenarios and Recovery Steps
4.1 bms-2 (MongoDB PRIMARY) failure
Symptoms: Pinbox24 API errors, MongoDB write failures, rs.status() shows bms-2 DOWN.
Recovery:
- Do nothing immediately — rs0 has automatic failover
- bms-3 (SECONDARY) + bms-4 (ARBITER vote) have quorum (2/3 votes)
- bms-3 will be elected PRIMARY within ~10-30 seconds
- Verify: `ssh ubuntu@51.68.155.224 "mongosh --eval 'rs.status()'"` (bms-3 should show PRIMARY)
- Update application connection strings if they hardcode bms-2's IP (they should use a replica set URI)
- Diagnose and restore bms-2; on return, it will sync as SECONDARY automatically
- Once bms-2 is healthy, optionally transfer PRIMARY back: `rs.stepDown(60)` on bms-3
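The verification step can be reduced to a one-line answer instead of reading the full `rs.status()` document. This is a sketch under two assumptions: `mongosh` on bms-3 can reach the local mongod without extra auth flags, and the helper names `rs_member_states` and `current_primary` are invented here.

```shell
# Print "host:port state" for every rs0 member, queried over SSH on bms-3.
rs_member_states() {
    ssh ubuntu@51.68.155.224 \
        "mongosh --quiet --eval 'rs.status().members.forEach(m => print(m.name + \" \" + m.stateStr))'"
}

# Read "host:port state" lines on stdin and print the member that is PRIMARY.
# Exits non-zero when no member reports PRIMARY.
current_primary() {
    awk '$2 == "PRIMARY" { print $1; found = 1 } END { exit !found }'
}
```

Usage: `rs_member_states | current_primary || echo "no PRIMARY elected yet"`.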
Risk while bms-2 is down: only 1 full-data member (bms-3) + 1 arbiter. No read replica.
4.2 bms-3 (MongoDB SECONDARY) failure
Symptoms: rs.status() shows bms-3 DOWN; bms-2 still PRIMARY; Pinbox24 continues working.
Recovery:
- rs0 remains fully operational — bms-2 (PRIMARY, 1 vote) + bms-4 (arbiter, 1 vote) = quorum
- No application impact
- Diagnose bms-3 and restore service: `ssh ubuntu@51.68.155.224 "sudo systemctl restart mongod"`
- bms-3 will resync from bms-2 automatically on restart
- Verify resync lag: `rs.printSecondaryReplicationInfo()` on primary
Risk while bms-3 is down: if bms-2 also fails, only the bms-4 arbiter remains. An arbiter holds no data, so rs0 is fully unavailable (no reads or writes) until a data-bearing member returns.
4.3 bms-4 (ARBITER) failure
Symptoms: rs.status() shows bms-4 DOWN; bms-2 PRIMARY; bms-3 SECONDARY; both still have 1 vote each.
Recovery:
- rs0 remains fully operational — bms-2 + bms-3 have 2 votes (quorum maintained)
- No immediate action required for Pinbox24
- Restore bms-4: `ssh root@54.36.123.110 "sudo systemctl restart mongod"`
- If bms-4 cannot reach rs0, re-add it as arbiter from the PRIMARY: `mongosh --eval "rs.addArb('54.36.123.110:27017')"`
Risk while bms-4 is down: if bms-2 also fails simultaneously, bms-3 has only 1 vote — cannot reach quorum (needs 2). rs0 goes read-only.
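The quorum reasoning used in 4.1 to 4.3 is plain majority arithmetic: a replica set with N voting members needs floor(N / 2) + 1 reachable votes to elect or keep a PRIMARY. A small sketch (the function names are illustrative only):

```shell
# Votes needed for a majority of N voting members: floor(N / 2) + 1.
majority_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds when the reachable voters can still elect a PRIMARY.
has_quorum() {
    total=$1
    reachable=$2
    [ "$reachable" -ge "$(majority_needed "$total")" ]
}
```

For rs0 (3 voters), `has_quorum 3 2` succeeds for any single-host failure, while `has_quorum 3 1` fails, which is the double-failure case described above.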
4.4 vps-i1 (IONOS, monitoring stack) failure
Symptoms: Grafana/Prometheus/Alertmanager unreachable; no more metric alerts.
Recovery, host still alive (Docker issue):

    ssh root@217.154.82.162
    cd /opt/p24-infra/monitoring
    docker compose ps
    docker compose up -d

Recovery, host unreachable:
- Check IONOS control panel for host status
- If host needs reboot: use IONOS panel → server restart
- If host is destroyed: provision replacement with Ansible: `ansible-playbook ansible/playbooks/provision-new-vps.yml -i ansible/inventory/`
- Prometheus TSDB data may be lost locally (up to 15 days), but historical data (Thanos blocks) is safe in Wasabi S3 at `p24-infra/thanos/`
- Restore `.env.bak` from the Infisical `p24-monitoring` project
Business impact: monitoring is internal tooling — no direct user impact, but alerts stop firing.
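After either recovery path, it helps to confirm that each monitoring service actually answers. The sketch below uses the standard health endpoints of Prometheus and Alertmanager (`/-/healthy`) and Grafana (`/api/health`). The ports are the upstream defaults and are an assumption about this stack, as are the function names.

```shell
# Succeeds when the URL responds within 5 seconds without an HTTP error
# status (curl -f treats 4xx/5xx as failure).
endpoint_healthy() {
    curl -fsS --max-time 5 -o /dev/null "$1"
}

# Check each service and report; returns non-zero if any check fails.
# Ports are upstream defaults and may differ in this deployment.
check_monitoring_stack() {
    host=${1:-localhost}
    failed=0
    for target in \
        "prometheus http://$host:9090/-/healthy" \
        "alertmanager http://$host:9093/-/healthy" \
        "grafana http://$host:3000/api/health"
    do
        name=${target%% *}
        url=${target#* }
        if endpoint_healthy "$url"; then
            echo "OK   $name"
        else
            echo "FAIL $name ($url)"
            failed=1
        fi
    done
    return "$failed"
}
```

Run `check_monitoring_stack` on vps-i1 itself after `docker compose up -d`; a non-zero exit means at least one service is still down.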
4.5 vps-h1 (Hostinger, WAHA gateway) failure
Symptoms: WhatsApp messages not received, WAHA unreachable, fleet incident notifications stop.
Recovery:

    ssh root@72.60.32.61
    cd /root
    docker compose ps
    docker compose up -d waha

If host unreachable: check Hostinger control panel. vps-h1 is a PROTECTED host, so no new services should be added during recovery. Its role is fixed: WAHA gateway + monitoring agents.
Manual incident handling: while WAHA is offline, incidents must be tracked manually.
Check the Supabase incidents table for open items: supabase.co → Table Editor → incidents.
Business impact: WhatsApp-based fleet incident reports are lost while WAHA is down. No queue — messages sent during downtime are not recoverable.
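Because every minute of WAHA downtime loses messages, it is worth confirming that the service really came back after `docker compose up -d waha` rather than assuming it did. A minimal polling sketch, assuming Docker Compose v2 (which provides `ps --status` and `--services`); the helper names and default timings are illustrative.

```shell
# Succeeds when the given compose service is listed as running.
# Run from the directory that holds the compose file (/root on vps-h1).
service_running() {
    docker compose ps --status running --services | grep -qx "$1"
}

# Poll until the service is running, or give up after ATTEMPTS tries.
wait_for_service() {
    service=$1
    attempts=${2:-12}
    delay=${3:-5}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if service_running "$service"; then
            echo "$service is running (attempt $i)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "$service did not start after $attempts attempts" >&2
    return 1
}
```

Usage: `docker compose up -d waha && wait_for_service waha`. With the defaults this waits up to about one minute.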
4.6 bms-1 (Pinbox24 production) failure
Symptoms: api.w4.pinbox24.com, w4.pinbox24.com, api.w3.pinbox24.com unreachable.
Recovery, host alive:

    ssh root@94.23.26.113
    docker ps -a   # -a also lists stopped containers
    docker start nginx-proxy v42-prod s3-v42-prod mailgun-v42-prod

Recovery, host destroyed (no automated failover):
- This is a catastrophic scenario — no DR plan exists for bms-1 total loss.
- The server was provisioned manually (no IaC/Ansible).
- MongoDB data may be partially recovered from bms-3 (rs0 SECONDARY, if rs0 is intact) or from manual dumps in `/root` on bms-1 (if disk is recoverable).
- Container images for v3.x and untagged containers (`v41-prod`, `v32-prod-socket`, `v32-prod-reso`) may be unrecoverable if not exported to a registry before failure.
- Estimated recovery time: days to weeks, depending on what is recoverable.
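The image-loss risk can be reduced ahead of time by saving each container's image to an archive. The sketch below resolves the image ID from the container, since the images are untagged, and writes a gzip-compressed tarball. The function name and the archive naming scheme are assumptions, and where the archive is then stored (Wasabi or a registry) is left open.

```shell
# Save the image behind a container to a gzip-compressed tar archive and
# print the archive path. The image ID is used because the image is untagged.
export_container_image() {
    container=$1
    out_dir=${2:-.}
    image_id=$(docker inspect --format '{{.Image}}' "$container") || return 1
    tarball="$out_dir/$container-image.tar"
    docker save -o "$tarball" "$image_id" || return 1
    gzip -f "$tarball" || return 1
    echo "$tarball.gz"
}
```

Usage on bms-1: `for c in v41-prod v32-prod-socket v32-prod-reso; do export_container_image "$c" /root/image-export; done` (the target directory is hypothetical and must exist). An image restored with `docker load` from such an archive comes back untagged and needs a `docker tag` afterwards.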
Mitigation actions required before this scenario is resolved:
- Export untagged images to Wasabi or a registry (see bms-1 open tasks)
- Implement automated MongoDB dumps to Wasabi
- Create Ansible playbook for bms-1 provisioning
- Document all container launch parameters
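As a starting point for the automated MongoDB dump item, the sketch below dumps rs0 to a gzip archive with `mongodump` and uploads it with the AWS CLI pointed at an S3-compatible endpoint. Everything configurable here is an assumption: the connection URI, the `s3://p24-infra/mongodb` destination, the Wasabi endpoint URL (it varies by region), and the presence of `mongodump` and `aws` with credentials on the host.

```shell
# Hypothetical settings: adjust the URI, bucket, and endpoint before use.
MONGO_URI=${MONGO_URI:-"mongodb://localhost:27017/?replicaSet=rs0"}
S3_DEST=${S3_DEST:-"s3://p24-infra/mongodb"}
S3_ENDPOINT=${S3_ENDPOINT:-"https://s3.wasabisys.com"}

# Archive name with a UTC timestamp, e.g. rs0-20260618T020000Z.archive.gz
dump_name() {
    echo "rs0-$(date -u +%Y%m%dT%H%M%SZ).archive.gz"
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Dump to a local gzip archive, upload it, then remove the local copy.
backup_mongodb() {
    work_dir=${1:-/var/tmp}
    archive="$work_dir/$(dump_name)"
    run mongodump --uri="$MONGO_URI" --gzip --archive="$archive" || return 1
    run aws s3 cp "$archive" "$S3_DEST/" --endpoint-url "$S3_ENDPOINT" || return 1
    run rm -f "$archive"
}
```

Try `DRY_RUN=1` first to review the exact commands, then schedule `backup_mongodb` from cron or a systemd timer. A dump only counts as a backup once a restore to staging has been tested.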
4.7 Supabase (managed service) failure
Symptoms: et-operational-platform database unavailable, Supabase dashboard unreachable.
Recovery:
- Check Supabase status page: https://status.supabase.com
- If regional outage: wait for Supabase recovery (they manage their own DR)
- If project-specific issue: contact Supabase support via dashboard
- Point-in-time recovery is available via Supabase dashboard if data corruption occurs
Business impact: fleet management web app (et-operational-platform) is fully unavailable.
Monitoring Grafana dashboards that query Supabase directly will fail.
4.8 Claude agent auth expiry
Symptoms: Discord alert “Claude auth expired on AI-Dev-XXX” or agent GitHub Actions runs fail with auth errors.
Recovery:

    # On local Windows machine:
    python d:\tmp\reauth-hstgr.py

This refreshes OAuth tokens for the affected agent. Credentials are stored at `/home/claude-runner/.claude/.credentials.json` on each VPS.
5. Escalation and Contact
| Channel | When to use |
|---|---|
| Discord `#infra-alerts` | All P1/P2/P3; automated alerts land here first |
| radieu@gmail.com | P1 data loss or P2 extended outage (>15 min) |
| OVH control panel | bms-1, bms-2, bms-3, bms-4 host-level issues |
| IONOS control panel | vps-i1 host-level issues |
| Hostinger control panel | vps-h1 host-level issues |
| Supabase dashboard | Supabase-managed service issues |
6. DR Test Schedule
Status: Never formally tested as of 2026-06-18.
This is a known gap. The following DR tests should be scheduled:
| Test | Frequency | Last tested |
|---|---|---|
| MongoDB PRIMARY failover (bms-2 stepDown) | Quarterly | Never |
| vps-i1 monitoring stack full restart | Quarterly | Never |
| WAHA gateway restart | Monthly | Never |
| MongoDB dump restore to staging | Quarterly | Never |
| Full bms-1 recovery simulation | Annually | Never |
Recommended first DR test: MongoDB PRIMARY stepDown (safest — automatic failover, no data risk).
Run: `mongosh --eval "rs.stepDown(60)"` on bms-2, verify that bms-3 is promoted to PRIMARY, then run the same command on bms-3 to hand PRIMARY back once bms-2 has resynced.
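The stepDown test can be scripted so that the result is recorded the same way each quarter. This is a sketch under stated assumptions: `mongosh` runs on a host that can reach rs0 through `MONGO_URI` (a hypothetical default is shown), the server supports `db.hello()` (MongoDB 4.4.2 or later), and all function names are invented here.

```shell
# Hypothetical connection string; a replica set URI connects to the PRIMARY.
MONGO_URI=${MONGO_URI:-"mongodb://localhost:27017/?replicaSet=rs0"}

# Print the current PRIMARY as host:port ("undefined" during an election).
current_primary() {
    mongosh "$MONGO_URI" --quiet --eval 'print(db.hello().primary)'
}

# Poll until a PRIMARY other than OLD is reported, up to TIMEOUT seconds.
wait_for_new_primary() {
    old=$1
    timeout=${2:-60}
    waited=0
    while [ "$waited" -le "$timeout" ]; do
        new=$(current_primary 2>/dev/null)
        if [ -n "$new" ] && [ "$new" != "undefined" ] && [ "$new" != "$old" ]; then
            echo "$new"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Step the PRIMARY down for 60 seconds and report who takes over.
run_stepdown_test() {
    before=$(current_primary) || return 1
    echo "PRIMARY before: $before"
    # stepDown may drop the connection on older servers; ignore its exit code.
    mongosh "$MONGO_URI" --quiet --eval 'rs.stepDown(60)' || true
    after=$(wait_for_new_primary "$before" 60) || {
        echo "no new PRIMARY within 60s" >&2
        return 1
    }
    echo "PRIMARY after:  $after"
}
```

Run `run_stepdown_test`, record the two lines of output in the "Last tested" column, and repeat it once bms-2 has resynced to hand PRIMARY back.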