Disaster Recovery Runbook

Project: p24-infra / Ecotrans fleet platform
Last updated: 2026-06-18
Status: Initial draft. DR has never been formally tested. See §6.


1. RTO / RPO Targets

| Component | RTO (max downtime) | RPO (max data loss) | Notes |
|---|---|---|---|
| MongoDB rs0 (Pinbox24 prod) | Unknown | 4+ months | No automated backup; only local dumps, months old |
| bms-1 (Pinbox24 app) | Unknown | 4+ months | No automated backup; app is stateless, but its DB is not |
| vps-i1 monitoring stack | 1 hour | 24 hours | Prometheus data in Thanos → Wasabi S3 (2h blocks) |
| n8n automation (bms-4) | 2 hours | 7 days (workflow configs) | n8n workflows stored in PostgreSQL on bms-4; manual backup weekly |
| Supabase (managed) | N/A (SLA) | Point-in-time (managed) | Supabase handles backups; escalate to support |
| Claude agents (vps-i1, vps-h1, bms-4) | 30 min | N/A (stateless) | Re-auth may be needed; see §4.8 |
| vps-h1 WAHA gateway | Best-effort | N/A (stateless) | Messages may be lost while down; no queue |
| Cloudflare (DNS/Workers) | N/A (SLA) | N/A | Fully managed |

Honesty note: RTO/RPO for MongoDB and bms-1 are “unknown” because no recovery has ever been rehearsed. The 4+ month RPO for MongoDB reflects the age of the last known manual dump. Implementing automated backups is the single highest-priority DR gap.


2. DR Incident Classification

| Level | Definition | Example | Response |
|---|---|---|---|
| P1 - Data Loss | Irreversible data destruction or corruption | MongoDB disk failure, dropped collection | Immediate all-hands; escalate within 5 min |
| P2 - Service Down | Production service unavailable; revenue impact | bms-1 host crash, Pinbox24 unreachable | Respond within 15 min; page radieu@gmail.com |
| P3 - Degraded | Partial failure; some functions unavailable | Monitoring stack down, WAHA offline | Respond within 2 hours; Discord alert sufficient |

First response for any incident:

  1. Check Discord #infra-alerts channel for automated alerts
  2. SSH to affected server and run docker ps / systemctl status / df -h
  3. Classify the incident (P1/P2/P3)
  4. Follow the relevant scenario below
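
The first-response steps can be sketched as a small helper. This is a minimal sketch rather than existing tooling: the function names `classify` and `triage_command` are hypothetical, the two boolean inputs are a simplification of the classification table, and `--no-pager` is added only so that `systemctl status` does not wait on a pager over SSH.

```python
import shlex

# Response targets copied from the classification table in this section.
RESPONSE = {
    "P1": "Immediate all-hands; escalate within 5 min",
    "P2": "Respond within 15 min; page radieu@gmail.com",
    "P3": "Respond within 2 hours; Discord alert sufficient",
}


def classify(data_loss: bool, production_down: bool) -> str:
    """Return the incident level; the most severe condition wins."""
    if data_loss:
        return "P1"
    if production_down:
        return "P2"
    return "P3"


def triage_command(ssh_target: str) -> list[str]:
    """Build the argv for step 2: container, service and disk overview."""
    remote = "docker ps; systemctl status --no-pager; df -h"
    return ["ssh", ssh_target, remote]


if __name__ == "__main__":
    # Example: bms-1 is unreachable but no data has been lost.
    level = classify(data_loss=False, production_down=True)
    print(level, "->", RESPONSE[level])
    # Only prints the command; nothing is executed remotely here.
    print(shlex.join(triage_command("root@94.23.26.113")))
```

The command is returned as an argv list so that it can be passed to `subprocess.run` without going through a local shell.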

3. Backup Status (as of 2026-06-18)

| Component | Backup method | Location | Frequency | Age of last backup |
|---|---|---|---|---|
| MongoDB rs0 (prod) | Manual dump | /root/w*-2026-* on bms-1 | NONE | 4+ months |
| MongoDB rs0 (replication) | bms-3 SECONDARY | bms-3 live data | Continuous | Current (live replica) |
| Prometheus metrics | Thanos sidecar → Wasabi S3 | p24-infra/thanos/ | 2h blocks | Current |
| n8n workflows (bms-4) | Manual / none | bms-4 PostgreSQL only | NONE automated | Unknown |
| Supabase (et-operational-platform) | Managed by Supabase | Supabase internal | Daily PITR | Current |
| bms-1 container volumes | NONE | | NONE | Never |
| bms-1 PostgreSQL (host-native) | NONE | | NONE | Never |
| bms-1 Redis (host-native) | NONE | | NONE | Never |
| vps-i1 Grafana config | Git repo | radieu/p24-infra | On commit | Current |
| Claude agent credentials | /home/claude-runner/.claude/ | VPS local | NONE | N/A (rotates) |

Components marked NONE in the table above have no backup. MongoDB rs0 and the bms-1 container volumes are the most critical unmitigated data-loss risks.
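
The "age of last backup" column could be checked automatically instead of by hand. The following is a minimal sketch under stated assumptions: `rpo_breached` is a hypothetical helper, the timestamps in the usage section are illustrative rather than measured, and the 1-day target for the bms-1 volumes is a placeholder, since this runbook does not define one.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rpo_breached(last_backup: Optional[datetime], now: datetime, rpo: timedelta) -> bool:
    """True when the newest backup is older than the RPO target.

    None means no backup has ever been taken, which always counts as a breach.
    """
    if last_backup is None:
        return True
    return (now - last_backup) > rpo


if __name__ == "__main__":
    now = datetime(2026, 6, 18, 12, 0, tzinfo=timezone.utc)
    # Illustrative values only; real timestamps would come from file
    # modification times or an object-storage listing.
    checks = {
        "prometheus (24h target)": (now - timedelta(hours=2), timedelta(hours=24)),
        "n8n workflows (7d target)": (now - timedelta(days=10), timedelta(days=7)),
        "bms-1 volumes (placeholder 1d target)": (None, timedelta(days=1)),
    }
    for name, (last, rpo) in checks.items():
        state = "BREACH" if rpo_breached(last, now, rpo) else "ok"
        print(f"{name}: {state}")
```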


4. Failure Scenarios and Recovery Steps

4.1 bms-2 (MongoDB PRIMARY) failure

Symptoms: Pinbox24 API errors, MongoDB write failures, rs.status() shows bms-2 DOWN.

Recovery:

  1. Do nothing immediately — rs0 has automatic failover
  2. bms-3 (SECONDARY) + bms-4 (ARBITER vote) have quorum (2/3 votes)
  3. bms-3 will be elected PRIMARY within ~10-30 seconds
  4. Verify: ssh ubuntu@51.68.155.224 "mongosh --eval 'rs.status()'" — bms-3 should show PRIMARY
  5. Update application connection strings if they hardcode bms-2’s IP (they should use a replica set URI)
  6. Diagnose and restore bms-2 — on return, it will sync as SECONDARY automatically
  7. Once bms-2 is healthy, optionally transfer PRIMARY back: rs.stepDown(60) on bms-3

Risk while bms-2 is down: only 1 full-data member (bms-3) + 1 arbiter. No read replica.
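
Step 5 depends on applications connecting through a replica set URI rather than a single host. A minimal sketch of how such a URI is assembled follows; the host names are hypothetical placeholders (this runbook does not list them), and `replica_set_uri` is an illustrative helper, not part of the platform code.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts: list[str], replica_set: str = "rs0", database: str = "") -> str:
    """Build a MongoDB connection string that lists several seed hosts.

    With a seed list and the replicaSet option, the driver discovers the
    current PRIMARY itself, so a failover needs no configuration change.
    """
    seeds = ",".join(hosts)
    query = urlencode({"replicaSet": replica_set})
    return f"mongodb://{seeds}/{database}?{query}"


if __name__ == "__main__":
    # Hypothetical host names for the two data-bearing members.
    print(replica_set_uri(["bms-2.example.internal:27017", "bms-3.example.internal:27017"]))
```

Only the data-bearing members are listed in this sketch; the arbiter serves no data, and drivers learn the full topology from any reachable seed.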


4.2 bms-3 (MongoDB SECONDARY) failure

Symptoms: rs.status() shows bms-3 DOWN; bms-2 still PRIMARY; Pinbox24 continues working.

Recovery:

  1. rs0 remains fully operational — bms-2 (PRIMARY, 1 vote) + bms-4 (arbiter, 1 vote) = quorum
  2. No application impact
  3. Diagnose bms-3 and restore service: ssh ubuntu@51.68.155.224 "sudo systemctl restart mongod"
  4. bms-3 will resync from bms-2 automatically on restart
  5. Verify resync lag: rs.printSecondaryReplicationInfo() on primary

Risk while bms-3 is down: if bms-2 also fails, only the bms-4 arbiter remains. An arbiter holds no data, so rs0 becomes completely unavailable (no reads or writes) until a data-bearing member returns.
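
The lag check in step 5 can also be computed from the replica set status document. This is a sketch, assuming the status is fetched as a Python dict (for example via a driver's `replSetGetStatus` command) in which each member carries `name`, `stateStr` and a datetime `optimeDate`; `secondary_lag_seconds` is a hypothetical helper and the member names in the usage section are placeholders.

```python
from datetime import datetime, timedelta, timezone


def secondary_lag_seconds(status: dict) -> dict[str, float]:
    """Seconds each SECONDARY is behind the PRIMARY's last applied write."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no PRIMARY in replica set status; lag is undefined")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


if __name__ == "__main__":
    t = datetime(2026, 6, 18, 12, 0, tzinfo=timezone.utc)
    # Hand-written sample status; member names are placeholders.
    sample = {
        "members": [
            {"name": "bms-2:27017", "stateStr": "PRIMARY", "optimeDate": t},
            {"name": "bms-3:27017", "stateStr": "SECONDARY", "optimeDate": t - timedelta(seconds=42)},
            {"name": "bms-4:27017", "stateStr": "ARBITER"},
        ]
    }
    print(secondary_lag_seconds(sample))
```

Arbiters are skipped because they apply no writes and therefore have no meaningful lag.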


4.3 bms-4 (ARBITER) failure

Symptoms: rs.status() shows bms-4 DOWN; bms-2 PRIMARY; bms-3 SECONDARY; both still have 1 vote each.

Recovery:

  1. rs0 remains fully operational — bms-2 + bms-3 have 2 votes (quorum maintained)
  2. No immediate action required for Pinbox24
  3. Restore bms-4: ssh root@54.36.123.110 "sudo systemctl restart mongod"
  4. If bms-4 cannot reach rs0, re-add it as arbiter from the PRIMARY:
    mongosh --eval "rs.addArb('54.36.123.110:27017')"

Risk while bms-4 is down: if bms-2 also fails simultaneously, bms-3 has only 1 vote — cannot reach quorum (needs 2). rs0 goes read-only.
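
The vote counting used in scenarios 4.1 to 4.3 can be written out as a short calculation. This is a simplified sketch: it models only the majority rule and the fact that an arbiter cannot become PRIMARY, and it ignores member priorities and replication state. The names `VOTES`, `DATA_BEARING` and `can_elect_primary` are illustrative.

```python
# One vote per member, as described in the scenarios above.
VOTES = {"bms-2": 1, "bms-3": 1, "bms-4": 1}
DATA_BEARING = {"bms-2", "bms-3"}  # bms-4 is an arbiter and holds no data


def majority(total_votes: int) -> int:
    """Smallest number of votes that is more than half of the total."""
    return total_votes // 2 + 1


def can_elect_primary(up: set[str]) -> bool:
    """True if the reachable members hold a majority and include a data-bearing node."""
    votes_up = sum(VOTES[m] for m in up)
    has_candidate = bool(up & DATA_BEARING)
    return votes_up >= majority(sum(VOTES.values())) and has_candidate


if __name__ == "__main__":
    for up in ({"bms-3", "bms-4"}, {"bms-2", "bms-4"}, {"bms-2", "bms-3"}, {"bms-3"}, {"bms-4"}):
        print(sorted(up), "->", "PRIMARY possible" if can_elect_primary(up) else "no PRIMARY")
```

With three votes the majority is two, which is why any single failure is survivable and any double failure is not.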


4.4 vps-i1 (IONOS, monitoring stack) failure

Symptoms: Grafana/Prometheus/Alertmanager unreachable; no more metric alerts.

Recovery — host still alive (Docker issue):

ssh root@217.154.82.162
cd /opt/p24-infra/monitoring
docker compose ps
docker compose up -d

Recovery — host unreachable:

  1. Check IONOS control panel for host status
  2. If host needs reboot: use IONOS panel → server restart
  3. If host is destroyed: provision replacement with Ansible:
    ansible-playbook ansible/playbooks/provision-new-vps.yml -i ansible/inventory/
  4. Prometheus TSDB data may be lost locally (up to 15 days), but historical data (Thanos blocks) is safe in Wasabi S3 at p24-infra/thanos/
  5. Restore .env.bak from Infisical p24-monitoring project

Business impact: monitoring is internal tooling — no direct user impact, but alerts stop firing.
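
After either recovery path, the stack can be confirmed healthy over HTTP. The sketch below assumes the upstream default ports (Prometheus 9090, Alertmanager 9093, Grafana 3000) and plain HTTP reachable from wherever the check runs; the actual vps-i1 deployment may use different ports or sit behind a reverse proxy. `health_urls` and `is_healthy` are hypothetical helpers.

```python
import urllib.error
import urllib.request

# (port, path) per service: upstream defaults, which are an assumption here.
DEFAULT_ENDPOINTS = {
    "prometheus": (9090, "/-/healthy"),
    "alertmanager": (9093, "/-/healthy"),
    "grafana": (3000, "/api/health"),
}


def health_urls(host: str) -> dict[str, str]:
    """Map each service name to its health-check URL on the given host."""
    return {
        name: f"http://{host}:{port}{path}"
        for name, (port, path) in DEFAULT_ENDPOINTS.items()
    }


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    # Only prints the URLs; call is_healthy(url) on each to probe them.
    for name, url in health_urls("217.154.82.162").items():
        print(name, url)
```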


4.5 vps-h1 (Hostinger, WAHA gateway) failure

Symptoms: WhatsApp messages not received, WAHA unreachable, fleet incident notifications stop.

Recovery:

ssh root@72.60.32.61
cd /root
docker compose ps
docker compose up -d waha

If host unreachable: check Hostinger control panel. vps-h1 is a PROTECTED host — no new services should be added during recovery. Role is fixed: WAHA gateway + monitoring agents.

Manual incident handling: while WAHA is offline, incidents must be tracked manually. Check the Supabase incidents table for open items: supabase.co → Table Editor → incidents.

Business impact: WhatsApp-based fleet incident reports are lost while WAHA is down. No queue — messages sent during downtime are not recoverable.


4.6 bms-1 (Pinbox24 production) failure

Symptoms: api.w4.pinbox24.com, w4.pinbox24.com, api.w3.pinbox24.com unreachable.

Recovery — host alive:

ssh root@94.23.26.113
docker ps   # check which containers are stopped
docker start nginx-proxy v42-prod s3-v42-prod mailgun-v42-prod
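
Spotting the stopped containers by eye is error-prone during an incident. A minimal sketch of the comparison follows, assuming the list of running names is captured with `docker ps --format '{{.Names}}'`; `stopped_containers` is a hypothetical helper and `EXPECTED` simply repeats the four container names from the command above.

```python
# Containers that the recovery command above starts.
EXPECTED = ["nginx-proxy", "v42-prod", "s3-v42-prod", "mailgun-v42-prod"]


def stopped_containers(running_names_output: str, expected: list[str] = EXPECTED) -> list[str]:
    """Return expected containers missing from `docker ps --format '{{.Names}}'` output."""
    running = {line.strip() for line in running_names_output.splitlines() if line.strip()}
    return [name for name in expected if name not in running]


if __name__ == "__main__":
    # Sample output in which two containers are not running.
    sample = "nginx-proxy\ns3-v42-prod\n"
    missing = stopped_containers(sample)
    if missing:
        print("docker start " + " ".join(missing))
```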

Recovery — host destroyed (no automated failover):

  1. This is a catastrophic scenario — no DR plan exists for bms-1 total loss.
  2. The server was provisioned manually (no IaC/Ansible).
  3. MongoDB data may be partially recovered from bms-3 (rs0 SECONDARY if rs0 is intact) or from manual dumps in /root on bms-1 (if disk is recoverable).
  4. Container images for v3.x and untagged containers (v41-prod, v32-prod-socket, v32-prod-reso) may be unrecoverable if not exported to a registry before failure.
  5. Estimated recovery time: days to weeks, depending on what is recoverable.

Mitigation actions required before this scenario can be considered covered:

  • Export untagged images to Wasabi or a registry (see bms-1 open tasks)
  • Implement automated MongoDB dumps to Wasabi
  • Create Ansible playbook for bms-1 provisioning
  • Document all container launch parameters
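
The automated MongoDB dump item could start from something like the sketch below. It only builds the command lines and assumes `mongodump` and the AWS CLI are installed on the host that runs it; the bucket name, the `mongodb/` key prefix, the host names and the Wasabi endpoint URL are placeholders to be replaced with the real values, and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def dump_filename(now: datetime, replica_set: str = "rs0") -> str:
    """Timestamped archive name so dumps sort chronologically (pass a UTC time)."""
    return f"{replica_set}-{now:%Y%m%dT%H%M%SZ}.archive.gz"


def mongodump_argv(uri: str, archive_path: str) -> list[str]:
    """argv for a compressed single-file dump of the whole deployment.

    --oplog captures writes made during the dump; it only works for a
    full dump (no --db / --collection) against a replica set member.
    """
    return ["mongodump", f"--uri={uri}", f"--archive={archive_path}", "--gzip", "--oplog"]


def upload_argv(archive_path: str, bucket: str, endpoint_url: str) -> list[str]:
    """argv for copying the archive to an S3-compatible bucket with the AWS CLI."""
    key = archive_path.rsplit("/", 1)[-1]
    return ["aws", "s3", "cp", archive_path, f"s3://{bucket}/mongodb/{key}",
            "--endpoint-url", endpoint_url]


if __name__ == "__main__":
    path = "/tmp/" + dump_filename(datetime.now(timezone.utc))
    # Placeholder URI, bucket and endpoint; nothing is executed here.
    print(mongodump_argv("mongodb://bms-2.example.internal,bms-3.example.internal/?replicaSet=rs0", path))
    print(upload_argv(path, "example-backup-bucket", "https://s3.wasabisys.com"))
```

A dump only counts as a backup once a restore has been rehearsed, which is what the "MongoDB dump restore to staging" test in §6 is for.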

4.7 Supabase (managed service) failure

Symptoms: et-operational-platform database unavailable, Supabase dashboard unreachable.

Recovery:

  1. Check Supabase status page: https://status.supabase.com
  2. If regional outage: wait for Supabase recovery (they manage their own DR)
  3. If project-specific issue: contact Supabase support via dashboard
  4. Point-in-time recovery is available via Supabase dashboard if data corruption occurs

Business impact: fleet management web app (et-operational-platform) is fully unavailable. Monitoring Grafana dashboards that query Supabase directly will fail.


4.8 Claude agent auth expiry

Symptoms: Discord alert “Claude auth expired on AI-Dev-XXX” or agent GitHub Actions runs fail with auth errors.

Recovery:

# On local Windows machine:
python d:\tmp\reauth-hstgr.py

This refreshes the OAuth tokens for the affected agent. Credentials are stored at /home/claude-runner/.claude/.credentials.json on each VPS.


5. Escalation and Contact

| Channel | When to use |
|---|---|
| Discord #infra-alerts | All P1/P2/P3; automated alerts land here first |
| radieu@gmail.com | P1 data loss or P2 extended outage (>15 min) |
| OVH control panel | bms-1, bms-2, bms-3, bms-4 host-level issues |
| IONOS control panel | vps-i1 host-level issues |
| Hostinger control panel | vps-h1 host-level issues |
| Supabase dashboard | Supabase-managed service issues |

6. DR Test Schedule

Status: Never formally tested as of 2026-06-18.

This is a known gap. The following DR tests should be scheduled:

| Test | Frequency | Last tested |
|---|---|---|
| MongoDB PRIMARY failover (bms-2 stepDown) | Quarterly | Never |
| vps-i1 monitoring stack full restart | Quarterly | Never |
| WAHA gateway restart | Monthly | Never |
| MongoDB dump restore to staging | Quarterly | Never |
| Full bms-1 recovery simulation | Annually | Never |

Recommended first DR test: MongoDB PRIMARY stepDown (safest — automatic failover, no data risk). Run: mongosh --eval "rs.stepDown(60)" on bms-2, verify bms-3 promotes, then step back.
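
The pass/fail criterion for this first drill can be made explicit by comparing the replica set status before and after the stepDown. This is a sketch, assuming each status is available as a dict whose members carry `name` and `stateStr`; `primary_of` and `failover_succeeded` are hypothetical helpers and the member names are placeholders.

```python
from typing import Optional


def primary_of(status: dict) -> Optional[str]:
    """Name of the member currently in PRIMARY state, or None if there is none."""
    for member in status["members"]:
        if member["stateStr"] == "PRIMARY":
            return member["name"]
    return None


def failover_succeeded(before: dict, after: dict, expected_new: Optional[str] = None) -> bool:
    """True if the PRIMARY role moved to a different member (optionally a specific one)."""
    old, new = primary_of(before), primary_of(after)
    if old is None or new is None or new == old:
        return False
    return expected_new is None or new == expected_new


if __name__ == "__main__":
    # Hand-written samples standing in for rs.status() before and after the drill.
    before = {"members": [{"name": "bms-2:27017", "stateStr": "PRIMARY"},
                          {"name": "bms-3:27017", "stateStr": "SECONDARY"},
                          {"name": "bms-4:27017", "stateStr": "ARBITER"}]}
    after = {"members": [{"name": "bms-2:27017", "stateStr": "SECONDARY"},
                         {"name": "bms-3:27017", "stateStr": "PRIMARY"},
                         {"name": "bms-4:27017", "stateStr": "ARBITER"}]}
    print("drill passed" if failover_succeeded(before, after, "bms-3:27017") else "drill failed")
```

Recording the result and the date of each run would fill the "Last tested" column in the schedule.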