Spec 01 — Backups for stateful services
Purpose
Today, only Prometheus metrics survive a VPS loss (Thanos → Wasabi). Everything else — n8n workflows, WAHA WhatsApp session, Grafana dashboards, Traccar history, Caddy/Traefik certs — exists in exactly one place. If either VPS goes away, automations stop, certs need re-issuing (with rate-limit risk), and the WhatsApp gateway needs a phone-side QR scan to recover.
Wasabi is already paid for, has free egress, and our total stateful footprint is small. There’s no good reason to keep operating without backups.
Answer: do we back up whole volumes? How much data?
Short answer: No. Mostly we use service-native exports (REST API, SQLite .backup, mysqldump, pg_dump) plus a tar of the volume for the things that have no good export path (WAHA session, Uptime Kuma, ACME cert stores). Total nightly footprint is ~1–2 GB compressed.
Measured volume sizes (2026-05-11)
Hostinger VPS (vps-h1):
| Volume | Size | Strategy |
|---|---|---|
| n8n_data | 4.2 GB | pg_dump-style SQLite dump (n8n uses SQLite by default) + nightly REST API export of workflows/credentials |
| root_waha_sessions | 11 MB | Tar the volume; no API export for session state |
| traefik_data | 64 KB | Tar acme.json; small, irreplaceable (cert keys) |
IONOS VPS (vps-i1):
| Volume | Size | Strategy |
|---|---|---|
| monitoring_prometheus_data | 67 MB | Skip; Thanos already ships to Wasabi |
| monitoring_grafana_data | 100 MB | Grafana API export (dashboards, datasources, folders, users); restore is JSON-driven |
| traccar_traccar-db | 209 MB | mysqldump |
| monitoring_caddy_data | 132 KB | Tar (certs + ACME state) |
| monitoring_uptime_kuma_data | <50 MB | Tar the volume (SQLite + uploaded icons); Kuma 1.x has no clean API export path (see spec 07) |
| monitoring_loki_data | ~2 GB at steady state | Skip; 14-day rolling buffer of container logs (spec 02), reconstitutable, low value after 14 d |
| /root/openclaw | 385 MB | Skip; it’s a git checkout + node_modules, rebuilt from radieu/openclaw |
| Supabase Pro | n/a | Rely on built-in PITR (7 days) + weekly pg_dump to Wasabi for cold storage |
Totals:
- Raw: ~4.7 GB (most of it n8n executions table — see §pruning below)
- Compressed (zstd -19): est. 800 MB – 1.5 GB per night
- 30-day retention: ~24–45 GB (30 × 0.8–1.5 GB)
- Wasabi cost at 5.99 €/TB/month: ~0.14–0.27 €/month. Negligible.
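The totals above are plain arithmetic, so they are easy to recompute when the nightly size changes. A minimal sketch, assuming decimal units (1 TB = 1000 GB), a flat per-TB rate, and only the 30 daily copies (monthly copies are ignored); the function name `estimate` is illustrative:

```shell
#!/bin/sh
# Storage estimate: nightly compressed size (GB) x retained nights,
# priced at a flat per-TB monthly rate. Decimal units: 1 TB = 1000 GB.
estimate() {
    # $1 = nightly size in GB, $2 = nights retained, $3 = price in EUR per TB-month
    awk -v gb="$1" -v nights="$2" -v eur_tb="$3" 'BEGIN {
        total_gb = gb * nights
        printf "%.0f GB, %.2f EUR/month\n", total_gb, total_gb / 1000 * eur_tb
    }'
}

estimate 0.8 30 5.99   # low end of the nightly range  -> 24 GB, 0.14 EUR/month
estimate 1.5 30 5.99   # high end of the nightly range -> 45 GB, 0.27 EUR/month
```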
Why not just tar n8n_data whole?
You can, but:
- n8n’s SQLite (database.sqlite) requires either a quiesce (stop the container, ~30 s downtime) or a .backup command via the sqlite3 CLI inside the container. A raw tar of a live SQLite file risks a corrupt restore.
- The 4.2 GB is dominated by the execution_entity table (run history). Workflow definitions + credentials are kilobytes. We don’t actually need 4 GB of execution history to recover; we need the workflows.
- n8n’s REST API gives us decoupled restore: we can import workflows into a fresh n8n instance on a different host without volume-level surgery.
Decision: dump SQLite via .backup (consistent snapshot, no downtime), and export workflows/credentials via REST API. Two backups; the API one is the recovery path; the SQLite one is the safety net.
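The two-backup decision could look roughly like the sketch below. It is not the final script: the container name, the database path inside the container, the base URL, and both function names are assumptions, and it presumes the sqlite3 CLI is actually present in the n8n container. Only a single page of workflows is fetched; pagination and the credentials export are left out and should be checked against the n8n version in use.

```shell
#!/bin/sh
# Sketch only: container name, DB path, and base URL are assumptions.
N8N_CONTAINER="${N8N_CONTAINER:-n8n}"
N8N_DB_PATH="${N8N_DB_PATH:-/home/node/.n8n/database.sqlite}"
N8N_BASE_URL="${N8N_BASE_URL:-http://localhost:5678}"

# Safety net: consistent snapshot via SQLite's online .backup command.
# Assumes the sqlite3 CLI is available inside the container.
snapshot_n8n_sqlite() {
    # $1 = destination file on the host
    docker exec "$N8N_CONTAINER" sqlite3 "$N8N_DB_PATH" ".backup '/tmp/n8n-snapshot.sqlite'" &&
    docker cp "$N8N_CONTAINER:/tmp/n8n-snapshot.sqlite" "$1" &&
    docker exec "$N8N_CONTAINER" rm -f /tmp/n8n-snapshot.sqlite
}

# Recovery path: workflow definitions through the public REST API.
# Fetches one page (up to 250 workflows); the response cursor is ignored here.
export_n8n_workflows() {
    # $1 = destination JSON file; N8N_API_KEY must be set in the environment
    curl --fail --silent --show-error \
        -H "X-N8N-API-KEY: ${N8N_API_KEY:?N8N_API_KEY is not set}" \
        "$N8N_BASE_URL/api/v1/workflows?limit=250" -o "$1"
}
```

Keeping the two functions separate means a failed API export does not block the SQLite snapshot, and vice versa.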
Pruning (companion task)
Configure n8n execution pruning so the SQLite stays under 1 GB:
N8N_DEFAULT_BINARY_DATA_MODE: filesystem
EXECUTIONS_DATA_PRUNE: "true"
EXECUTIONS_DATA_MAX_AGE: "168" # 7 days
EXECUTIONS_DATA_PRUNE_MAX_COUNT: 10000
This is a separate, smaller change but should land in the same PR because it materially affects backup size.
Rulebook (operating rules)
- Restore drill quarterly. Run scripts/backup-restore-drill.sh (created by this spec) on a scratch Docker host. Restore must produce a working n8n with all current workflows visible. If it fails, treat as P1.
- Backup failures page within 24 h. A missing backup file for >24 h opens a severity: critical Discord alert + GH issue. No silent failures.
- Never restore over production blindly. Always restore into a scratch container first; diff against current state.
- Encryption at rest. Wasabi bucket has server-side encryption on. Backup tarballs are additionally encrypted client-side with age using a key stored in 1Password (out-of-band, see spec 03). If Wasabi creds leak, backups are still confidential.
- Retention: 30 daily + 12 monthly. Beyond that, delete; old backups of fast-moving config are not useful.
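The retention rule can be expressed as a small predicate, which is handy for dry-running a cleanup before trusting a bucket lifecycle rule. A minimal sketch under two assumptions that the spec does not state: "monthly" means the copy taken on the 1st of a month, and 12 months is approximated as 365 days. It needs GNU date (`-d`); the function name is hypothetical.

```shell
#!/bin/sh
# Decide whether a dated backup should be kept under "30 daily + 12 monthly".
# Assumptions: "monthly" = backup taken on the 1st of a month, and
# 12 months is approximated as 365 days. Requires GNU date (-d).
keep_backup() {
    # $1 = backup date (YYYY-MM-DD), $2 = today's date (YYYY-MM-DD)
    backup_s=$(date -u -d "$1" +%s) || return 2
    today_s=$(date -u -d "$2" +%s) || return 2
    age_days=$(( (today_s - backup_s) / 86400 ))
    if [ "$age_days" -lt 30 ]; then
        return 0                           # one of the 30 most recent dailies
    fi
    case "$1" in
        *-01) [ "$age_days" -le 365 ] ;;   # first-of-month copy, under a year old
        *)    return 1 ;;
    esac
}

# Usage: keep_backup 2026-01-01 2026-05-12 && echo keep || echo delete
```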
Architecture
┌─ vps-h1 (Hostinger) ───────────────────────────────────┐
│ /opt/p24-infra/scripts/backup-hstgr.sh (cron 02:00) │
│ ├─ docker exec n8n sqlite3 ... .backup → /tmp │
│ ├─ curl n8n REST → workflows.json, credentials.json │
│ ├─ tar root_waha_sessions, traefik_data │
│ ├─ age -r $AGE_PUBKEY -o backup.tar.zst.age │
│ └─ aws s3 cp s3://ecotrans-backups/vps-h1/YYYY-MM-DD/│
└────────────────────────────────────────────────────────┘
↓
┌──────────────┐
│ Wasabi S3 │
│ ecotrans- │
│ backups │
└──────────────┘
↑
┌─ vps-i1 (IONOS) ───────────────────────────────────────┐
│ /opt/p24-infra/scripts/backup-ionos.sh (cron 02:30) │
│ ├─ mysqldump traccar-db │
│ ├─ curl grafana API → dashboards.json │
│ ├─ pg_dump supabase (weekly only, Sundays) │
│ ├─ tar monitoring_caddy_data │
│ ├─ zstd compress, age encrypt                         │
│ └─ aws s3 cp s3://ecotrans-backups/vps-i1/YYYY-MM-DD/│
└────────────────────────────────────────────────────────┘
↓
Prometheus pushgateway
↓
Grafana panel: "Last successful backup age"
↓
Alert: BackupStale (>26h)
Implementation plan
Phase 1 — provisioning (0.5 d)
- Create new Wasabi bucket ecotrans-backups (separate from ecotrans-monitoring so retention policies don’t collide).
- Generate an age keypair locally; public key committed to monitoring/.env.example, private key stored in 1Password + on each VPS at /root/.age/backup.key (mode 600).
- Add GH Secrets: WASABI_BACKUP_ACCESS_KEY, WASABI_BACKUP_SECRET_KEY (scoped IAM user, limited to put/get/list on ecotrans-backups/*).
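Since a missing key or credential would only surface at 02:00, a preflight check at the top of each backup script may be worth having. The sketch below is one way to do it; both function names are illustrative, and `stat -c` is the GNU form.

```shell
#!/bin/sh
# Preflight sketch: refuse to run a backup when the age key or credentials
# are missing or too permissive. Uses GNU stat (-c); names are illustrative.
check_key_file() {
    # $1 = path to the age identity file
    [ -f "$1" ] || { echo "missing key file: $1" >&2; return 1; }
    mode=$(stat -c '%a' "$1") || return 1
    [ "$mode" = "600" ] || { echo "key file $1 has mode $mode, want 600" >&2; return 1; }
}

check_required_vars() {
    # $@ = names of variables that must be set and non-empty
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_key_file /root/.age/backup.key &&
#   check_required_vars WASABI_BACKUP_ACCESS_KEY WASABI_BACKUP_SECRET_KEY
```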
Phase 2 — backup scripts (0.5 d)
Files to create:
- scripts/backup-common.sh: shared functions log(), notify_discord(), push_metric(), encrypt_and_upload()
- scripts/backup-hstgr.sh: runs on Hostinger
- scripts/backup-ionos.sh: runs on IONOS
- scripts/backup-restore-drill.sh: restores latest backup into a scratch Docker network for verification
- monitoring/prometheus/rules/backups.yml: alerts
  - BackupStale: (time() - backup_last_success_timestamp) > 93600 (26 h, severity: critical)
  - BackupSizeRegression: backup size dropped >50% vs 7-day average (severity: warning)
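The two alert conditions are simple enough to mirror locally, for example to sanity-check thresholds in a dry run before the Prometheus rules exist. This is a sketch of the same comparisons in shell, not the rule file itself; the function names are hypothetical.

```shell
#!/bin/sh
# Local equivalents of the two alert conditions, useful for dry runs.
# Thresholds mirror the rule definitions; function names are illustrative.
is_stale() {
    # $1 = now (epoch seconds), $2 = last success (epoch seconds)
    [ $(( $1 - $2 )) -gt 93600 ]        # 26 h = 26 * 3600 s
}

is_size_regression() {
    # $1 = today's size in bytes, $2... = sizes of the previous days
    today=$1; shift
    [ "$#" -gt 0 ] || return 1          # no history, nothing to compare
    sum=0
    for s in "$@"; do sum=$(( sum + s )); done
    # today < 50% of the average  <=>  today * 2 * count < sum
    [ $(( today * 2 * $# )) -lt "$sum" ]
}
```

The size check multiplies instead of dividing so it stays in integer arithmetic.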
Cron entries installed by scripts/install-ionos.sh and a new scripts/install-hstgr.sh:
0 2 * * * /opt/p24-infra/scripts/backup-hstgr.sh # Hostinger
30 2 * * * /opt/p24-infra/scripts/backup-ionos.sh # IONOS
0 3 * * 0 /opt/p24-infra/scripts/backup-supabase.sh # weekly logical dump
Phase 3 — pruning + restore drill (0.5 d)
- Add n8n pruning env vars to hostinger/docker-compose.yml, restart container.
- Run backup-restore-drill.sh on a scratch host (your laptop or a temporary Hostinger container). Document any restore-path issues found.
- Add quarterly calendar reminder: “Restore drill — p24-infra issue + PR”.
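The core of the restore drill is the reverse of the backup pipeline: decrypt, decompress, unpack, then check that something actually came out. A minimal sketch, assuming the archive is a zstd-compressed tar encrypted with age; the function name and directory layout are hypothetical, and the real drill script may differ.

```shell
#!/bin/sh
# Restore-drill sketch: decrypt, decompress, and unpack one backup into a
# scratch directory, then fail if nothing was extracted. File layout and
# function name are assumptions; the real drill script may differ.
unpack_backup() {
    # $1 = encrypted archive (*.tar.zst.age), $2 = age identity file, $3 = scratch dir
    mkdir -p "$3" || return 1
    age -d -i "$2" "$1" | zstd -dc | tar -xf - -C "$3" || return 1
    # An empty tree means the backup was empty or the pipeline broke upstream.
    [ -n "$(find "$3" -type f | head -n 1)" ]
}

# Usage (paths are examples):
#   unpack_backup backup.tar.zst.age /root/.age/backup.key /tmp/restore-drill
```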
Acceptance criteria
- aws s3 ls s3://ecotrans-backups/vps-h1/ shows entries for the last 3 nights (run after 3 days)
- aws s3 ls s3://ecotrans-backups/vps-i1/ same
- Backup files are encrypted (file backup.tar.zst.age → “age encrypted file”)
- Grafana dashboard “Backups” shows last_success_age_hours < 26 for both VPSes
- scripts/backup-restore-drill.sh exits 0 and prints RESTORE OK — 12 workflows, 8 credentials, 1 WAHA session
- Stopping the backup script and waiting 27 h triggers a BackupStale Discord alert + GH issue
- docs/runbook.md has a new “## Alert: BackupStale” section
- n8n_data volume size <1.5 GB after pruning takes effect (verify after 7 days)
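The "last 3 nights" criteria can be checked mechanically rather than by eye. The sketch below reads a recursive listing on stdin and looks for one key per night under the host prefix. It assumes the `<host>/<YYYY-MM-DD>/` key layout from the architecture section and GNU date; the function name is illustrative.

```shell
#!/bin/sh
# Check that a recursive bucket listing contains a backup for each of the
# last N nights. Reads `aws s3 ls --recursive` output on stdin.
# Requires GNU date (-d); the key layout <host>/<YYYY-MM-DD>/ is from the spec.
has_last_nights() {
    # $1 = host prefix (e.g. vps-h1), $2 = number of nights, $3 = reference date
    listing=$(cat)
    n=0
    while [ "$n" -lt "$2" ]; do
        day=$(date -u -d "$3 $n days ago" +%F) || return 2
        printf '%s\n' "$listing" | grep -qF " $1/$day/" || {
            echo "missing backup: $1/$day" >&2
            return 1
        }
        n=$(( n + 1 ))
    done
}

# Usage:
#   aws --endpoint-url https://s3.eu-central-1.wasabisys.com \
#       s3 ls s3://ecotrans-backups/ --recursive |
#       has_last_nights vps-h1 3 "$(date -u +%F)"
```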
Cost impact
| Item | Cost |
|---|---|
| Wasabi storage (45 GB peak × 5.99 €/TB) | ~0.30 €/month |
| Wasabi requests | included |
| Bucket creation | one-off |
Total: ~0.30 €/month. Effectively free.
Back-out plan
- Remove cron entries from both VPSes.
- Delete ecotrans-backups bucket contents and bucket itself.
- Revert the n8n pruning env vars in hostinger/docker-compose.yml (only if they caused issues; pruning is independently valuable).
- Backup tarballs remaining on Wasabi cost €0 once deleted; no other artefacts.
No data loss from back-out — we’re only removing new backups, not changing existing service state (except n8n pruning, which is reversible by toggling the env var).
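Removing the cron entries can be done with a filter over the existing crontab, which also allows a preview before anything is changed. A minimal sketch, assuming all backup jobs live under /opt/p24-infra/scripts/ with a `backup-` prefix as in this spec; the function name is hypothetical.

```shell
#!/bin/sh
# Drop every crontab line that runs one of the backup scripts.
# Works as a filter so it can be previewed before touching the real crontab.
strip_backup_cron() {
    # grep exits 1 when no lines remain; treat that as success, keep real errors.
    grep -v '/opt/p24-infra/scripts/backup-' || [ $? -eq 1 ]
}

# Preview:  crontab -l | strip_backup_cron
# Apply:    crontab -l | strip_backup_cron | crontab -
```

The same filter can make installation repeatable: strip the old entries first, then append the new ones, so re-running the install step does not duplicate lines.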
Risks / open questions
- Q: Should we also back up the Supabase project metadata (RLS policies, edge functions)? A: Yes, but separate spec. Supabase CLI has db pull which gives us a full schema dump. Add to a later PR.
- Q: What about the et-operational-platform repo itself? A: It’s on GitHub; that’s the backup. We just need to verify the GitHub org has 2FA enforced + recovery codes saved.
- Risk: A backup script bug could silently produce empty tarballs. Mitigation: BackupSizeRegression alert + nightly file-listing of tarball contents in Discord (just the file count + total bytes, not the contents).
- Risk: age private key loss = backups unreadable. Mitigation: store key in 1Password and on each VPS and locally. If we lose all three we have bigger problems.
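The empty-tarball mitigation needs a summary that reveals size without revealing contents. One possible shape, run on the tar before it is compressed and encrypted: count the non-directory entries and report the archive size. The function name is hypothetical, and "bytes" here means the size of the tar file itself, which is an assumption about what the Discord message should carry.

```shell
#!/bin/sh
# Summarise an (unencrypted) tar archive as "files=<n> bytes=<size>" so the
# nightly Discord message can expose an empty backup without leaking contents.
summarise_tarball() {
    # $1 = path to a .tar file
    listing=$(tar -tf "$1") || return 1
    if [ -z "$listing" ]; then
        files=0
    else
        # Directory entries end in "/"; count everything else.
        files=$(printf '%s\n' "$listing" | grep -vc '/$') || true
    fi
    bytes=$(wc -c < "$1" | tr -d ' ')
    echo "files=$files bytes=$bytes"
}
```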
Bootstrap (post-merge deployment)
The artifacts PR (#56) ships scripts + alert rules only. Deployment is the manual checklist below. Run from your dev laptop (radieu, Windows) with PowerShell + Git Bash + gh + aws-cli. Estimated time: ~45 min end-to-end.
Step 1 — Create Wasabi bucket ecotrans-backups
Wasabi console: https://console.wasabisys.com → Buckets → Create Bucket.
- Name: ecotrans-backups
- Region: eu-central-1 (Amsterdam in Wasabi’s region naming), same region as ecotrans-monitoring
- Bucket logging: off
- Object versioning: off (we manage retention via lifecycle rules)
- Object lock: off
- Default encryption: AES-256 (server-side, on)
Then add a lifecycle rule from the bucket’s Policies → Lifecycle tab:
- Rule name: retention
- Prefix: (empty, applies to whole bucket)
- Action: Delete current versions after 30 days, except prefixes vps-i1/supabase-weekly/ and vps-*/monthly/ (override to 365 days for those)
Wasabi docs: https://docs.wasabi.com/v1/docs/lifecycle-configuration
Step 2 — Create IAM user scoped to ecotrans-backups/*
Wasabi console → IAM → Users → Create User → name p24-backup-writer → Programmatic access.
Attach an inline policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::ecotrans-backups",
"arn:aws:s3:::ecotrans-backups/*"
]
}
]
}
Save the access/secret key pair to 1Password under p24-infra / Wasabi / p24-backup-writer.
Step 3 — Generate age keypair on dev machine
On your Windows dev box (Git Bash or WSL):
mkdir -p ~/.age
age-keygen -o ~/.age/backup-personal.key
chmod 600 ~/.age/backup-personal.key
grep -oP 'public key: \K.*' ~/.age/backup-personal.key
Example key file contents (do not use this value; it’s illustrative):
# created: 2026-05-12T15:42:01+02:00
# public key: age1qx4zlpc6kxak7s9d4d4qz8xdt5ny5gdv90nsfrwlhdp99eq4j2eqf3w0qz
AGE-SECRET-KEY-1G8R...
Save the secret key content to 1Password under p24-infra / age / dev-personal. Save the public key; you’ll commit it later under spec 03 sops setup.
Step 4 — Generate age keypair on each VPS
For each VPS (vps-i1 = 217.154.82.162, vps-h1 = 72.60.32.61):
ssh root@<vps> "mkdir -p /root/.age && \
age-keygen -o /root/.age/backup.key && \
chmod 600 /root/.age/backup.key && \
grep -oP 'public key: \K.*' /root/.age/backup.key"
Capture each public key. Save the private key contents (full file body) to 1Password under p24-infra / age / vps-i1 and p24-infra / age / vps-h1. Do not download the file unencrypted to your laptop.
Note: in the scripts we treat $AGE_PUBKEY as the recipient for encryption. Each VPS encrypts to its own public key + your personal key (we’ll add the personal one when spec 03 lands a recipients file). For now, set $AGE_PUBKEY on each VPS to the VPS’s own public key so it can encrypt for itself; humans recover with the matching private key from 1Password.
Step 5 — Recipient list (parked until spec 03)
For now keep the 3 public keys (dev, vps-i1, vps-h1) in 1Password. Spec 03 (sops setup) will commit them to monitoring/.age-recipients.txt so any backup is decryptable by any of the three identities.
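Once a recipients file exists, encrypting to all three identities is a single flag on age. The sketch below compresses a tar stream and encrypts it to every key listed in the file; the function name is illustrative, and the file path in the usage note is the one spec 03 is expected to add, so it does not exist yet.

```shell
#!/bin/sh
# Compress a tar stream and encrypt it to every key in a recipients file,
# so any one of the matching identities can decrypt. The file path is the
# one planned for spec 03; the function name is illustrative.
encrypt_stream() {
    # stdin = tar stream, $1 = recipients file (one age1... key per line), $2 = output file
    [ -s "$1" ] || { echo "recipients file missing or empty: $1" >&2; return 1; }
    zstd -19 -c | age -R "$1" -o "$2"
}

# Usage (paths are examples):
#   tar -cf - -C /var/lib/docker/volumes root_waha_sessions |
#       encrypt_stream monitoring/.age-recipients.txt backup.tar.zst.age
```

Compression runs before encryption because encrypted output is effectively incompressible.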
Step 6 — Push GitHub Secrets
gh secret set WASABI_BACKUP_ACCESS_KEY --repo radieu/p24-infra
gh secret set WASABI_BACKUP_SECRET_KEY --repo radieu/p24-infra
gh secret set N8N_API_KEY --repo radieu/p24-infra
gh secret set GRAFANA_API_TOKEN --repo radieu/p24-infra
gh secret set SUPABASE_DB_PASSWORD --repo radieu/p24-infra
gh secret set AGE_PUBKEY --repo radieu/p24-infra # dev pubkey (recovery recipient)
Each command prompts for the value; paste from 1Password.
How to obtain each value:
- N8N_API_KEY: n8n UI → Settings → API → Create API key. Scope: read+write workflows.
- GRAFANA_API_TOKEN: Grafana UI → Administration → Service Accounts → New service account backup-reader → Add token. Role: Admin (we need to list+read all dashboards).
- SUPABASE_DB_PASSWORD: Supabase dashboard → Settings → Database → Connection string → Direct connection → password. Already in 1Password.
Step 7 — Install scripts on each VPS
From your dev box, sync the scripts:
# vps-h1 (Hostinger)
ssh root@72.60.32.61 'cd /opt/p24-infra && git pull --ff-only origin main'
# vps-i1 (IONOS)
ssh root@217.154.82.162 'cd /opt/p24-infra && git pull --ff-only origin main'
(The repo lives at /opt/p24-infra on both VPSes per CLAUDE.md; the PreToolUse hook will keep them current after this.)
Then provision the env file on each host:
# vps-h1
ssh root@72.60.32.61 'cat > /root/.backup-env' <<'EOF'
AGE_PUBKEY=age1... # vps-h1 own pubkey (Step 4)
WASABI_BACKUP_ACCESS_KEY=... # from Step 2
WASABI_BACKUP_SECRET_KEY=...
P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL=https://discord.com/api/webhooks/...
N8N_API_KEY=... # from n8n UI
EOF
ssh root@72.60.32.61 'chmod 600 /root/.backup-env'
# vps-i1: analogous, plus MYSQL_ROOT_PASSWORD, GRAFANA_API_TOKEN, SUPABASE_DB_PASSWORD
Make sure each VPS has the required CLI tooling installed:
# Both VPSes need: age, zstd, aws-cli, jq (used by drill), python3 (used by ionos)
# Ubuntu (vps-h1):
ssh root@72.60.32.61 'apt-get update && apt-get install -y age zstd awscli jq python3'
# AlmaLinux (vps-i1):
ssh root@217.154.82.162 'dnf install -y age zstd awscli jq python3 postgresql' # postgresql for pg_dump
Step 8 — Install cron entries
# vps-h1 — nightly Hostinger backup at 02:00 UTC
ssh root@72.60.32.61 'crontab -l 2>/dev/null | { cat; echo "0 2 * * * /opt/p24-infra/scripts/backup-hstgr.sh >> /var/log/p24-backup.log 2>&1"; } | crontab -'
# vps-i1 — nightly IONOS backup at 02:30 UTC + weekly Supabase dump Sundays 03:00 UTC
ssh root@217.154.82.162 'crontab -l 2>/dev/null | { cat; echo "30 2 * * * /opt/p24-infra/scripts/backup-ionos.sh >> /var/log/p24-backup.log 2>&1"; echo "0 3 * * 0 /opt/p24-infra/scripts/backup-supabase.sh >> /var/log/p24-backup.log 2>&1"; } | crontab -'
Verify:
ssh root@72.60.32.61 'crontab -l | grep backup'
ssh root@217.154.82.162 'crontab -l | grep backup'
Step 9 — First manual runs
ssh root@72.60.32.61 '/opt/p24-infra/scripts/backup-hstgr.sh'
ssh root@217.154.82.162 '/opt/p24-infra/scripts/backup-ionos.sh'
ssh root@217.154.82.162 '/opt/p24-infra/scripts/backup-supabase.sh'
Verify files appear in Wasabi:
AWS_ACCESS_KEY_ID=... \
AWS_SECRET_ACCESS_KEY=... \
aws --endpoint-url https://s3.eu-central-1.wasabisys.com \
  s3 ls s3://ecotrans-backups/ --recursive
Expected output (one entry per host per day, plus the supabase weekly):
2026-05-12 02:00:34 45123456 vps-h1/2026-05-12/backup.tar.zst.age
2026-05-12 02:30:21 13256789 vps-i1/2026-05-12/backup.tar.zst.age
2026-05-12 03:00:08 2456789 vps-i1/supabase-weekly/2026-W19/supabase.sql.gz.age
Step 10 — After 24h: verify Grafana “Last backup age” panel
The BackupStale alert keys off the backup_last_success_timestamp textfile metric. Once node_exporter has scraped that file (within 30s of the script writing it), the metric appears in Prometheus.
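For reference, writing a textfile metric amounts to producing a small file in Prometheus text format inside the directory node_exporter watches. A minimal sketch, assuming a `host` label and a caller-supplied directory (it must match node_exporter's `--collector.textfile.directory`); the function name and file name are illustrative, not taken from the actual scripts.

```shell
#!/bin/sh
# Write the success timestamp in Prometheus text format for node_exporter's
# textfile collector. The directory must match --collector.textfile.directory;
# the path and the "host" label below are assumptions.
write_success_metric() {
    # $1 = textfile directory, $2 = host label, $3 = epoch seconds (default: now)
    ts=${3:-$(date +%s)}
    tmp="$1/backup.prom.$$"
    {
        echo '# HELP backup_last_success_timestamp Unix time of the last successful backup.'
        echo '# TYPE backup_last_success_timestamp gauge'
        echo "backup_last_success_timestamp{host=\"$2\"} $ts"
    } > "$tmp" || return 1
    mv "$tmp" "$1/backup.prom"    # rename is atomic, so a scrape never sees a partial file
}
```

The temporary name does not end in `.prom`, so the collector ignores it until the rename.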
Add a Grafana panel (Backups dashboard, separate PR):
(time() - backup_last_success_timestamp) / 3600 # hours since last success, per host
Green threshold: <24h. Red: >26h (alert fires).
Step 11 — After 7 days of pruning: verify n8n_data volume size
ssh root@72.60.32.61 'du -sh /var/lib/docker/volumes/n8n_data'
Expected: under 1.5 GB (down from the current 4.2 GB, since EXECUTIONS_DATA_PRUNE purges executions older than 168h).
If it’s still bloated:
# Confirm pruning is configured (env vars present in the container)
ssh root@72.60.32.61 'docker exec root-n8n-1 env | grep EXECUTIONS_DATA'
# Force one manual prune (n8n does it every hour, but you can trigger by restart)
ssh root@72.60.32.61 'cd /root && docker compose restart n8n'