Spec 01 — Backups for stateful services

Purpose

Today, only Prometheus metrics survive a VPS loss (Thanos → Wasabi). Everything else — n8n workflows, WAHA WhatsApp session, Grafana dashboards, Traccar history, Caddy/Traefik certs — exists in exactly one place. If either VPS goes away, automations stop, certs need re-issuing (with rate-limit risk), and the WhatsApp gateway needs a phone-side QR scan to recover.

Wasabi is already paid for, has free egress, and our total stateful footprint is small. There’s no good reason to keep operating without backups.


Answer: do we back up whole volumes? How much data?

Short answer: No. We mostly use service-native exports (REST API, mysqldump, pg_dump, SQLite .backup) plus a tar of the volume for state that has no good export path (WAHA session, Uptime Kuma SQLite, Traefik/Caddy ACME state). Total nightly footprint is an estimated 0.8–1.5 GB compressed.

Measured volume sizes (2026-05-11)

Hostinger VPS (vps-h1):

| Volume | Size | Strategy |
| --- | --- | --- |
| n8n_data | 4.2 GB | pg_dump-style SQLite dump (n8n uses SQLite by default) + nightly REST API export of workflows/credentials |
| root_waha_sessions | 11 MB | Tar the volume; no API export for session state |
| traefik_data | 64 KB | Tar acme.json; small, irreplaceable (cert keys) |

IONOS VPS (vps-i1):

| Volume | Size | Strategy |
| --- | --- | --- |
| monitoring_prometheus_data | 67 MB | Skip; Thanos already ships to Wasabi |
| monitoring_grafana_data | 100 MB | Grafana API export (dashboards, datasources, folders, users); restore is JSON-driven |
| traccar_traccar-db | 209 MB | mysqldump |
| monitoring_caddy_data | 132 KB | Tar (certs + ACME state) |
| monitoring_uptime_kuma_data | <50 MB | Tar the volume; SQLite + uploaded icons; Kuma 1.x has no clean API export path (see spec 07) |
| monitoring_loki_data | ~2 GB at steady state | Skip; 14-day rolling buffer of container logs (spec 02); reconstitutable, low value after 14 d |
| /root/openclaw | 385 MB | Skip; it’s a git checkout + node_modules; rebuilt from radieu/openclaw |
| Supabase Pro | n/a | Rely on built-in PITR (7 days) + weekly pg_dump to Wasabi for cold storage |

Totals:

  • Raw: ~4.7 GB (most of it n8n executions table — see §pruning below)
  • Compressed (zstd -19): est. 800 MB – 1.5 GB per night
  • 30-day retention: ~24–45 GB (30 × 0.8–1.5 GB)
  • Wasabi cost at 5.99 €/TB/month: ~0.15–0.27 €/month. Negligible.
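
The retention and cost totals are plain arithmetic on the nightly estimate. A minimal sketch, assuming decimal units (1 TB = 1000 GB) and flat per-TB pricing with no minimum charge; the function name estimate_storage is illustrative:

```shell
# Estimate retained volume and monthly storage cost.
# Assumptions (not measured): 1 TB = 1000 GB, flat per-TB pricing.
estimate_storage() {
  # $1 = nightly compressed size in GB, $2 = retention in days, $3 = EUR per TB-month
  LC_ALL=C awk -v gb="$1" -v days="$2" -v price="$3" 'BEGIN {
    total = gb * days
    printf "%.0f GB retained, %.2f EUR/month\n", total, total / 1000 * price
  }'
}

estimate_storage 0.8 30 5.99   # lower bound of the nightly estimate
estimate_storage 1.5 30 5.99   # upper bound of the nightly estimate
```

With the nightly range above this comes out at 24 GB / 0.14 EUR and 45 GB / 0.27 EUR. The 12 monthly copies from the retention rule are not included.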

Why not just tar n8n_data whole?

You can, but:

  1. n8n’s SQLite (database.sqlite) requires either a quiesce (stop the container — 30 s downtime) or a .backup command via the sqlite3 CLI inside the container. A raw tar of a live SQLite file risks a corrupt restore.
  2. The 4.2 GB is dominated by the execution_entity table (run history). Workflow definitions + credentials are kilobytes. We don’t actually need 4 GB of execution history to recover — we need the workflows.
  3. n8n’s REST API gives us decoupled restore: we can import workflows into a fresh n8n instance on a different host without volume-level surgery.

Decision: dump SQLite via .backup (consistent snapshot, no downtime), and export workflows/credentials via REST API. Two backups; the API one is the recovery path; the SQLite one is the safety net.
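
A rough sketch of the .backup half of that decision. It assumes the sqlite3 CLI is present inside the n8n container and that the database sits at n8n's default path; both need checking against the actual image. The function name and the /tmp snapshot path are illustrative:

```shell
# Take a consistent snapshot of n8n's SQLite database without stopping n8n.
# Assumptions: sqlite3 exists inside the container, and the database lives
# at /home/node/.n8n/database.sqlite (n8n's default data directory).
snapshot_n8n_sqlite() {
  # $1 = container name, $2 = destination file on the host
  container=$1
  dest=$2
  db=/home/node/.n8n/database.sqlite
  snap=/tmp/n8n-snapshot.sqlite

  # .backup uses SQLite's online backup API, so concurrent writers are safe.
  docker exec "$container" sqlite3 "$db" ".backup '$snap'" || return 1

  # Refuse to ship a snapshot that fails SQLite's own consistency check.
  check=$(docker exec "$container" sqlite3 "$snap" "PRAGMA integrity_check;") || return 1
  [ "$check" = "ok" ] || { echo "integrity_check failed: $check" >&2; return 1; }

  # Copy the snapshot out of the container, then remove the in-container copy.
  docker cp "$container:$snap" "$dest" || return 1
  docker exec "$container" rm -f "$snap"
}
```

Running the integrity check on the snapshot rather than on the live file means a bad dump is caught before it is encrypted and uploaded.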

Pruning (companion task)

Configure n8n execution pruning so the SQLite stays under 1 GB:

N8N_DEFAULT_BINARY_DATA_MODE: filesystem
EXECUTIONS_DATA_PRUNE: "true"
EXECUTIONS_DATA_MAX_AGE: "168"          # 7 days
EXECUTIONS_DATA_PRUNE_MAX_COUNT: 10000

This is a separate, smaller change but should land in the same PR because it materially affects backup size.


Rulebook (operating rules)

  1. Restore drill quarterly. Run scripts/backup-restore-drill.sh (created by this spec) on a scratch Docker host. Restore must produce a working n8n with all current workflows visible. If it fails, treat as P1.
  2. Backup failures page within 26 h. A host with no successful backup for more than 26 h (the BackupStale threshold: the 24 h cadence plus 2 h of slack) opens a severity: critical Discord alert + GH issue. No silent failures.
  3. Never restore over production blindly. Always restore into a scratch container first; diff against current state.
  4. Encryption at rest. Wasabi bucket has server-side encryption on. Backup tarballs are additionally encrypted client-side with age using a key stored in 1Password (out-of-band — see spec 03). If Wasabi creds leak, backups are still confidential.
  5. Retention: 30 daily + 12 monthly. Beyond that, delete — old backups of fast-moving config are not useful.

Architecture

┌─ vps-h1 (Hostinger) ───────────────────────────────────┐
│  /opt/p24-infra/scripts/backup-hstgr.sh  (cron 02:00)  │
│    ├─ docker exec n8n sqlite3 ... .backup → /tmp       │
│    ├─ curl n8n REST → workflows.json, credentials.json │
│    ├─ tar root_waha_sessions, traefik_data             │
│    ├─ age -r $AGE_PUBKEY -o backup.tar.zst.age         │
│    └─ aws s3 cp s3://ecotrans-backups/vps-h1/YYYY-MM-DD/│
└────────────────────────────────────────────────────────┘
                          ↓
                  ┌──────────────┐
                  │ Wasabi S3    │
                  │ ecotrans-    │
                  │  backups     │
                  └──────────────┘
                          ↑
┌─ vps-i1 (IONOS) ───────────────────────────────────────┐
│  /opt/p24-infra/scripts/backup-ionos.sh  (cron 02:30)  │
│    ├─ mysqldump traccar-db                             │
│    ├─ curl grafana API → dashboards.json               │
│    ├─ pg_dump supabase (weekly only, Sundays)          │
│    ├─ tar monitoring_caddy_data                        │
│    ├─ zstd compress, age encrypt                       │
│    └─ aws s3 cp s3://ecotrans-backups/vps-i1/YYYY-MM-DD/│
└────────────────────────────────────────────────────────┘
                          ↓
                Prometheus pushgateway
                          ↓
                Grafana panel: "Last successful backup age"
                          ↓
                Alert: BackupStale (>26h)
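
The last three steps of each host's flow (archive, compress, encrypt, upload) could be sketched as below. This is an assumption-laden outline, not the shipped encrypt_and_upload(): WASABI_ENDPOINT is a hypothetical variable, and temporary files are used instead of a pipe so that each stage's exit status is checked.

```shell
# Compress, encrypt, and upload one staging directory.
# Order matters: compress first, because encrypted data does not compress.
# Assumptions: AGE_PUBKEY is set, aws-cli has credentials configured, and
# WASABI_ENDPOINT is a hypothetical variable naming the S3 endpoint URL.
encrypt_and_upload() {
  # $1 = staging directory holding the dumps, $2 = host label (e.g. vps-h1)
  staging=$1
  host=$2
  day=$(date -u +%Y-%m-%d)
  work=$(mktemp -d) || return 1

  tar -C "$staging" -cf "$work/backup.tar" . &&
    zstd -19 -q "$work/backup.tar" -o "$work/backup.tar.zst" &&
    age -r "$AGE_PUBKEY" -o "$work/backup.tar.zst.age" "$work/backup.tar.zst" &&
    aws --endpoint-url "$WASABI_ENDPOINT" s3 cp \
      "$work/backup.tar.zst.age" \
      "s3://ecotrans-backups/$host/$day/backup.tar.zst.age"
  status=$?

  rm -rf "$work"   # never leave plaintext archives behind
  return "$status"
}
```

Only the public key is needed on the host for this path; decryption happens elsewhere with the private key.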

Implementation plan

Phase 1 — provisioning (0.5 d)

  1. Create new Wasabi bucket ecotrans-backups (separate from ecotrans-monitoring so retention policies don’t collide).
  2. Generate an age keypair locally; public key committed to monitoring/.env.example, private key stored in 1Password + on each VPS at /root/.age/backup.key (mode 600).
  3. Add GH Secrets: WASABI_BACKUP_ACCESS_KEY, WASABI_BACKUP_SECRET_KEY (scoped IAM user limited to ecotrans-backups/*: put, get, and list, no delete).

Phase 2 — backup scripts (0.5 d)

Files to create:

  • scripts/backup-common.sh: shared functions log(), notify_discord(), push_metric(), encrypt_and_upload()
  • scripts/backup-hstgr.sh: runs on Hostinger
  • scripts/backup-ionos.sh: runs on IONOS
  • scripts/backup-supabase.sh: weekly Supabase logical dump, runs on IONOS (scheduled by the cron entries below)
  • scripts/backup-restore-drill.sh: restores the latest backup into a scratch Docker network for verification
  • monitoring/prometheus/rules/backups.yml: alert rules
    • BackupStale: (time() - backup_last_success_timestamp) > 93600 (26 h, severity: critical)
    • BackupSizeRegression: backup size dropped >50% vs 7-day average (severity: warning)
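
One possible shape for push_metric(), assuming the metric is exposed through node_exporter's textfile collector. TEXTFILE_DIR and the host label name are assumptions; the directory has to match node_exporter's --collector.textfile.directory flag.

```shell
# Publish the success timestamp for node_exporter's textfile collector.
# Assumptions: TEXTFILE_DIR matches node_exporter's
# --collector.textfile.directory flag; the "host" label name is illustrative.
push_metric() {
  # $1 = host label, $2 = unix timestamp of the successful run
  dir=${TEXTFILE_DIR:-/var/lib/node_exporter/textfile}
  tmp="$dir/backup.prom.$$"

  {
    echo '# HELP backup_last_success_timestamp Unix time of the last successful backup.'
    echo '# TYPE backup_last_success_timestamp gauge'
    printf 'backup_last_success_timestamp{host="%s"} %s\n' "$1" "$2"
  } > "$tmp" || return 1

  # Rename is atomic on the same filesystem, so a scrape never sees a partial file.
  mv "$tmp" "$dir/backup.prom"
}
```

A backup script would call it last, for example push_metric vps-h1 "$(date +%s)", so the timestamp only advances after a successful upload.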

Cron entries installed by scripts/install-ionos.sh and a new scripts/install-hstgr.sh:

0 2 * * * /opt/p24-infra/scripts/backup-hstgr.sh   # Hostinger
30 2 * * * /opt/p24-infra/scripts/backup-ionos.sh  # IONOS
0 3 * * 0 /opt/p24-infra/scripts/backup-supabase.sh # weekly logical dump

Phase 3 — pruning + restore drill (0.5 d)

  1. Add n8n pruning env vars to hostinger/docker-compose.yml, restart container.
  2. Run backup-restore-drill.sh on a scratch host (your laptop or a temporary Hostinger container). Document any restore-path issues found.
  3. Add quarterly calendar reminder: “Restore drill — p24-infra issue + PR”.

Acceptance criteria

  • aws s3 ls s3://ecotrans-backups/vps-h1/ shows entries for the last 3 nights (run after 3 days)
  • aws s3 ls s3://ecotrans-backups/vps-i1/ same
  • Backup files are encrypted (file backup.tar.zst.age → “age encrypted file”)
  • Grafana dashboard “Backups” shows last_success_age_hours < 26 for both VPSes
  • scripts/backup-restore-drill.sh exits 0 and prints RESTORE OK — 12 workflows, 8 credentials, 1 WAHA session
  • Stopping the backup script and waiting 27 h triggers a BackupStale Discord alert + GH issue
  • docs/runbook.md has a new “## Alert: BackupStale” section
  • n8n_data volume size <1.5 GB after pruning takes effect (verify after 7 days)
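
The first two criteria can be checked mechanically instead of by eye. A small sketch that reads an aws s3 ls --recursive listing and prints the nights with no backup object; the function name is illustrative and the key layout is assumed to be host/YYYY-MM-DD/backup.tar.zst.age:

```shell
# Report which of the given nights have no backup object for a host.
# Reads `aws s3 ls --recursive` output on stdin (the key is the 4th column).
missing_nights() {
  # $1 = host prefix (e.g. vps-h1); remaining args = days to check (YYYY-MM-DD)
  host=$1
  shift
  listing=$(cat)
  for day in "$@"; do
    key="$host/$day/backup.tar.zst.age"
    printf '%s\n' "$listing" |
      awk -v k="$key" '$4 == k { found = 1 } END { exit !found }' || echo "$day"
  done
}

# Usage (illustrative dates):
#   aws s3 ls s3://ecotrans-backups/vps-h1/ --recursive |
#     missing_nights vps-h1 2026-05-12 2026-05-11 2026-05-10
```

Empty output means every requested night is present.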

Cost impact

| Item | Cost |
| --- | --- |
| Wasabi storage (45 GB peak × 5.99 €/TB) | ~0.30 €/month |
| Wasabi requests | included |
| Bucket creation | one-off |

Total: ~0.30 €/month. Effectively free.


Back-out plan

  1. Remove cron entries from both VPSes.
  2. Delete ecotrans-backups bucket contents and bucket itself.
  3. Revert the n8n pruning env vars in hostinger/docker-compose.yml (only if they caused issues; pruning is independently valuable).
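  4. Once the bucket is deleted, no tarballs remain on Wasabi and the storage cost drops to €0 (check whether Wasabi’s minimum storage duration policy still bills early-deleted objects); no other artefacts.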

No data loss from back-out — we’re only removing new backups, not changing existing service state (except n8n pruning, which is reversible by toggling the env var).


Risks / open questions

  • Q: Should we also back up the Supabase project metadata (RLS policies, edge functions)? A: Yes, but separate spec — Supabase CLI has db pull which gives us a full schema dump. Add to a later PR.
  • Q: What about the et-operational-platform repo itself? A: It’s on GitHub — that’s the backup. We just need to verify the GitHub org has 2FA enforced + recovery codes saved.
  • Risk: A backup script bug could silently produce empty tarballs. Mitigation: BackupSizeRegression alert + nightly file-listing of tarball contents in Discord (just the file count + total bytes — not the contents).
  • Risk: age private key loss = backups unreadable. Mitigation: store key in 1Password and on each VPS and locally. If we lose all three we have bigger problems.
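
For the empty-tarball risk, a cheap guard can run on the plain archive before it is encrypted. A sketch, with an illustrative function name; it prints the file count and byte total that the mitigation mentions and fails when the archive holds no files:

```shell
# Summarise a plain (not yet encrypted) tar archive and reject empty ones.
# Prints "<entries> files, <bytes> bytes"; only this summary, never the
# file names, would be posted to Discord.
summarize_archive() {
  # $1 = path to an uncompressed tar archive
  # Directory entries end in "/", so they are excluded from the count.
  entries=$(tar -tf "$1" | grep -vc '/$') || true
  bytes=$(wc -c < "$1" | tr -d ' ')
  echo "$entries files, $bytes bytes"
  [ "$entries" -gt 0 ]
}
```

A non-zero return lets the calling script abort before upload, which catches the failure the same night rather than when the size-regression alert fires.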

Bootstrap (post-merge deployment)

The artifacts PR (#56) ships scripts + alert rules only. Deployment is the manual checklist below. Run from your dev laptop (radieu, Windows) with PowerShell + Git Bash + gh + aws-cli. Estimated time: ~45 min end-to-end.

Step 1 — Create Wasabi bucket ecotrans-backups

Wasabi console: https://console.wasabisys.com → Buckets → Create Bucket.

  • Name: ecotrans-backups
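  • Region: eu-central-1, the same region as ecotrans-monitoring. Note that Wasabi lists eu-central-1 as Amsterdam and eu-central-2 as Frankfurt; confirm which one ecotrans-monitoring uses and keep the endpoint URL in Step 9 consistent with it.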
  • Bucket logging: off
  • Object versioning: off (we manage retention via lifecycle rules)
  • Object lock: off
  • Default encryption: AES-256 (server-side, on)

Then add a lifecycle rule from the bucket’s Policies → Lifecycle tab:

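  • Rule name: retention
  • Prefix: (empty, applies to whole bucket)
  • Action: Delete current versions after 30 days. Objects under vps-i1/supabase-weekly/ and the per-host monthly/ prefixes (vps-h1/monthly/, vps-i1/monthly/) need 365 days instead. S3-style lifecycle filters match literal prefixes only (no wildcards, no exclusions), so these likely need their own rules; verify in the console that the bucket-wide 30-day rule does not also expire them.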

Wasabi docs: https://docs.wasabi.com/v1/docs/lifecycle-configuration

Step 2 — Create IAM user scoped to ecotrans-backups/*

Wasabi console → IAM → Users → Create User → name p24-backup-writer → Programmatic access.

Attach an inline policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::ecotrans-backups",
        "arn:aws:s3:::ecotrans-backups/*"
      ]
    }
  ]
}

Save the access/secret key pair to 1Password under p24-infra / Wasabi / p24-backup-writer.

Step 3 — Generate age keypair on dev machine

On your Windows dev box (Git Bash or WSL):

mkdir -p ~/.age
age-keygen -o ~/.age/backup-personal.key
chmod 600 ~/.age/backup-personal.key
grep -oP 'public key: \K.*' ~/.age/backup-personal.key

Example key file contents (the grep above prints only the public key; do not use this value, it is illustrative):

# created: 2026-05-12T15:42:01+02:00
# public key: age1qx4zlpc6kxak7s9d4d4qz8xdt5ny5gdv90nsfrwlhdp99eq4j2eqf3w0qz
AGE-SECRET-KEY-1G8R...

Save the secret key content to 1Password under p24-infra / age / dev-personal. Save the public key — you’ll commit it later under spec 03 sops setup.
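
Not every grep build supports -P. A portable sketch for pulling the public key out of the key file with sed; the function name is illustrative. age-keygen -y <keyfile> is another way to get the same value, derived from the secret key itself.

```shell
# Print the public key (recipient) recorded in an age identity file.
# Portable alternative to grep -P, which some grep builds lack.
age_pubkey_from_file() {
  # $1 = path to a key file written by age-keygen -o
  sed -n 's/^# public key: //p' "$1"
}

# Usage:
#   age_pubkey_from_file ~/.age/backup-personal.key
```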

Step 4 — Generate age keypair on each VPS

For each VPS (vps-i1 = 217.154.82.162, vps-h1 = 72.60.32.61):

ssh root@<vps> "mkdir -p /root/.age && \
  age-keygen -o /root/.age/backup.key && \
  chmod 600 /root/.age/backup.key && \
  grep -oP 'public key: \K.*' /root/.age/backup.key"

Capture each public key. Save the private key contents (full file body) to 1Password under p24-infra / age / vps-i1 and p24-infra / age / vps-h1. Do not download the file unencrypted to your laptop.

Note: the scripts treat $AGE_PUBKEY as the encryption recipient. The target state is that each VPS encrypts to its own public key plus your personal key (the personal one is added when spec 03 lands a recipients file). For now, set $AGE_PUBKEY on each VPS to that VPS’s own public key so it can encrypt for itself; humans recover with the matching private key from 1Password.

Step 5 — Recipient list (parked until spec 03)

For now keep the 3 public keys (dev, vps-i1, vps-h1) in 1Password. Spec 03 (sops setup) will commit them to monitoring/.age-recipients.txt so any backup is decryptable by any of the three identities.

Step 6 — Push GitHub Secrets

gh secret set WASABI_BACKUP_ACCESS_KEY --repo radieu/p24-infra
gh secret set WASABI_BACKUP_SECRET_KEY --repo radieu/p24-infra
gh secret set N8N_API_KEY              --repo radieu/p24-infra
gh secret set GRAFANA_API_TOKEN        --repo radieu/p24-infra
gh secret set SUPABASE_DB_PASSWORD     --repo radieu/p24-infra
gh secret set AGE_PUBKEY               --repo radieu/p24-infra  # dev pubkey (recovery recipient)

Each command prompts for the value — paste from 1Password.

How to obtain each value:

  • N8N_API_KEY — n8n UI → Settings → API → Create API key. Scope: read+write workflows.
  • GRAFANA_API_TOKEN — Grafana UI → Administration → Service Accounts → New service account backup-reader → Add token. Role: Admin (we need to list+read all dashboards).
  • SUPABASE_DB_PASSWORD — Supabase dashboard → Settings → Database → Connection string → Direct connection → password. Already in 1Password.

Step 7 — Install scripts on each VPS

From your dev box, sync the scripts:

# vps-h1 (Hostinger)
ssh root@72.60.32.61 'cd /opt/p24-infra && git pull --ff-only origin main'
 
# vps-i1 (IONOS)
ssh root@217.154.82.162 'cd /opt/p24-infra && git pull --ff-only origin main'

(The repo lives at /opt/p24-infra on both VPSes per CLAUDE.md; the PreToolUse hook will keep them current after this.)

Then provision the env file on each host:

# vps-h1
ssh root@72.60.32.61 'umask 077 && cat > /root/.backup-env' <<'EOF'
AGE_PUBKEY=age1...                                              # vps-h1 own pubkey (Step 4)
WASABI_BACKUP_ACCESS_KEY=...                                    # from Step 2
WASABI_BACKUP_SECRET_KEY=...
P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL=https://discord.com/api/webhooks/...
N8N_API_KEY=...                                                 # from n8n UI
EOF
ssh root@72.60.32.61 'chmod 600 /root/.backup-env'
 
# vps-i1 — analogous, plus MYSQL_ROOT_PASSWORD, GRAFANA_API_TOKEN, SUPABASE_DB_PASSWORD
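
Before the first run it is worth confirming that no placeholder survived in the env file. A sketch of such a check, assuming the file is plain KEY=value lines with optional trailing comments as in the heredoc above; the function name is illustrative:

```shell
# Verify a backup env file: owner-only permissions and no empty or
# placeholder values. Assumption: plain KEY=value lines, optionally
# followed by whitespace and a "# comment".
check_env_file() {
  # $1 = env file path; remaining args = required variable names
  file=$1
  shift
  rc=0

  # The first 10 characters of `ls -l` are the type and permission bits.
  perms=$(ls -l "$file" | cut -c1-10)
  [ "$perms" = "-rw-------" ] || { echo "bad permissions: $perms"; rc=1; }

  for name in "$@"; do
    # Take the value after NAME=, dropping any trailing comment.
    value=$(sed -n "s/^$name=//p" "$file" | sed 's/[[:space:]][[:space:]]*#.*$//')
    case "$value" in
      '' | *...) echo "missing or placeholder: $name"; rc=1 ;;
    esac
  done
  return "$rc"
}

# Usage on vps-h1:
#   check_env_file /root/.backup-env AGE_PUBKEY WASABI_BACKUP_ACCESS_KEY \
#     WASABI_BACKUP_SECRET_KEY N8N_API_KEY
```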

Make sure each VPS has the required CLI tooling installed:

# Both VPSes need: age, zstd, aws-cli, jq (used by drill), python3 (used by ionos)
# Ubuntu (vps-h1):
ssh root@72.60.32.61 'apt-get update && apt-get install -y age zstd awscli jq python3'
# AlmaLinux (vps-i1):
ssh root@217.154.82.162 'dnf install -y age zstd awscli jq python3 postgresql' # postgresql for pg_dump
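
Package names differ between distributions, so a quick check that the binaries actually landed on PATH can save a failed first run. A minimal sketch with an illustrative function name:

```shell
# List required CLI tools that are not on PATH (prints nothing if all exist).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage on either VPS (aws is the binary the awscli package installs):
#   missing_tools age zstd aws jq python3
```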

Step 8 — Install cron entries

# vps-h1 — nightly Hostinger backup at 02:00 UTC
ssh root@72.60.32.61 'crontab -l 2>/dev/null | { cat; echo "0 2 * * * /opt/p24-infra/scripts/backup-hstgr.sh >> /var/log/p24-backup.log 2>&1"; } | crontab -'
 
# vps-i1 — nightly IONOS backup at 02:30 UTC + weekly Supabase dump Sundays 03:00 UTC
ssh root@217.154.82.162 'crontab -l 2>/dev/null | { cat; echo "30 2 * * * /opt/p24-infra/scripts/backup-ionos.sh >> /var/log/p24-backup.log 2>&1"; echo "0 3 * * 0 /opt/p24-infra/scripts/backup-supabase.sh >> /var/log/p24-backup.log 2>&1"; } | crontab -'

Verify:

ssh root@72.60.32.61 'crontab -l | grep backup'
ssh root@217.154.82.162 'crontab -l | grep backup'
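
The append pattern above adds a second copy of the line if it is run twice. If the step may be repeated, a filter like the following keeps the crontab free of duplicates; this is a sketch with an illustrative function name, not part of the shipped install scripts.

```shell
# Merge one cron line into an existing crontab without duplicating it.
# Reads the current crontab on stdin and writes the merged one to stdout.
merge_cron_line() {
  # $1 = full cron line that must be present exactly once
  # grep -F -x -v drops exact whole-line matches; "|| true" covers the
  # case where no other lines remain.
  grep -Fxv -- "$1" || true
  printf '%s\n' "$1"
}

# Usage on vps-h1:
#   crontab -l 2>/dev/null |
#     merge_cron_line "0 2 * * * /opt/p24-infra/scripts/backup-hstgr.sh >> /var/log/p24-backup.log 2>&1" |
#     crontab -
```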

Step 9 — First manual runs

ssh root@72.60.32.61   '/opt/p24-infra/scripts/backup-hstgr.sh'
ssh root@217.154.82.162 '/opt/p24-infra/scripts/backup-ionos.sh'
ssh root@217.154.82.162 '/opt/p24-infra/scripts/backup-supabase.sh'

Verify files appear in Wasabi:

AWS_ACCESS_KEY_ID=...                  \
AWS_SECRET_ACCESS_KEY=...              \
aws --endpoint-url https://s3.eu-central-1.wasabisys.com \
    s3 ls s3://ecotrans-backups/ --recursive

Expected output (one entry per host per day, plus the supabase weekly):

2026-05-12 02:00:34   45123456 vps-h1/2026-05-12/backup.tar.zst.age
2026-05-12 02:30:21   13256789 vps-i1/2026-05-12/backup.tar.zst.age
2026-05-12 03:00:08    2456789 vps-i1/supabase-weekly/2026-W19/supabase.sql.gz.age

Step 10 — After 24h: verify Grafana “Last backup age” panel

The BackupStale alert keys off the backup_last_success_timestamp textfile metric. node_exporter reads that file on every scrape, so the metric appears in Prometheus at the next scrape after the script writes it (within about 30 s).

Add a Grafana panel (Backups dashboard, separate PR):

(time() - backup_last_success_timestamp) / 3600   # hours since last success, per host

Green threshold: <24h. Red: >26h (alert fires).
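
The same thresholds can be applied from a shell when Grafana is not at hand. A small sketch; the function name is illustrative, and treating the 24 to 26 h gap as an amber band is an assumption, since only green and red are defined above.

```shell
# Classify a backup's age using the panel thresholds.
backup_age_status() {
  # $1 = unix timestamp of last success, $2 = current unix timestamp
  age_s=$(( $2 - $1 ))
  hours=$(( age_s / 3600 ))
  if [ "$age_s" -gt 93600 ]; then      # > 26 h: BackupStale fires
    echo "red ${hours}h"
  elif [ "$age_s" -lt 86400 ]; then    # < 24 h
    echo "green ${hours}h"
  else
    echo "amber ${hours}h"             # assumption: 24-26 h is a warning band
  fi
}

# Usage: backup_age_status "$last_success_ts" "$(date +%s)"
```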

Step 11 — After 7 days of pruning: verify n8n_data volume size

ssh root@72.60.32.61 'du -sh /var/lib/docker/volumes/n8n_data'

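Expected: under 1.5 GB (down from the current 4.2 GB, since EXECUTIONS_DATA_PRUNE purges executions older than 168 h). Deleting rows does not shrink a SQLite file by itself; freed pages are reused but only returned to the filesystem by a VACUUM, so the volume can stay large even when pruning works. n8n documents a DB_SQLITE_VACUUM_ON_STARTUP setting for this; confirm it against the docs for the deployed version.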

If it’s still bloated:

# Confirm pruning is configured (env vars present in the container)
ssh root@72.60.32.61 'docker exec root-n8n-1 env | grep EXECUTIONS_DATA'
# Force one manual prune (n8n does it every hour, but you can trigger by restart)
ssh root@72.60.32.61 'cd /root && docker compose restart n8n'
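
To turn the size check into a pass/fail result (for the acceptance criterion or a later alert), something like this could wrap du. A sketch with an illustrative function name; 1536 MiB is used as a stand-in for the 1.5 GB target.

```shell
# Check whether a directory tree is under a size limit.
# Prints the measured size and returns non-zero when the limit is exceeded.
volume_under_limit() {
  # $1 = path, $2 = limit in MiB
  kib=$(du -sk "$1" | awk '{ print $1 }') || return 2
  echo "$(( kib / 1024 )) MiB"
  [ "$kib" -lt $(( $2 * 1024 )) ]
}

# Usage on vps-h1:
#   volume_under_limit /var/lib/docker/volumes/n8n_data 1536
```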