Cloud Services — Operations Workbook

Covers: Cloudflare DNS, GitHub, Vercel, Wasabi S3, Mailgun EU. All are external SaaS dependencies of the p24-infra stack.


Cloudflare DNS

Architecture

Cloudflare manages DNS for zintegrowana.online (zone ID 57cb3d8f24c7cc319fb703394edc7b87, Free plan, DNS-only — no Cloudflare proxy). All infrastructure subdomains follow the pattern {service}.{vps-label}.infra.zintegrowana.online.

zintegrowana.online (Cloudflare Free, DNS-only)

├── *.vps-i1.infra.zintegrowana.online  →  A  217.154.82.162  (IONOS VPS)
├── *.vps-h1.infra.zintegrowana.online  →  A  72.60.32.61     (Hostinger VPS)
└── n8n-cloud.infra.zintegrowana.online →  CNAME p24.app.n8n.cloud

Wildcard A records cover all subdomains on each VPS — adding a new service requires only a Caddy/Traefik config change, no DNS change.
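
The naming pattern and the wildcard targets can be written down as a small lookup, which is handy when checking where a new service name will resolve. This is a minimal sketch: the helper names infra_fqdn and expected_ip are hypothetical and not part of dns-manager.py; the zone and IPs are the ones listed above.

```python
# Sketch: build an infra hostname from the naming pattern and look up the
# wildcard target it should resolve to. Helper names are hypothetical.
ZONE = "zintegrowana.online"
WILDCARD_TARGETS = {
    "vps-i1": "217.154.82.162",  # IONOS VPS
    "vps-h1": "72.60.32.61",     # Hostinger VPS
}


def infra_fqdn(service: str, vps_label: str) -> str:
    """Return {service}.{vps-label}.infra.<zone> for a known VPS label."""
    if vps_label not in WILDCARD_TARGETS:
        raise ValueError(f"unknown VPS label: {vps_label}")
    return f"{service}.{vps_label}.infra.{ZONE}"


def expected_ip(fqdn: str) -> str:
    """Return the IP that the matching wildcard record points to."""
    for label, ip in WILDCARD_TARGETS.items():
        if fqdn.endswith(f".{label}.infra.{ZONE}"):
            return ip
    raise ValueError(f"{fqdn} is not covered by a wildcard record")


print(infra_fqdn("grafana", "vps-i1"))
print(expected_ip("n8n.vps-h1.infra.zintegrowana.online"))
```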

DNS manager CLI (any VPS with CF_API_TOKEN + CF_ZONE_ID in env):

python3 /opt/p24-infra/scripts/dns-manager.py list
python3 /opt/p24-infra/scripts/dns-manager.py upsert <name> <ip>
python3 /opt/p24-infra/scripts/dns-manager.py delete <name>

Config Management

| Item | Managed via |
|---|---|
| Wildcard A records | dns-manager.py + Cloudflare API |
| API token CF_API_TOKEN | Cloudflare dashboard → My Profile → API Tokens |
| Scoped token CLOUDFLARE_TOKEN_ZINTEGROWANA | Same; restricted to DNS edit on zintegrowana.online |
| Zone config (TTL, security settings) | Cloudflare dashboard (manual) |

Zone config is minimal (Free plan, DNS-only). Record state is declarative and re-creatable from script. Zone ID is not a secret — committed to CLAUDE.md.

Backup

Export current DNS records at any time:

curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records?per_page=100" \
  | python3 -m json.tool > /tmp/cloudflare-dns-export-$(date +%F).json

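Run this before any bulk DNS change, then move the export from /tmp to docs/backups/ or upload it to Wasabi. The request returns a single page of up to 100 records, so a larger zone needs additional pages. The two wildcard records are also documented in CLAUDE.md and are trivial to re-create manually.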
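
If the zone ever grows past one page, the export can follow the pagination data that the Cloudflare API returns in result_info. The sketch below assumes the same CF_API_TOKEN and CF_ZONE_ID variables as the curl command; the function names are hypothetical, and the fetch parameter exists only so the paging logic can be exercised without network access.

```python
import json
import urllib.request

API = "https://api.cloudflare.com/client/v4"


def fetch_page(zone_id: str, token: str, page: int, per_page: int = 100) -> dict:
    """Fetch one page of DNS records from the Cloudflare API."""
    url = f"{API}/zones/{zone_id}/dns_records?page={page}&per_page={per_page}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def export_all(zone_id: str, token: str, fetch=fetch_page) -> list:
    """Collect records from every page, driven by result_info.total_pages."""
    records, page = [], 1
    while True:
        body = fetch(zone_id, token, page)
        records.extend(body["result"])
        total_pages = body.get("result_info", {}).get("total_pages", 1)
        if page >= total_pages:
            return records
        page += 1


# Usage (needs network access and real credentials):
#   import os
#   recs = export_all(os.environ["CF_ZONE_ID"], os.environ["CF_API_TOKEN"])
#   print(json.dumps(recs, indent=2))
```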

Restore

Records lost, zone still exists:

# Re-add wildcard records
python3 /opt/p24-infra/scripts/dns-manager.py upsert "*.vps-i1.infra.zintegrowana.online" 217.154.82.162
python3 /opt/p24-infra/scripts/dns-manager.py upsert "*.vps-h1.infra.zintegrowana.online" 72.60.32.61

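Recovery time: < 5 minutes. Cloudflare serves the re-created records within seconds; resolvers that cached an earlier answer catch up when that answer's TTL expires (60s is the lowest TTL Cloudflare allows on the Free plan). The n8n-cloud CNAME is not covered by the two commands above and must be re-created separately if it was lost.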

Zone deleted (worst case):

  1. Re-add zintegrowana.online to Cloudflare via dashboard.
  2. Update nameservers at registrar to point to Cloudflare nameservers.
  3. Wait for nameserver propagation (up to 24h).
  4. Re-apply all DNS records via dns-manager.py.

Healthcheck / Monitoring

No Prometheus alert. Manual check during incident:

dig grafana.vps-i1.infra.zintegrowana.online @1.1.1.1 +short
# Expected: 217.154.82.162
 
dig n8n.vps-h1.infra.zintegrowana.online @1.1.1.1 +short
# Expected: 72.60.32.61

Blackbox exporter DNS probe can be added to monitoring/prometheus/blackbox.yml if this becomes a pain point.
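
Until such a probe exists, the two dig checks can be scripted. This is a rough sketch, not an existing tool in the repo: check_dns is a hypothetical helper, and it uses the system resolver of the machine it runs on rather than querying 1.1.1.1 directly as the dig commands do.

```python
import socket

# Hostnames and expected answers taken from the manual dig checks above.
EXPECTED = {
    "grafana.vps-i1.infra.zintegrowana.online": "217.154.82.162",
    "n8n.vps-h1.infra.zintegrowana.online": "72.60.32.61",
}


def check_dns(expected=EXPECTED, resolve=socket.gethostbyname) -> dict:
    """Return {hostname: (ok, answer)}; resolution errors count as failures."""
    results = {}
    for host, want in expected.items():
        try:
            got = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            got = f"error: {exc}"
        results[host] = (got == want, got)
    return results


# Usage (needs working DNS):
#   for host, (ok, answer) in check_dns().items():
#       print("OK  " if ok else "FAIL", host, answer)
```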

Password / Credential Rotation

| Credential | Tracked entry | Rotation frequency |
|---|---|---|
| Cloudflare account password | Cloudflare dashboard + password manager | 180d |
| CLOUDFLARE_TOKEN_ZINTEGROWANA (DNS-edit scope) | dev_r_services / cloudflare-dns | 180d |
| CF_API_TOKEN (broader scope) | dev_r_services / cloudflare-dns | 180d |

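To rotate an API token:

  1. Cloudflare dashboard → My Profile → API Tokens → Create a new token with the same permissions as the one being replaced. Keep the old token active until step 4 so nothing breaks mid-rotation.
  2. Update the rotated token(s) (CF_API_TOKEN, CLOUDFLARE_TOKEN_ZINTEGROWANA) in:
    • monitoring/.env on vps-i1 (via SSH)
    • monitoring/.env on vps-h1 (via SSH)
    • GitHub Secrets: gh secret set CF_API_TOKEN -b "<new>" -R radieu/p24-infra
    • .env.local on local workstation
  3. Verify the new token works: python3 /opt/p24-infra/scripts/dns-manager.py list
  4. Delete the old token in the Cloudflare dashboard.
  5. Log rotation in docs/secrets-rotation-log.md.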

GitHub

Architecture

GitHub (github.com/radieu) hosts code, CI/CD via GitHub Actions, issue tracking, and PR reviews for both radieu/p24-infra and radieu/et-operational-platform. Self-hosted runners are not used — all Actions run on GitHub-hosted runners.

VPS AI agents (AI-Dev-IO1, AI-Dev-HS1) are collaborators with write access to both repos.

radieu/p24-infra              — infra config, monitoring stack, exporters, Ansible
radieu/et-operational-platform — Next.js frontend + backend

Config Management

| Item | Location |
|---|---|
| Repo code + history | Git (distributed: local + VPS clones) |
| GitHub Actions workflows | .github/workflows/ in each repo |
| Actions secrets | GitHub repository Secrets UI (no automated export) |
| Branch protection rules | GitHub UI |

Actions secrets cannot be exported: the GitHub UI and API expose only secret names and metadata, never the values. The authoritative copy of each secret value is in .env.local on the local workstation. Any new secret added to GitHub must also be added to .env.local.
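
Because .env.local is the only copy of the values, it is worth checking now and then that every secret name in GitHub has a matching key there. The sketch below is a hypothetical helper, not something in the repo; it assumes the secret names are supplied by hand or copied from the first column of gh secret list output.

```python
def env_keys(env_text: str) -> set:
    """Return variable names defined in .env-style text (KEY=value lines)."""
    keys = set()
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_from_env(secret_names, env_text: str) -> list:
    """Secret names that have no matching key in the .env text."""
    return sorted(set(secret_names) - env_keys(env_text))


# Usage:
#   with open(".env.local") as fh:
#       print(missing_from_env(["CF_API_TOKEN", "VERCEL_TOKEN"], fh.read()))
```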

Backup

All code is replicated across:

  • Local workstation (d:\code_2026\p24-infra)
  • IONOS VPS (/opt/p24-infra)
  • Hostinger VPS (/opt/p24-infra)
  • GitHub itself

Loss of GitHub access does not mean data loss — work from any local clone. Actions secrets are not replicated automatically: the only backup is .env.local on the local workstation.

Restore

GitHub outage: Work offline from local clone. Push when service recovers.

Repo accidentally deleted: Restore it from GitHub → Settings → Repositories → Deleted repositories within 90 days of deletion; contact GitHub Support if it is not listed there. Never share account credentials with support. In parallel, push from a local clone to a new repo.

Actions secret lost: Restore from .env.local → GitHub UI or gh secret set.

Healthcheck / Monitoring

No Prometheus probe. GitHub provides its own status page at githubstatus.com.

Manual check if CI is broken:

gh run list --repo radieu/p24-infra --limit 5

Password / Credential Rotation

| Credential | Tracked entry | Rotation frequency |
|---|---|---|
| GitHub account password | Password manager + 2FA | 180d |
| GH_TOKEN (PAT: runner registration + API) | dev_r_services / github | 90d |
| GH_PAT (PAT: health-check workflow) | dev_r_services / github | 90d |

To rotate a PAT:

  1. GitHub → Settings → Developer settings → Personal access tokens → Generate new token.
  2. Set same scopes as old token (repo, workflow, write:packages as needed).
  3. Update in:
    • gh secret set GH_TOKEN -b "<new>" -R radieu/p24-infra
    • gh secret set GH_PAT -b "<new>" -R radieu/p24-infra
    • .env.local on local workstation
  4. Delete old token in GitHub UI.
  5. Log rotation in docs/secrets-rotation-log.md.

Vercel

Architecture

Vercel hosts the et-operational-platform Next.js frontend. Deployments are triggered automatically on push to main (production) and on PR branches (preview). The vercel-exporter container on vps-i1 polls the Vercel API every 5 minutes and exposes deployment state as Prometheus metrics.

GitHub push → Vercel build → Production deployment
                              └── vercel-exporter (port 9202) ─► Prometheus ─► Grafana

Production URL: https://et-operational-platform.vercel.app (plus any custom domain configured in Vercel)

Config Management

| Item | Location | In repo? |
|---|---|---|
| vercel.json | et-operational-platform/ repo root | Yes |
| Environment variables | Vercel dashboard + .env.local on local workstation | No (Vercel dashboard) |
| Project link | .vercel/ directory in repo | Yes |
| vercel-exporter config | monitoring/exporters/vercel-exporter/ | Yes |

Environment variables set in Vercel must be mirrored in .env.local. Do not rely solely on the Vercel dashboard — it has no export API for secret values.

Backup

| Data | Backup |
|---|---|
| Source code | Git repo (GitHub + local clones) |
| Build artifacts | Vercel stores last N deployments, available for instant rollback |
| Environment variables | .env.local on local workstation |
| Project config (vercel.json) | Git repo |

Restore

Scenario 1: Bad deployment — rollback in Vercel:

# Via CLI
vercel rollback [deployment-url]
 
# Via dashboard: Vercel → Project → Deployments → select target → Promote to Production

Scenario 2: Project accidentally deleted or Vercel account lost:

  1. Create new Vercel project, link to GitHub repo.
  2. Re-add all environment variables from .env.local.
  3. Push to main to trigger first deployment.

Recovery time: < 10 minutes from code if env vars are ready.

Healthcheck / Monitoring

vercel-exporter (port 9202) polls https://api.vercel.com/v6/deployments for the last 20 deployments every 5 minutes. Exposes:

  • vercel_deployment_state — gauge by project + deployment URL
  • vercel_deployments_total — count by project + state

Prometheus rule VercelDeploymentFailed alerts if any production deployment enters ERROR state.
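
To make the two metrics concrete, the sketch below shows one way a deployment list could be turned into Prometheus exposition text. It is an illustration, not the exporter's actual code: the input field names (name, url, state) and the 1/0 encoding of vercel_deployment_state are assumptions.

```python
from collections import Counter


def render_metrics(deployments: list) -> str:
    """Render Prometheus text exposition for a list of deployment dicts."""
    lines = ["# TYPE vercel_deployments_total gauge"]
    counts = Counter((d["name"], d["state"]) for d in deployments)
    for (project, state), n in sorted(counts.items()):
        lines.append(
            f'vercel_deployments_total{{project="{project}",state="{state}"}} {n}'
        )
    lines.append("# TYPE vercel_deployment_state gauge")
    for d in deployments:
        # Assumed encoding: 1 when the deployment is READY, 0 otherwise.
        value = 1 if d["state"] == "READY" else 0
        lines.append(
            f'vercel_deployment_state{{project="{d["name"]}",url="{d["url"]}"}} {value}'
        )
    return "\n".join(lines) + "\n"


sample = [
    {"name": "et-operational-platform", "url": "a.vercel.app", "state": "READY"},
    {"name": "et-operational-platform", "url": "b.vercel.app", "state": "ERROR"},
]
print(render_metrics(sample))
```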

Blackbox probe to the production health endpoint:

curl -s -o /dev/null -w "%{http_code}\n" https://et-operational-platform.vercel.app/api/health
# Expected: 200

Password / Credential Rotation

| Credential | Tracked entry | Rotation frequency |
|---|---|---|
| Vercel account password | Password manager | 180d |
| VERCEL_TOKEN (API token) | dev_r_services / vercel | 90d |

Last rotated: 2026-05-08.

To rotate VERCEL_TOKEN:

  1. Vercel dashboard → Settings → Tokens → Create new token (full access or scoped as needed).
  2. Update in:
    • gh secret set VERCEL_TOKEN -b "<new>" -R radieu/p24-infra
    • gh secret set VERCEL_TOKEN -b "<new>" -R radieu/et-operational-platform
    • .env on vps-i1 (vercel-exporter reads this)
    • .env.local on local workstation
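  3. Recreate vercel-exporter on vps-i1 so it starts with the new value: docker compose up -d --force-recreate vercel-exporter (docker compose restart reuses the existing container and keeps its old environment).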
  4. Delete old token in Vercel dashboard.
  5. Log rotation in docs/secrets-rotation-log.md.

Wasabi S3

Architecture

Wasabi S3 provides long-term object storage across two regions:

  1. Prometheus metrics: Thanos sidecar uploads 2h TSDB blocks from vps-i1 continuously to s3://ecotrans-monitoring/ (eu-central-1)
  2. Traccar DB backups: nightly mysqldump uploaded to s3://ecotrans-monitoring/traccar/ (eu-central-1)
  3. Grafana volume backups: nightly grafana_data tar uploaded to s3://p24-infra/grafana/ (eu-central-2, via backup-ionos GH Action)
  4. Supabase backup metrics: backup-exporter reads backups/supabase/metrics/backup-status.prom from s3://p24-infra/ (eu-central-2) to expose backup freshness to Prometheus

vps-i1
├── thanos-sidecar ──────────────────────────────► s3://ecotrans-monitoring/         (eu-central-1)
│   (continuous, 2h blocks, prometheus metrics)
├── backup script (nightly) ────────────────────► s3://ecotrans-monitoring/traccar/  (eu-central-1)
│   (Traccar mysqldump)
├── GH Action grafana-backup.yml (nightly) ─────► s3://p24-infra/grafana/            (eu-central-2)
│   (grafana_data volume tar.gz)
└── backup-exporter (reads) ─────────────────────► s3://p24-infra/backups/supabase/metrics/  (eu-central-2)
    (backup-status.prom written by supabase-backup GHA workflow)
 
GitHub Actions (supabase-backup workflow)
└── writes backup-status.prom ──────────────────► s3://p24-infra/backups/supabase/metrics/  (eu-central-2)

Buckets:

| Bucket | Region | Endpoint | Purpose | IAM key used |
|---|---|---|---|---|
| ecotrans-monitoring | eu-central-1 | s3.eu-central-1.wasabisys.com | Production: Thanos metrics + Traccar backups | WASABI_ACCESS_KEY |
| ecotrans-monitoring-test | eu-central-1 | s3.eu-central-1.wasabisys.com | Testing only; never use for production data | WASABI_ACCESS_KEY |
| p24-infra | eu-central-2 | s3.eu-central-2.wasabisys.com | Grafana backups + Supabase backup metrics | P24_INFRA_WASABI_ACCESS_KEY |

IAM users (Wasabi account 100000049371):

| IAM user | ARN | Keys stored in | Buckets accessed |
|---|---|---|---|
| p24-infra | arn:aws:iam::100000049371:user/p24-infra | P24_INFRA_WASABI_ACCESS_KEY/SECRET_KEY | p24-infra (eu-central-2) |
| (monitoring user) | (not listed) | WASABI_ACCESS_KEY/SECRET_KEY | ecotrans-monitoring (eu-central-1) |

There is no cross-region replication. Loss of eu-central-1 affects long-term Prometheus history. Loss of eu-central-2 affects Grafana backup restore capability and Supabase backup monitoring visibility.

Config Management

| File | In repo? | Purpose |
|---|---|---|
| monitoring/thanos/s3.yml | Yes (template) | Wasabi config for Thanos (eu-central-1); credentials injected from .env at runtime |
| monitoring/.env | No (.env.example only) | Contains WASABI_ACCESS_KEY, WASABI_SECRET_KEY, P24_INFRA_WASABI_ACCESS_KEY, P24_INFRA_WASABI_SECRET_KEY |
| GH Secrets | No (GH UI) | WASABI_ACCESS_KEY, WASABI_SECRET_KEY, P24_INFRA_WASABI_ACCESS_KEY, P24_INFRA_WASABI_SECRET_KEY |
| .env.local (local) | No | P24_INFRA_WASABI_ACCESS_KEY, P24_INFRA_WASABI_SECRET_KEY (and monitoring keys) |

Do not commit credentials. The s3.yml template in the repo contains placeholders resolved at runtime.

Critical: The backup-exporter container uses P24_INFRA_WASABI_ACCESS_KEY / P24_INFRA_WASABI_SECRET_KEY (eu-central-2, bucket p24-infra). Do NOT use the general WASABI_ACCESS_KEY for it — different region, different bucket.
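
One way to keep the bucket, region, and key pairing from being mixed up is to resolve all three from a single table. The sketch below is illustrative only: s3_settings is a hypothetical helper, and the mapping simply restates the Buckets table above.

```python
import os

# bucket -> (region, env-var prefix of the key pair that may access it)
BUCKETS = {
    "ecotrans-monitoring": ("eu-central-1", "WASABI"),
    "ecotrans-monitoring-test": ("eu-central-1", "WASABI"),
    "p24-infra": ("eu-central-2", "P24_INFRA_WASABI"),
}


def s3_settings(bucket: str, env=os.environ) -> dict:
    """Return endpoint and credentials for a bucket; KeyError if unknown."""
    region, prefix = BUCKETS[bucket]
    return {
        "bucket": bucket,
        "region": region,
        "endpoint": f"https://s3.{region}.wasabisys.com",
        "access_key": env[f"{prefix}_ACCESS_KEY"],
        "secret_key": env[f"{prefix}_SECRET_KEY"],
    }


# Usage (with monitoring/.env loaded into the environment):
#   cfg = s3_settings("p24-infra")
#   print(cfg["endpoint"], cfg["region"])
```

A lookup like this fails loudly on an unknown bucket or a missing variable instead of silently falling back to the wrong key pair.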

Backup

Wasabi is the backup target; the buckets themselves are not backed up elsewhere. This is an accepted risk: long-term metrics are only as available as Wasabi eu-central-1. If ecotrans-monitoring is lost, the only remaining history is the 15 days of local TSDB that Prometheus retains on vps-i1.

Restore

Restore Prometheus metrics from Wasabi:

# List blocks in the bucket
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  quay.io/thanos/thanos:latest \
  tools bucket ls --objstore.config-file /s3.yml
 
# Download a specific block (for manual inspection)
s3cmd get s3://ecotrans-monitoring/<BLOCK_ULID>/ /tmp/block/ --recursive \
  --host=s3.eu-central-1.wasabisys.com

Thanos Query reads directly from Wasabi in normal operation — no restore needed for query access. Restore is only necessary if rebuilding a local Prometheus from scratch.
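
When picking a block to download, it helps to know roughly when it was written. Block directory names are ULIDs, whose first 10 characters encode a millisecond timestamp in Crockford base32. The sketch below decodes that prefix; the function names are hypothetical, and the timestamp reflects when the block was created, not the time range of the samples inside it.

```python
from datetime import datetime, timezone

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid: str) -> int:
    """Decode the 48-bit millisecond timestamp from a ULID's first 10 chars."""
    if len(ulid) != 26:
        raise ValueError("a ULID has 26 characters")
    ms = 0
    for ch in ulid[:10].upper():
        ms = ms * 32 + CROCKFORD.index(ch)  # ValueError on an invalid character
    return ms


def ulid_datetime(ulid: str) -> datetime:
    """Return the ULID creation time as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ulid_timestamp_ms(ulid) / 1000, tz=timezone.utc)


# Example ULID taken from the ULID specification, not from the bucket.
print(ulid_datetime("01ARYZ6S41TSV4RRFFQ69G5FAV").isoformat())
```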

Restore Traccar backup:

s3cmd get s3://ecotrans-monitoring/traccar/traccar-YYYY-MM-DD.sql.gz /tmp/ \
  --host=s3.eu-central-1.wasabisys.com
gunzip /tmp/traccar-YYYY-MM-DD.sql.gz
mysql -u traccar -p traccar < /tmp/traccar-YYYY-MM-DD.sql

Restore Grafana backup: see docs/grafana-operations.md → Restore → Scenario 2.

Healthcheck / Monitoring

backup-exporter (port 9220) fetches backups/supabase/metrics/backup-status.prom from s3://p24-infra/ (eu-central-2) on each scrape and re-exposes the metrics to Prometheus. Prometheus rule BackupStale fires if the last backup is older than 26 hours. The exporter uses P24_INFRA_WASABI_ACCESS_KEY / P24_INFRA_WASABI_SECRET_KEY.
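
The 26-hour rule can be reproduced by hand against a downloaded backup-status.prom. The sketch below is an assumption-laden illustration: the metric name backup_last_success_timestamp_seconds is hypothetical (check the actual file for the real name), and the parser handles only simple samples without a "}" inside label values.

```python
import time

STALE_AFTER_SECONDS = 26 * 3600  # threshold described for the BackupStale rule


def metric_value(prom_text: str, metric: str) -> float:
    """Return the value of the first sample of `metric` in exposition text."""
    for line in prom_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0].split()[0]
        if name != metric:
            continue
        # The value is the first token after the optional label set.
        rest = line.split("}", 1)[1] if "}" in line else line[len(name):]
        return float(rest.split()[0])
    raise KeyError(metric)


def is_stale(prom_text, metric="backup_last_success_timestamp_seconds", now=None):
    """True when the timestamp metric is older than the 26h threshold."""
    now = time.time() if now is None else now
    return now - metric_value(prom_text, metric) > STALE_AFTER_SECONDS
```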

Manual freshness check for Thanos/ecotrans-monitoring:

s3cmd ls s3://ecotrans-monitoring/ --host=s3.eu-central-1.wasabisys.com | tail -5
# Verify recent timestamps
 
docker compose logs thanos-sidecar | tail -20
# Look for: "uploaded block" or errors

Manual freshness check for p24-infra bucket (Supabase backup metrics):

s3cmd ls s3://p24-infra/backups/supabase/metrics/ --host=s3.eu-central-2.wasabisys.com
# Should show a recently updated backup-status.prom

Password / Credential Rotation

| Credential | IAM user | Tracked entry | Rotation frequency | Last rotated |
|---|---|---|---|---|
| WASABI_ACCESS_KEY + WASABI_SECRET_KEY (monitoring bucket, eu-central-1) | monitoring user | dev_r_services / wasabi-s3 | 180d | 2026-05-13 |
| P24_INFRA_WASABI_ACCESS_KEY + P24_INFRA_WASABI_SECRET_KEY (p24-infra bucket, eu-central-2) | p24-infra | dev_r_services / wasabi-s3 | 90d | 2026-06-12 |
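
The next due date follows directly from the last rotation date plus the frequency. A small sketch of that arithmetic, using the two rows above as input (the helper names are hypothetical):

```python
from datetime import date, timedelta


def next_rotation(last_rotated: str, frequency: str) -> date:
    """last_rotated is YYYY-MM-DD; frequency is a day count such as '90d'."""
    if not frequency.endswith("d"):
        raise ValueError("frequency must look like '90d'")
    return date.fromisoformat(last_rotated) + timedelta(days=int(frequency[:-1]))


def is_overdue(last_rotated: str, frequency: str, today=None) -> bool:
    """True once today's date is past the computed due date."""
    today = date.today() if today is None else today
    return today > next_rotation(last_rotated, frequency)


print(next_rotation("2026-05-13", "180d"))  # 2026-11-09
print(next_rotation("2026-06-12", "90d"))   # 2026-09-10
```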

Important: When rotating the p24-infra IAM key from the IONOS VPS, use the Wasabi IAM API rather than the console web UI. Working from the VPS also avoids the SSL compatibility issue that console.wasabisys.com has with some Windows clients.

To rotate P24_INFRA_WASABI_ACCESS_KEY (p24-infra IAM user, eu-central-2):

# Step 1: SSH into vps-i1 (or run from local if SSL is fine)
ssh root@217.154.82.162
 
# Step 2: Create a new key via Wasabi IAM API
# (Requires admin-level Wasabi access key with IAM permissions)
# Use the Wasabi console → IAM → Users → p24-infra → Security credentials → Create access key
# OR via API:
# curl -s -X POST "https://iam.wasabisys.com/" \
#   -H "Authorization: AWS4-HMAC-SHA256 ..." \
#   --data "Action=CreateAccessKey&UserName=p24-infra"
 
# Step 3: Update /opt/p24-infra/monitoring/.env on vps-i1
# P24_INFRA_WASABI_ACCESS_KEY=<new_access_key>
# P24_INFRA_WASABI_SECRET_KEY=<new_secret_key>
 
# Step 4: Recreate backup-exporter so it starts with the new credentials
# (docker compose restart reuses the container and keeps its old environment)
cd /opt/p24-infra/monitoring && docker compose up -d --force-recreate backup-exporter
 
# Step 5: Verify backup-exporter can read from Wasabi
docker compose logs --tail=20 backup-exporter
curl -s http://localhost:9220/metrics | grep backup_
 
# Step 6: Update GitHub Secrets
gh secret set P24_INFRA_WASABI_ACCESS_KEY -b "<new_access_key>" -R radieu/p24-infra
gh secret set P24_INFRA_WASABI_SECRET_KEY -b "<new_secret_key>" -R radieu/p24-infra
 
# Step 7: Update .env.local on local workstation
# P24_INFRA_WASABI_ACCESS_KEY=<new_access_key>
# P24_INFRA_WASABI_SECRET_KEY=<new_secret_key>
 
# Step 8: Delete the old key from Wasabi console / IAM API
 
# Step 9: Log rotation
# Append to docs/secrets-rotation-log.md and update dev_r_services

To rotate WASABI_ACCESS_KEY (monitoring bucket, eu-central-1):

  1. Wasabi console → Access Keys → Create new key pair.
  2. Update .env on vps-i1 (edit /opt/p24-infra/monitoring/.env):
    • WASABI_ACCESS_KEY=<new_access_key>
    • WASABI_SECRET_KEY=<new_secret_key>
  3. Recreate the Thanos sidecar so it starts with the new credentials (docker compose restart reuses the container and keeps its old environment):
    cd /opt/p24-infra/monitoring && docker compose up -d --force-recreate thanos-sidecar
  4. Verify Thanos upload resumes before removing the old key: docker compose logs thanos-sidecar | grep -i upload
  5. Update GH Secrets:
    gh secret set WASABI_ACCESS_KEY -b "<new_access_key>" -R radieu/p24-infra
    gh secret set WASABI_SECRET_KEY -b "<new_secret_key>" -R radieu/p24-infra
  6. Update .env.local on local workstation.
  7. Delete old access key in Wasabi console.
  8. Log rotation in docs/secrets-rotation-log.md.

Mailgun EU

Architecture

Mailgun EU is the SMTP relay used by Alertmanager to deliver alert emails to radieu@gmail.com. It is a stateless relay — no data is persisted here; it is not a backup target.

Alertmanager (vps-i1:9093)
└── SMTP → smtp.eu.mailgun.org:587 (STARTTLS) → radieu@gmail.com

SMTP config:

| Field | Value |
|---|---|
| Host | smtp.eu.mailgun.org |
| Port | 587 |
| Encryption | STARTTLS |
| Auth | Username + password |
| Sender domain | Configured in Mailgun EU account |

Config Management

| File | In repo? | Contains secrets? |
|---|---|---|
| monitoring/alertmanager/alertmanager.yml | Yes | No (credentials via env) |
| monitoring/.env | No | Yes: SMTP_HOST, SMTP_USER, SMTP_PASSWORD |

Alertmanager reads SMTP_USER and SMTP_PASSWORD from environment variables injected via .env at container start.

Backup

Not applicable. Mailgun is a stateless SMTP relay. There is no data to back up. Configuration (domain, sending limits) is managed in the Mailgun EU dashboard.

If Mailgun becomes unavailable, the fallback is to switch alertmanager.yml to another SMTP provider and restart alertmanager. No data is lost.

Restore

Alertmanager stops sending email:

  1. Check alertmanager logs: docker compose logs alertmanager | tail -30
  2. Verify SMTP credentials in .env: SMTP_USER, SMTP_PASSWORD
  3. Test SMTP connectivity from vps-i1:
    curl --url "smtp://smtp.eu.mailgun.org:587" \
      --ssl-reqd --mail-from sender@domain.com \
      --mail-rcpt radieu@gmail.com \
      --user "${SMTP_USER}:${SMTP_PASSWORD}" \
      -T /dev/null
  4. If credentials are correct but delivery fails, check Mailgun dashboard for account suspension or quota exhaustion.
  5. Hot-reload alertmanager after any config fix: curl -X POST http://localhost:9093/-/reload

Switching to a backup SMTP provider:

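  1. Edit monitoring/alertmanager/alertmanager.yml: update smtp_smarthost, smtp_auth_username, smtp_auth_password.
  2. Update .env on vps-i1 with new credentials.
  3. Recreate the container so both the edited config and the new environment take effect (a reload re-reads the config file only): docker compose up -d --force-recreate alertmanager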

Healthcheck / Monitoring

No dedicated Prometheus probe. Alertmanager itself is monitored via the /-/healthy Docker healthcheck and external blackbox probe.

To verify email delivery end-to-end, send a test alert:

# Fire a test alert via Alertmanager API
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Manual test alert"}}]'
# Check radieu@gmail.com within 2 minutes
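
A test alert that carries an explicit endsAt resolves on its own instead of lingering until Alertmanager's resolve timeout. The sketch below only builds the JSON body for that request; build_test_alert is a hypothetical helper, and the labels match the curl example above.

```python
import json
from datetime import datetime, timedelta, timezone


def build_test_alert(minutes: int = 5, now=None) -> str:
    """Return a JSON array with one alert that expires after `minutes`."""
    now = datetime.now(timezone.utc) if now is None else now
    alert = {
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": "Manual test alert"},
        # RFC 3339 timestamps; endsAt marks when the alert counts as resolved.
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }
    return json.dumps([alert])


print(build_test_alert())
```

The printed body can be passed to curl with -d in place of the inline JSON.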

Check delivery rates and bounce reports in the Mailgun EU dashboard.

Password / Credential Rotation

| Credential | Tracked entry | Rotation frequency |
|---|---|---|
| SMTP_USER + SMTP_PASSWORD | dev_r_services / mailgun-eu | 365d |

To rotate:

  1. Mailgun EU dashboard → Sending → Domain settings → SMTP credentials → Reset password (or create new credential and delete old).
  2. Update .env on vps-i1:
    # Edit /opt/p24-infra/monitoring/.env
    SMTP_USER=<new_user>
    SMTP_PASSWORD=<new_pass>
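  3. Recreate alertmanager so it starts with the new environment (a reload or SIGHUP re-reads the config file only, and docker compose restart keeps the old container environment):
    docker compose up -d --force-recreate alertmanager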
  4. Update GH Secrets:
    gh secret set SMTP_USER -b "<new_user>" -R radieu/p24-infra
    gh secret set SMTP_PASSWORD -b "<new_pass>" -R radieu/p24-infra
  5. Update .env.local on local workstation.
  6. Send test alert to verify delivery (see Healthcheck section).
  7. Log rotation in docs/secrets-rotation-log.md.