Cloud Services — Operations Workbook

Covers: Cloudflare DNS, GitHub, Vercel, Wasabi S3, Mailgun EU. All are external SaaS dependencies of the p24-infra stack.

Cloudflare DNS

Architecture

Cloudflare manages DNS for zintegrowana.online (zone ID 57cb3d8f24c7cc319fb703394edc7b87, Free plan, DNS-only — no Cloudflare proxy). All infrastructure subdomains follow the pattern {service}.{vps-label}.infra.zintegrowana.online.

zintegrowana.online (Cloudflare Free, DNS-only)
│
├── *.vps-i1.infra.zintegrowana.online  →  A  217.154.82.162  (IONOS VPS)
├── *.vps-h1.infra.zintegrowana.online  →  A  72.60.32.61     (Hostinger VPS)
└── n8n-cloud.infra.zintegrowana.online →  CNAME p24.app.n8n.cloud

Wildcard A records cover all subdomains on each VPS — adding a new service requires only a Caddy/Traefik config change, no DNS change.
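
The naming pattern and wildcard mapping can be illustrated with a small sketch. It builds a hostname from a service name and a VPS label and looks up the address the wildcard record should return. `WILDCARD_TARGETS`, `service_hostname`, and `expected_ip` are names invented for this example and are not part of `dns-manager.py`.

```python
# Sketch only: these names are illustrative, not part of dns-manager.py.
# The IPs are the wildcard targets listed in the tree above.
BASE_DOMAIN = "infra.zintegrowana.online"

WILDCARD_TARGETS = {
    "vps-i1": "217.154.82.162",  # IONOS VPS
    "vps-h1": "72.60.32.61",     # Hostinger VPS
}


def service_hostname(service: str, vps_label: str) -> str:
    """Build {service}.{vps-label}.infra.zintegrowana.online."""
    if vps_label not in WILDCARD_TARGETS:
        raise ValueError(f"unknown VPS label: {vps_label}")
    return f"{service}.{vps_label}.{BASE_DOMAIN}"


def expected_ip(hostname: str) -> str:
    """Return the IP the matching wildcard A record points at."""
    suffix = "." + BASE_DOMAIN
    if not hostname.endswith(suffix):
        raise ValueError(f"not an infra hostname: {hostname}")
    # The label directly left of the base domain is the VPS label.
    vps_label = hostname[: -len(suffix)].split(".")[-1]
    if vps_label not in WILDCARD_TARGETS:
        raise ValueError(f"no wildcard record covers: {hostname}")
    return WILDCARD_TARGETS[vps_label]


if __name__ == "__main__":
    host = service_hostname("grafana", "vps-i1")
    print(host, "->", expected_ip(host))
```

Names outside the two wildcards, such as the `n8n-cloud` CNAME, are deliberately rejected by `expected_ip` in this sketch.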

DNS manager CLI (any VPS with CF_API_TOKEN + CF_ZONE_ID in env):

python3 /opt/p24-infra/scripts/dns-manager.py list
python3 /opt/p24-infra/scripts/dns-manager.py upsert <name> <ip>
python3 /opt/p24-infra/scripts/dns-manager.py delete <name>

Config Management

Item	Managed via
Wildcard A records	`dns-manager.py` + Cloudflare API
API token `CF_API_TOKEN`	Cloudflare dashboard → My Profile → API Tokens
Scoped token `CLOUDFLARE_TOKEN_ZINTEGROWANA`	Same — restricted to DNS edit on `zintegrowana.online`
Zone config (TTL, security settings)	Cloudflare dashboard (manual)

Zone config is minimal (Free plan, DNS-only). Record state is declarative and re-creatable from script. Zone ID is not a secret — committed to CLAUDE.md.

Backup

Export current DNS records at any time:

curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records?per_page=100" \
  | python3 -m json.tool > /tmp/cloudflare-dns-export-$(date +%F).json

Run this before any bulk DNS change. Save the output to docs/backups/ or upload to Wasabi. The two wildcard records are documented in CLAUDE.md — trivial to re-create manually.
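
To make an export useful after a bulk change, two saved exports can be compared. The sketch below assumes the export has the usual Cloudflare v4 list shape (a `result` array of records with `type`, `name`, `content`, plus a `result_info` object with `count` and `total_count`); the function names are made up for this example.

```python
# Sketch only: assumes the Cloudflare v4 list-response shape described above.
def summarize_export(export: dict) -> list[tuple[str, str, str]]:
    """Return sorted (type, name, content) tuples from a DNS export."""
    records = export.get("result") or []
    return sorted((r["type"], r["name"], r["content"]) for r in records)


def is_truncated(export: dict) -> bool:
    """True if the zone holds more records than this single page returned."""
    info = export.get("result_info") or {}
    return info.get("total_count", 0) > info.get("count", 0)


def diff_exports(before: dict, after: dict) -> dict:
    """Report records that disappeared or appeared between two exports."""
    old = set(summarize_export(before))
    new = set(summarize_export(after))
    return {"removed": sorted(old - new), "added": sorted(new - old)}


if __name__ == "__main__":
    before = {"result": [
        {"type": "A", "name": "*.vps-i1.infra.zintegrowana.online",
         "content": "217.154.82.162"},
        {"type": "A", "name": "*.vps-h1.infra.zintegrowana.online",
         "content": "72.60.32.61"},
    ], "result_info": {"count": 2, "total_count": 2}}
    after = {"result": before["result"][:1],
             "result_info": {"count": 1, "total_count": 1}}
    print(diff_exports(before, after))
```

`is_truncated` is a guard for the `per_page=100` limit in the export command: if it returns true, the export does not contain every record in the zone.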

Restore

Records lost, zone still exists:

# Re-add wildcard records
python3 /opt/p24-infra/scripts/dns-manager.py upsert "*.vps-i1.infra.zintegrowana.online" 217.154.82.162
python3 /opt/p24-infra/scripts/dns-manager.py upsert "*.vps-h1.infra.zintegrowana.online" 72.60.32.61

Recovery time: < 5 minutes. DNS propagation via Cloudflare is near-instant (< 60s TTL).

Zone deleted (worst case):

Re-add zintegrowana.online to Cloudflare via dashboard.
Update nameservers at registrar to point to Cloudflare nameservers.
Nameserver propagation: up to 24h.
Re-apply all DNS records via dns-manager.py.

Healthcheck / Monitoring

No Prometheus alert. Manual check during incident:

dig grafana.vps-i1.infra.zintegrowana.online @1.1.1.1 +short
# Expected: 217.154.82.162
 
dig n8n.vps-h1.infra.zintegrowana.online @1.1.1.1 +short
# Expected: 72.60.32.61

Blackbox exporter DNS probe can be added to monitoring/prometheus/blackbox.yml if this becomes a pain point.
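
Until such a probe exists, the manual check can be scripted. This is a minimal sketch, not an existing script in the repo: `check_dns` and `EXPECTED` are invented names, and unlike the `dig` commands it uses the system resolver rather than querying 1.1.1.1 directly.

```python
# Sketch only: uses the system resolver, not @1.1.1.1 as in the dig commands.
import socket
from typing import Callable

EXPECTED = {
    "grafana.vps-i1.infra.zintegrowana.online": "217.154.82.162",
    "n8n.vps-h1.infra.zintegrowana.online": "72.60.32.61",
}


def check_dns(expected: dict[str, str],
              resolve: Callable[[str], str] = socket.gethostbyname) -> dict[str, str]:
    """Return {hostname: problem} for every name that fails the check."""
    problems = {}
    for host, want in expected.items():
        try:
            got = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[host] = f"lookup failed: {exc}"
            continue
        if got != want:
            problems[host] = f"expected {want}, got {got}"
    return problems


if __name__ == "__main__":
    failures = check_dns(EXPECTED)
    print("OK" if not failures else failures)
```

The resolver is passed in as a parameter so the comparison logic can be exercised without network access.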

Password / Credential Rotation

Credential	Tracked entry	Rotation frequency
Cloudflare account password	Cloudflare dashboard + password manager	180d
`CLOUDFLARE_TOKEN_ZINTEGROWANA` (DNS-edit scope)	`dev_r_services` — `cloudflare-dns`	180d
`CF_API_TOKEN` (broader scope)	`dev_r_services` — `cloudflare-dns`	180d

To rotate API token:

Cloudflare dashboard → My Profile → API Tokens → Create the new token first; keep the old one active until every consumer is updated.
Update CF_API_TOKEN + CLOUDFLARE_TOKEN_ZINTEGROWANA in:
- monitoring/.env on vps-i1 (via SSH)
- monitoring/.env on vps-h1 (via SSH)
- GitHub Secrets: gh secret set CF_API_TOKEN -b "<new>" -R radieu/p24-infra
- .env.local on local workstation
Delete the old token in the Cloudflare dashboard.
Log rotation in docs/secrets-rotation-log.md.

GitHub

Architecture

GitHub (github.com/radieu) hosts code, CI/CD via GitHub Actions, issue tracking, and PR reviews for both radieu/p24-infra and radieu/et-operational-platform. Self-hosted runners are not used — all Actions run on GitHub-hosted runners.

VPS AI agents (AI-Dev-IO1, AI-Dev-HS1) are collaborators with write access to both repos.

radieu/p24-infra              — infra config, monitoring stack, exporters, Ansible
radieu/et-operational-platform — Next.js frontend + backend

Config Management

Item	Location
Repo code + history	Git (distributed — local + VPS clones)
GitHub Actions workflows	`.github/workflows/` in each repo
Actions secrets	GitHub repository Secrets UI (no automated export)
Branch protection rules	GitHub UI

Actions secrets have no export API. The authoritative copy of each secret value is in .env.local on the local workstation. Any new secret added to GitHub must also be added to .env.local.
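
Because `.env.local` is the only copy of the secret values, it is worth checking that every secret name known to GitHub also appears there. The following is a rough sketch: the list of secret names is assumed to be supplied by the operator (for example copied from the repository Secrets UI), and both function names are hypothetical.

```python
# Sketch only: compares secret NAMES, never values.
def parse_env_keys(text: str) -> set[str]:
    """Collect variable names from dotenv-style text."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.add(key)
    return keys


def missing_from_env(secret_names: list[str], env_text: str) -> list[str]:
    """Names that exist as GitHub secrets but have no entry in the env file."""
    return sorted(set(secret_names) - parse_env_keys(env_text))


if __name__ == "__main__":
    env_text = "CF_API_TOKEN=placeholder\nVERCEL_TOKEN=placeholder\n"
    print(missing_from_env(["CF_API_TOKEN", "GH_PAT", "VERCEL_TOKEN"], env_text))
```

A non-empty result means a secret would be unrecoverable if it were lost on the GitHub side.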

Backup

All code is replicated across:

Local workstation (d:\code_2026\p24-infra)
IONOS VPS (/opt/p24-infra)
Hostinger VPS (/opt/p24-infra)
GitHub itself

Loss of GitHub access does not mean data loss — work from any local clone. Actions secrets are not replicated automatically: the only backup is .env.local on the local workstation.

Restore

GitHub outage: Work offline from local clone. Push when service recovers.

Repo accidentally deleted: Restore it within 90 days of deletion from GitHub → Settings → Repositories → Deleted repositories; contact GitHub Support if the repo is not listed there. In parallel, push from a local clone to a new repo.

Actions secret lost: Restore from .env.local → GitHub UI or gh secret set.

Healthcheck / Monitoring

No Prometheus probe. GitHub provides its own status page at githubstatus.com.

Manual check if CI is broken:

gh run list --repo radieu/p24-infra --limit 5

Password / Credential Rotation

Credential	Tracked entry	Rotation frequency
GitHub account password	Password manager + 2FA	180d
`GH_TOKEN` (PAT — runner registration + API)	`dev_r_services` — `github`	90d
`GH_PAT` (PAT — health-check workflow)	`dev_r_services` — `github`	90d

To rotate a PAT:

GitHub → Settings → Developer settings → Personal access tokens → Generate new token.
Set same scopes as old token (repo, workflow, write:packages as needed).
Update in:
- gh secret set GH_TOKEN -b "<new>" -R radieu/p24-infra
- gh secret set GH_PAT -b "<new>" -R radieu/p24-infra
- .env.local on local workstation
Delete old token in GitHub UI.
Log rotation in docs/secrets-rotation-log.md.

Vercel

Architecture

Vercel hosts the et-operational-platform Next.js frontend. Deployments are triggered automatically on push to main (production) and on PR branches (preview). The vercel-exporter container on vps-i1 polls the Vercel API every 5 minutes and exposes deployment state as Prometheus metrics.

GitHub push → Vercel build → Production deployment
                              └── vercel-exporter (port 9202) ─► Prometheus ─► Grafana

Production URL: https://et-operational-platform.vercel.app (plus any custom domain configured in Vercel)

Config Management

Item	Location	In repo?
`vercel.json`	`et-operational-platform/` repo root	Yes
Environment variables	Vercel dashboard + `.env.local` on local workstation	No (dashboard only)
Project link	`.vercel/` directory in repo	Yes
vercel-exporter config	`monitoring/exporters/vercel-exporter/`	Yes

Environment variables set in Vercel must be mirrored in .env.local. Do not rely solely on the Vercel dashboard — it has no export API for secret values.

Backup

Data	Backup
Source code	Git repo (GitHub + local clones)
Build artifacts	Vercel stores last N deployments — available for instant rollback
Environment variables	`.env.local` on local workstation
Project config (`vercel.json`)	Git repo

Restore

Scenario 1: Bad deployment — rollback in Vercel:

# Via CLI
vercel rollback [deployment-url]
 
# Via dashboard: Vercel → Project → Deployments → select target → Promote to Production

Scenario 2: Project accidentally deleted or Vercel account lost:

Create new Vercel project, link to GitHub repo.
Re-add all environment variables from .env.local.
Push to main to trigger first deployment.

Recovery time: < 10 minutes from code if env vars are ready.

Healthcheck / Monitoring

vercel-exporter (port :9202) scrapes https://api.vercel.com/v6/deployments for the last 20 deployments every 5 minutes. Exposes:

vercel_deployment_state — gauge by project + deployment URL
vercel_deployments_total — count by project + state

Prometheus rule VercelDeploymentFailed alerts if any production deployment enters ERROR state.
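
The aggregation behind these metrics can be sketched as follows. This is not the exporter's actual code. It assumes each deployment object in the API response carries `name`, `url`, `state`, and `target` fields, and the function names are invented for illustration.

```python
# Sketch only: field names (name, url, state, target) are assumptions about
# the deployment objects returned by the Vercel API.
from collections import Counter


def count_by_state(deployments: list[dict]) -> Counter:
    """Count deployments per (project, state), like vercel_deployments_total."""
    return Counter((d["name"], d["state"]) for d in deployments)


def failed_production(deployments: list[dict]) -> list[str]:
    """URLs of production deployments currently in ERROR state."""
    return [d["url"] for d in deployments
            if d.get("target") == "production" and d.get("state") == "ERROR"]


if __name__ == "__main__":
    sample = [
        {"name": "et-operational-platform", "url": "a.vercel.app",
         "state": "READY", "target": "production"},
        {"name": "et-operational-platform", "url": "b.vercel.app",
         "state": "ERROR", "target": None},
    ]
    print(count_by_state(sample))
    print(failed_production(sample))
```

Preview deployments in ERROR state are counted but do not appear in `failed_production`, which mirrors the alert's focus on production.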

Manual equivalent of a blackbox probe against the production health endpoint:

curl -s -o /dev/null -w "%{http_code}\n" https://et-operational-platform.vercel.app/api/health
# Expected: 200

Password / Credential Rotation

Credential	Tracked entry	Rotation frequency
Vercel account password	Password manager	180d
`VERCEL_TOKEN` (API token)	`dev_r_services` — `vercel`	90d

Last rotated: 2026-05-08.

To rotate VERCEL_TOKEN:

Vercel dashboard → Settings → Tokens → Create new token (full access or scoped as needed).
Update in:
- gh secret set VERCEL_TOKEN -b "<new>" -R radieu/p24-infra
- gh secret set VERCEL_TOKEN -b "<new>" -R radieu/et-operational-platform
- .env on vps-i1 (vercel-exporter reads this)
- .env.local on local workstation
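Recreate vercel-exporter so it picks up the new token: docker compose up -d --force-recreate --no-deps vercel-exporter on vps-i1 (a plain docker compose restart keeps the old container environment).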
Delete old token in Vercel dashboard.
Log rotation in docs/secrets-rotation-log.md.

Wasabi S3

Architecture

Wasabi S3 provides long-term object storage across two regions:

Prometheus metrics — Thanos sidecar uploads 2h TSDB blocks from vps-i1 continuously to s3://ecotrans-monitoring/ (eu-central-1)
Traccar DB backups — nightly mysqldump uploaded to s3://ecotrans-monitoring/traccar/ (eu-central-1)
Grafana volume backups — nightly grafana_data tar uploaded to s3://p24-infra/grafana/ (eu-central-2, via backup-ionos GH Action)
Supabase backup metrics — backup-exporter reads backups/supabase/metrics/backup-status.prom from s3://p24-infra/ (eu-central-2) to expose backup freshness to Prometheus

vps-i1
├── thanos-sidecar ──────────────────────────────► s3://ecotrans-monitoring/         (eu-central-1)
│   (continuous, 2h blocks, prometheus metrics)
├── backup script (nightly) ────────────────────► s3://ecotrans-monitoring/traccar/  (eu-central-1)
│   (Traccar mysqldump)
├── GH Action grafana-backup.yml (nightly) ─────► s3://p24-infra/grafana/            (eu-central-2)
│   (grafana_data volume tar.gz)
└── backup-exporter (reads) ─────────────────────► s3://p24-infra/backups/supabase/metrics/  (eu-central-2)
    (backup-status.prom written by supabase-backup GHA workflow)
 
GitHub Actions (supabase-backup workflow)
└── writes backup-status.prom ──────────────────► s3://p24-infra/backups/supabase/metrics/  (eu-central-2)

Buckets:

Bucket	Region	Endpoint	Purpose	IAM key used
`ecotrans-monitoring`	eu-central-1	`s3.eu-central-1.wasabisys.com`	Production: Thanos metrics + Traccar backups	`WASABI_ACCESS_KEY`
`ecotrans-monitoring-test`	eu-central-1	`s3.eu-central-1.wasabisys.com`	Testing only — never use for production data	`WASABI_ACCESS_KEY`
`p24-infra`	eu-central-2	`s3.eu-central-2.wasabisys.com`	Grafana backups + Supabase backup metrics	`P24_INFRA_WASABI_ACCESS_KEY`

IAM users (Wasabi account 100000049371):

IAM user	ARN	Keys stored in	Buckets accessed
`p24-infra`	`arn:aws:iam::100000049371:user/p24-infra`	`P24_INFRA_WASABI_ACCESS_KEY/SECRET_KEY`	`p24-infra` (eu-central-2)
(monitoring user)	—	`WASABI_ACCESS_KEY/SECRET_KEY`	`ecotrans-monitoring` (eu-central-1)

There is no cross-region replication. Loss of eu-central-1 affects long-term Prometheus history. Loss of eu-central-2 affects Grafana backup restore capability and Supabase backup monitoring visibility.

Config Management

File	In repo?	Purpose
`monitoring/thanos/s3.yml`	Yes (template)	Wasabi config for Thanos (eu-central-1) — credentials injected from `.env` at runtime
`monitoring/.env`	No (`.env.example` only)	Contains `WASABI_ACCESS_KEY`, `WASABI_SECRET_KEY`, `P24_INFRA_WASABI_ACCESS_KEY`, `P24_INFRA_WASABI_SECRET_KEY`
GH Secrets	No (GitHub UI)	`WASABI_ACCESS_KEY`, `WASABI_SECRET_KEY`, `P24_INFRA_WASABI_ACCESS_KEY`, `P24_INFRA_WASABI_SECRET_KEY`
`.env.local` (local)	No	`P24_INFRA_WASABI_ACCESS_KEY`, `P24_INFRA_WASABI_SECRET_KEY` (and monitoring keys)

Do not commit credentials. The s3.yml template in the repo contains placeholders resolved at runtime.

Critical: The backup-exporter container uses P24_INFRA_WASABI_ACCESS_KEY / P24_INFRA_WASABI_SECRET_KEY (eu-central-2, bucket p24-infra). Do NOT use the general WASABI_ACCESS_KEY for it — different region, different bucket.
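
One way to avoid mixing up keys and regions is to derive both from the bucket name. The sketch below encodes the bucket table from the Architecture section; `BUCKETS` and `s3_settings` are hypothetical names, and the returned dictionary is just a plain structure, not the configuration format of any particular client.

```python
# Sketch only: maps each bucket to its endpoint and env-var prefix so the
# wrong key pair cannot be combined with the wrong region.
import os
from typing import Mapping

BUCKETS = {
    "ecotrans-monitoring": ("s3.eu-central-1.wasabisys.com", "WASABI"),
    "ecotrans-monitoring-test": ("s3.eu-central-1.wasabisys.com", "WASABI"),
    "p24-infra": ("s3.eu-central-2.wasabisys.com", "P24_INFRA_WASABI"),
}


def s3_settings(bucket: str, env: Mapping[str, str] = os.environ) -> dict:
    """Return endpoint and credentials for a bucket; KeyError if unset."""
    endpoint, prefix = BUCKETS[bucket]
    return {
        "bucket": bucket,
        "endpoint_url": f"https://{endpoint}",
        "access_key": env[f"{prefix}_ACCESS_KEY"],
        "secret_key": env[f"{prefix}_SECRET_KEY"],
    }


if __name__ == "__main__":
    fake_env = {"P24_INFRA_WASABI_ACCESS_KEY": "example-access",
                "P24_INFRA_WASABI_SECRET_KEY": "example-secret"}
    print(s3_settings("p24-infra", fake_env)["endpoint_url"])
```

A missing variable raises `KeyError` instead of silently falling back to the other key pair.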

Backup

Wasabi is the backup target. The bucket itself is not backed up elsewhere. Acceptable risk: Wasabi eu-central-1 availability is the SLA boundary for long-term metrics. If the bucket is lost, Prometheus retains 15 days of local TSDB on vps-i1.

Restore

Restore Prometheus metrics from Wasabi:

# List blocks in the bucket
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  quay.io/thanos/thanos:latest \
  tools bucket ls --objstore.config-file /s3.yml
 
# Download a specific block (for manual inspection)
s3cmd get s3://ecotrans-monitoring/<BLOCK_ULID>/ /tmp/block/ --recursive \
  --host=s3.eu-central-1.wasabisys.com

Thanos Query reads directly from Wasabi in normal operation — no restore needed for query access. Restore is only necessary if rebuilding a local Prometheus from scratch.
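
When picking a block to inspect, it helps to know that TSDB block directory names are ULIDs, whose first 10 characters encode a millisecond timestamp in Crockford base32. The sketch below decodes that timestamp; the function names are invented here. The value is the block's creation time, which is not the same as the time range of the samples inside it.

```python
# Sketch only: decodes the 48-bit millisecond timestamp at the start of a ULID.
from datetime import datetime, timezone

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid: str) -> int:
    """Return the millisecond Unix timestamp encoded in a ULID."""
    if len(ulid) != 26:
        raise ValueError("a ULID has 26 characters")
    value = 0
    for ch in ulid[:10].upper():
        value = value * 32 + CROCKFORD.index(ch)  # ValueError on bad chars
    return value


def block_created_at(ulid: str) -> datetime:
    """Creation time of the block as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ulid_timestamp_ms(ulid) / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Example ULID from the ULID specification, not a real block ID.
    print(block_created_at("01ARYZ6S41TSV4RRFFQ69G5FAV").isoformat())
```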

Restore Traccar backup:

s3cmd get s3://ecotrans-monitoring/traccar/traccar-YYYY-MM-DD.sql.gz /tmp/ \
  --host=s3.eu-central-1.wasabisys.com
gunzip /tmp/traccar-YYYY-MM-DD.sql.gz
mysql -u traccar -p traccar < /tmp/traccar-YYYY-MM-DD.sql

Restore Grafana backup: see docs/grafana-operations.md → Restore → Scenario 2.

Healthcheck / Monitoring

backup-exporter (port :9220) fetches backups/supabase/metrics/backup-status.prom from s3://p24-infra/ (eu-central-2) on each scrape and re-exposes the metrics to Prometheus. Prometheus rule BackupStale fires if the last backup is older than 26 hours. The exporter uses P24_INFRA_WASABI_ACCESS_KEY / P24_INFRA_WASABI_SECRET_KEY.
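
The freshness rule can be sketched in a few lines. The metric name `backup_last_success_timestamp_seconds` is a placeholder chosen for this example, since the actual metric names in `backup-status.prom` are not listed here. The parser also assumes sample lines have no trailing timestamp field.

```python
# Sketch only: METRIC is a hypothetical name; check backup-status.prom for
# the real one. Lines are assumed to be "name{labels} value".
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(hours=26)  # threshold used by the BackupStale rule
METRIC = "backup_last_success_timestamp_seconds"


def last_success(prom_text: str, metric: str = METRIC) -> Optional[datetime]:
    """Return the timestamp stored in the first matching sample, if any."""
    for line in prom_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric:
            return datetime.fromtimestamp(float(value), tz=timezone.utc)
    return None


def is_stale(last: Optional[datetime], now: datetime) -> bool:
    """A missing metric counts as stale, as does one older than 26 hours."""
    return last is None or now - last > STALE_AFTER


if __name__ == "__main__":
    text = 'backup_last_success_timestamp_seconds{job="supabase"} 1700000000\n'
    print(is_stale(last_success(text), datetime.now(timezone.utc)))
```

The 26-hour threshold leaves a two-hour margin on top of a nightly schedule, so one late run does not fire the alert.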

Manual freshness check for Thanos/ecotrans-monitoring:

s3cmd ls s3://ecotrans-monitoring/ --host=s3.eu-central-1.wasabisys.com | tail -5
# Verify recent timestamps
 
docker compose logs thanos-sidecar | tail -20
# Look for: "uploaded block" or errors

Manual freshness check for p24-infra bucket (Supabase backup metrics):

s3cmd ls s3://p24-infra/backups/supabase/metrics/ --host=s3.eu-central-2.wasabisys.com
# Should show a recently updated backup-status.prom

Password / Credential Rotation

Credential	IAM user	Tracked entry	Rotation frequency	Last rotated
`WASABI_ACCESS_KEY` + `WASABI_SECRET_KEY` (monitoring bucket, eu-central-1)	monitoring user	`dev_r_services` — `wasabi-s3`	180d	2026-05-13
`P24_INFRA_WASABI_ACCESS_KEY` + `P24_INFRA_WASABI_SECRET_KEY` (p24-infra bucket, eu-central-2)	`p24-infra`	`dev_r_services` — `wasabi-s3`	90d	2026-06-12
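
The next due date follows directly from the last rotation date and the frequency in the table. A minimal sketch, with `ROTATIONS`, `next_due`, and `is_overdue` as illustrative names:

```python
# Sketch only: dates and frequencies are copied from the table above.
from datetime import date, timedelta

ROTATIONS = {
    "WASABI_ACCESS_KEY": (date(2026, 5, 13), 180),
    "P24_INFRA_WASABI_ACCESS_KEY": (date(2026, 6, 12), 90),
}


def next_due(last_rotated: date, frequency_days: int) -> date:
    """Date by which the credential should be rotated again."""
    return last_rotated + timedelta(days=frequency_days)


def is_overdue(last_rotated: date, frequency_days: int, today: date) -> bool:
    """True once today is past the due date."""
    return today > next_due(last_rotated, frequency_days)


if __name__ == "__main__":
    for name, (last, days) in ROTATIONS.items():
        print(name, next_due(last, days).isoformat())
    # WASABI_ACCESS_KEY 2026-11-09
    # P24_INFRA_WASABI_ACCESS_KEY 2026-09-10
```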

Important: The Wasabi console at console.wasabisys.com has an SSL compatibility issue with some Windows clients. If the console does not load on the local workstation, rotate the p24-infra IAM key from the IONOS VPS via the Wasabi IAM API instead of the web UI.

To rotate P24_INFRA_WASABI_ACCESS_KEY (p24-infra IAM user, eu-central-2):

# Step 1: SSH into vps-i1 (or run from local if SSL is fine)
ssh root@217.154.82.162
 
# Step 2: Create a new key via Wasabi IAM API
# (Requires admin-level Wasabi access key with IAM permissions)
# Use the Wasabi console → IAM → Users → p24-infra → Security credentials → Create access key
# OR via API:
# curl -s -X POST "https://iam.wasabisys.com/" \
#   -H "Authorization: AWS4-HMAC-SHA256 ..." \
#   --data "Action=CreateAccessKey&UserName=p24-infra"
 
# Step 3: Update /opt/p24-infra/monitoring/.env on vps-i1
# P24_INFRA_WASABI_ACCESS_KEY=<new_access_key>
# P24_INFRA_WASABI_SECRET_KEY=<new_secret_key>
 
# Step 4: Recreate backup-exporter (credentials are read at startup;
# a plain "docker compose restart" keeps the old container environment)
cd /opt/p24-infra/monitoring && docker compose up -d --force-recreate --no-deps backup-exporter
 
# Step 5: Verify backup-exporter can read from Wasabi
docker compose logs --tail=20 backup-exporter
curl -s http://localhost:9220/metrics | grep backup_
 
# Step 6: Update GitHub Secrets
gh secret set P24_INFRA_WASABI_ACCESS_KEY -b "<new>" -R radieu/p24-infra
gh secret set P24_INFRA_WASABI_SECRET_KEY -b "<new>" -R radieu/p24-infra
 
# Step 7: Update .env.local on local workstation
# P24_INFRA_WASABI_ACCESS_KEY=<new_access_key>
# P24_INFRA_WASABI_SECRET_KEY=<new_secret_key>
 
# Step 8: Delete the old key from Wasabi console / IAM API
 
# Step 9: Log rotation
# Append to docs/secrets-rotation-log.md and update dev_r_services

To rotate WASABI_ACCESS_KEY (monitoring bucket, eu-central-1):

Wasabi console → Access Keys → Create new key pair.
Update .env on vps-i1 (edit /opt/p24-infra/monitoring/.env):
- WASABI_ACCESS_KEY=<new>
- WASABI_SECRET_KEY=<new>

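Recreate the Thanos sidecar so it starts with the new credentials (a plain docker compose restart keeps the old container environment):

cd /opt/p24-infra/monitoring && docker compose up -d --force-recreate --no-deps thanos-sidecar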

Update GH Secrets:

gh secret set WASABI_ACCESS_KEY -b "<new>" -R radieu/p24-infra
gh secret set WASABI_SECRET_KEY -b "<new>" -R radieu/p24-infra

Update .env.local on local workstation.
Verify Thanos upload resumes: docker compose logs thanos-sidecar | grep -i upload
Delete old access key in Wasabi console once uploads are confirmed.
Log rotation in docs/secrets-rotation-log.md.

Mailgun EU

Architecture

Mailgun EU is the SMTP relay used by Alertmanager to deliver alert emails to radieu@gmail.com. It is a stateless relay — no data is persisted here; it is not a backup target.

Alertmanager (vps-i1:9093)
└── SMTP → smtp.eu.mailgun.org:587 (STARTTLS) → radieu@gmail.com

SMTP config:

Field	Value
Host	`smtp.eu.mailgun.org`
Port	`587`
Encryption	STARTTLS
Auth	Username + password
Sender domain	Configured in Mailgun EU account

Config Management

File	In repo?	Contains secrets?
`monitoring/alertmanager/alertmanager.yml`	Yes	No (credentials via env)
`monitoring/.env`	No	Yes — `SMTP_HOST`, `SMTP_USER`, `SMTP_PASSWORD`

Alertmanager reads SMTP_USER and SMTP_PASSWORD from environment variables injected via .env at container start.

Backup

Not applicable. Mailgun is a stateless SMTP relay. There is no data to back up. Configuration (domain, sending limits) is managed in the Mailgun EU dashboard.

If Mailgun becomes unavailable, the fallback is to switch alertmanager.yml to another SMTP provider and restart alertmanager. No data is lost.

Restore

Alertmanager stops sending email:

Check alertmanager logs: docker compose logs alertmanager | tail -30
Verify SMTP credentials in .env: SMTP_USER, SMTP_PASSWORD

Test SMTP connectivity from vps-i1:

curl --url "smtp://smtp.eu.mailgun.org:587" \
  --ssl-reqd --mail-from sender@domain.com \
  --mail-rcpt radieu@gmail.com \
  --user "${SMTP_USER}:${SMTP_PASSWORD}" \
  -T /dev/null

If credentials are correct but delivery fails, check Mailgun dashboard for account suspension or quota exhaustion.
Hot-reload alertmanager after any config fix: curl -X POST http://localhost:9093/-/reload

Switching to a backup SMTP provider:

Edit monitoring/alertmanager/alertmanager.yml: update smtp_smarthost, smtp_auth_username, smtp_auth_password.
Update .env on vps-i1 with new credentials.
Apply: docker compose up -d --force-recreate --no-deps alertmanager (a /-/reload re-reads alertmanager.yml but does not pick up new values from .env).

Healthcheck / Monitoring

No dedicated Prometheus probe. Alertmanager itself is monitored via the /-/healthy Docker healthcheck and external blackbox probe.

To verify email delivery end-to-end, send a test alert:

# Fire a test alert via the Alertmanager v2 API (the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Manual test alert"}}]'
# Check radieu@gmail.com within 2 minutes
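
Building the payload in code avoids shell quoting mistakes when labels or annotations change. The sketch below only constructs the request and does not send it. `build_test_alert` and `make_request` are invented names, and the endpoint URL is left as a parameter so it can match whatever the running Alertmanager version exposes.

```python
# Sketch only: constructs the alert payload and an unsent HTTP request.
import json
import urllib.request


def build_test_alert(alertname: str = "TestAlert",
                     severity: str = "warning",
                     summary: str = "Manual test alert") -> str:
    """Return the JSON body: a list containing one alert object."""
    alert = {
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": summary},
    }
    return json.dumps([alert])


def make_request(url: str, payload: str) -> urllib.request.Request:
    """Prepare a POST request; pass it to urllib.request.urlopen to send."""
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    print(build_test_alert())
```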

Check delivery rates and bounce reports in the Mailgun EU dashboard.

Password / Credential Rotation

Credential	Tracked entry	Rotation frequency
`SMTP_USER` + `SMTP_PASSWORD`	`dev_r_services` — `mailgun-eu`	365d

To rotate:

Mailgun EU dashboard → Sending → Domain settings → SMTP credentials → Reset password (or create new credential and delete old).

Update .env on vps-i1:

# Edit /opt/p24-infra/monitoring/.env
SMTP_USER=<new_user>
SMTP_PASSWORD=<new_pass>

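Recreate the alertmanager container so it starts with the new environment (a SIGHUP or /-/reload re-reads only alertmanager.yml, not environment variables):
```
docker compose up -d --force-recreate --no-deps alertmanager
```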

Update GH Secrets:

gh secret set SMTP_USER -b "<new_user>" -R radieu/p24-infra
gh secret set SMTP_PASSWORD -b "<new_pass>" -R radieu/p24-infra

Update .env.local on local workstation.
Send test alert to verify delivery (see Healthcheck section).
Log rotation in docs/secrets-rotation-log.md.
