Cloud Services — Operations Workbook
Covers: Cloudflare DNS, GitHub, Vercel, Wasabi S3, Mailgun EU. All are external SaaS dependencies of the p24-infra stack.
Cloudflare DNS
Architecture
Cloudflare manages DNS for zintegrowana.online (zone ID 57cb3d8f24c7cc319fb703394edc7b87, Free plan, DNS-only — no Cloudflare proxy). All infrastructure subdomains follow the pattern {service}.{vps-label}.infra.zintegrowana.online.
zintegrowana.online (Cloudflare Free, DNS-only)
│
├── *.vps-i1.infra.zintegrowana.online → A 217.154.82.162 (IONOS VPS)
├── *.vps-h1.infra.zintegrowana.online → A 72.60.32.61 (Hostinger VPS)
└── n8n-cloud.infra.zintegrowana.online → CNAME p24.app.n8n.cloud
Wildcard A records cover all subdomains on each VPS: adding a new service requires only a Caddy/Traefik config change, no DNS change.
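The naming pattern and wildcard coverage can be sketched as a small lookup. This is only an illustration: the helper names `fqdn` and `expected_ip` are hypothetical and not part of dns-manager.py; the labels and IPs are the ones listed in the diagram.

```python
from typing import Optional

# Illustrative sketch: helper names are hypothetical, not part of dns-manager.py.
BASE = "infra.zintegrowana.online"

# Wildcard A records from the diagram: VPS label -> IP.
WILDCARDS = {
    "vps-i1": "217.154.82.162",  # IONOS VPS
    "vps-h1": "72.60.32.61",     # Hostinger VPS
}


def fqdn(service: str, vps_label: str) -> str:
    """Build {service}.{vps-label}.infra.zintegrowana.online."""
    return f"{service}.{vps_label}.{BASE}"


def expected_ip(hostname: str) -> Optional[str]:
    """Return the IP the matching wildcard record should answer with, or None."""
    for label, ip in WILDCARDS.items():
        if hostname.endswith(f".{label}.{BASE}"):
            return ip
    return None
```

Names outside the two wildcards (for example the n8n-cloud CNAME) return None, which is a reminder that they need their own record.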
DNS manager CLI (any VPS with CF_API_TOKEN + CF_ZONE_ID in env):
python3 /opt/p24-infra/scripts/dns-manager.py list
python3 /opt/p24-infra/scripts/dns-manager.py upsert <name> <ip>
python3 /opt/p24-infra/scripts/dns-manager.py delete <name>
Config Management
| Item | Managed via |
|---|---|
| Wildcard A records | dns-manager.py + Cloudflare API |
| API token CF_API_TOKEN | Cloudflare dashboard → My Profile → API Tokens |
| Scoped token CLOUDFLARE_TOKEN_ZINTEGROWANA | Same, restricted to DNS edit on zintegrowana.online |
| Zone config (TTL, security settings) | Cloudflare dashboard (manual) |
Zone config is minimal (Free plan, DNS-only). Record state is declarative and re-creatable from script. Zone ID is not a secret — committed to CLAUDE.md.
Backup
Export current DNS records at any time:
curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
"https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records?per_page=100" \
| python3 -m json.tool > /tmp/cloudflare-dns-export-$(date +%F).json
Run this before any bulk DNS change. Save the output to docs/backups/ or upload to Wasabi. The two wildcard records are documented in CLAUDE.md and are trivial to re-create manually.
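The curl export requests a single page of up to 100 records, which is enough for the current zone. If the zone grows, a paginated export along these lines may be safer. This is a sketch, assuming the `result_info.total_pages` field that Cloudflare v4 list responses carry; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Callable

API = "https://api.cloudflare.com/client/v4"


def fetch_page(zone_id: str, token: str, page: int, per_page: int = 100) -> dict:
    """Fetch one page of DNS records from the Cloudflare API."""
    url = f"{API}/zones/{zone_id}/dns_records?per_page={per_page}&page={page}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def export_all_records(get_page: Callable[[int], dict]) -> list:
    """Collect records across all pages, stopping at result_info.total_pages."""
    records, page = [], 1
    while True:
        body = get_page(page)
        records.extend(body.get("result", []))
        # Assumption: a missing result_info means a single-page response.
        total_pages = body.get("result_info", {}).get("total_pages", 1)
        if page >= total_pages:
            return records
        page += 1
```

Usage, with CF_ZONE_ID and CF_API_TOKEN read from the environment: `export_all_records(lambda p: fetch_page(zone_id, token, p))`, then write the list out with `json.dump`.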
Restore
Records lost, zone still exists:
# Re-add wildcard records
python3 /opt/p24-infra/scripts/dns-manager.py upsert "*.vps-i1.infra.zintegrowana.online" 217.154.82.162
python3 /opt/p24-infra/scripts/dns-manager.py upsert "*.vps-h1.infra.zintegrowana.online" 72.60.32.61
Recovery time: < 5 minutes. DNS propagation via Cloudflare is near-instant (< 60s TTL).
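Because the record state is declarative, the restore commands can be generated from a single table instead of being typed by hand. A minimal sketch, assuming only the `upsert <name> <ip>` form of dns-manager.py shown above; `DESIRED` and `upsert_commands` are hypothetical names.

```python
import shlex

SCRIPT = "/opt/p24-infra/scripts/dns-manager.py"

# Desired wildcard A records, as documented in this workbook.
DESIRED = {
    "*.vps-i1.infra.zintegrowana.online": "217.154.82.162",
    "*.vps-h1.infra.zintegrowana.online": "72.60.32.61",
}


def upsert_commands(desired: dict) -> list:
    """Render one dns-manager.py upsert command per desired record."""
    return [
        # shlex.quote keeps the shell from expanding the leading "*".
        f"python3 {SCRIPT} upsert {shlex.quote(name)} {ip}"
        for name, ip in sorted(desired.items())
    ]
```

Printing the result gives a reviewable list of commands to paste into a shell on any VPS that has CF_API_TOKEN and CF_ZONE_ID set.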
Zone deleted (worst case):
- Re-add zintegrowana.online to Cloudflare via dashboard.
- Update nameservers at registrar to point to Cloudflare nameservers.
- Nameserver propagation: up to 24h.
- Re-apply all DNS records via dns-manager.py.
Healthcheck / Monitoring
No Prometheus alert. Manual check during incident:
dig grafana.vps-i1.infra.zintegrowana.online @1.1.1.1 +short
# Expected: 217.154.82.162
dig n8n.vps-h1.infra.zintegrowana.online @1.1.1.1 +short
# Expected: 72.60.32.61
Blackbox exporter DNS probe can be added to monitoring/prometheus/blackbox.yml if this becomes a pain point.
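Until a blackbox probe exists, the two dig checks can be scripted. The sketch below is an assumption-laden stand-in: `check_dns` is a hypothetical helper, and it uses the system resolver via `socket.gethostbyname` rather than querying 1.1.1.1 directly as dig does.

```python
import socket
from typing import Callable, Dict

EXPECTED = {
    "grafana.vps-i1.infra.zintegrowana.online": "217.154.82.162",
    "n8n.vps-h1.infra.zintegrowana.online": "72.60.32.61",
}


def check_dns(expected: Dict[str, str],
              resolve: Callable[[str], str] = socket.gethostbyname) -> Dict[str, str]:
    """Return {hostname: problem} for every name that fails the check."""
    problems = {}
    for host, want in expected.items():
        try:
            got = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[host] = f"resolution failed: {exc}"
            continue
        if got != want:
            problems[host] = f"expected {want}, got {got}"
    return problems
```

An empty result means both names resolve as expected; `check_dns(EXPECTED)` runs it against live DNS.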
Password / Credential Rotation
| Credential | Tracked entry | Rotation frequency |
|---|---|---|
| Cloudflare account password | Cloudflare dashboard + password manager | 180d |
| CLOUDFLARE_TOKEN_ZINTEGROWANA (DNS-edit scope) | dev_r_services — cloudflare-dns | 180d |
| CF_API_TOKEN (broader scope) | dev_r_services — cloudflare-dns | 180d |
To rotate API token:
- Cloudflare dashboard → My Profile → API Tokens → Delete old token → Create new.
- Update CF_API_TOKEN + CLOUDFLARE_TOKEN_ZINTEGROWANA in:
  - monitoring/.env on vps-i1 (via SSH)
  - monitoring/.env on vps-h1 (via SSH)
  - GitHub Secrets: gh secret set CF_API_TOKEN -b "<new>" -R radieu/p24-infra
  - .env.local on local workstation
- Log rotation in docs/secrets-rotation-log.md.
GitHub
Architecture
GitHub (github.com/radieu) hosts code, CI/CD via GitHub Actions, issue tracking, and PR reviews for both radieu/p24-infra and radieu/et-operational-platform. Self-hosted runners are not used — all Actions run on GitHub-hosted runners.
VPS AI agents (AI-Dev-IO1, AI-Dev-HS1) are collaborators with write access to both repos.
radieu/p24-infra — infra config, monitoring stack, exporters, Ansible
radieu/et-operational-platform — Next.js frontend + backend
Config Management
| Item | Location |
|---|---|
| Repo code + history | Git (distributed — local + VPS clones) |
| GitHub Actions workflows | .github/workflows/ in each repo |
| Actions secrets | GitHub repository Secrets UI (no automated export) |
| Branch protection rules | GitHub UI |
Actions secrets have no export API. The authoritative copy of each secret value is in .env.local on the local workstation. Any new secret added to GitHub must also be added to .env.local.
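Since .env.local is the only copy of the secret values, it helps to check that every secret name known to GitHub also has a value there. A minimal sketch, assuming simple KEY=VALUE lines in .env.local; `parse_env` and `missing_secrets` are hypothetical helpers.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines; ignores blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Strip optional surrounding quotes from the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_secrets(env_text: str, required: list) -> list:
    """Names that are required but absent or empty in the env file."""
    env = parse_env(env_text)
    return [name for name in required if not env.get(name)]
```

The `required` list can be taken from the names printed by `gh secret list -R radieu/p24-infra`, which shows names only, never values.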
Backup
All code is replicated across:
- Local workstation (d:\code_2026\p24-infra)
- IONOS VPS (/opt/p24-infra)
- Hostinger VPS (/opt/p24-infra)
- GitHub itself
Loss of GitHub access does not mean data loss — work from any local clone. Actions secrets are not replicated automatically: the only backup is .env.local on the local workstation.
Restore
GitHub outage: Work offline from local clone. Push when service recovers.
Repo accidentally deleted: Contact GitHub support with account credentials. Recovery window: 90 days (GitHub trash policy). In parallel, push from a local clone to a new repo.
Actions secret lost: Restore from .env.local → GitHub UI or gh secret set.
Healthcheck / Monitoring
No Prometheus probe. GitHub provides its own status page at githubstatus.com.
Manual check if CI is broken:
gh run list --repo radieu/p24-infra --limit 5
Password / Credential Rotation
| Credential | Tracked entry | Rotation frequency |
|---|---|---|
| GitHub account password | Password manager + 2FA | 180d |
| GH_TOKEN (PAT — runner registration + API) | dev_r_services — github | 90d |
| GH_PAT (PAT — health-check workflow) | dev_r_services — github | 90d |
To rotate a PAT:
- GitHub → Settings → Developer settings → Personal access tokens → Generate new token.
- Set same scopes as old token (repo, workflow, write:packages as needed).
- Update in:
  - gh secret set GH_TOKEN -b "<new>" -R radieu/p24-infra
  - gh secret set GH_PAT -b "<new>" -R radieu/p24-infra
  - .env.local on local workstation
- Delete old token in GitHub UI.
- Log rotation in docs/secrets-rotation-log.md.
Vercel
Architecture
Vercel hosts the et-operational-platform Next.js frontend. Deployments are triggered automatically on push to main (production) and on PR branches (preview). The vercel-exporter container on vps-i1 polls the Vercel API every 5 minutes and exposes deployment state as Prometheus metrics.
GitHub push → Vercel build → Production deployment
└── vercel-exporter (port 9202) ─► Prometheus ─► Grafana
Production URL: https://et-operational-platform.vercel.app (plus any custom domain configured in Vercel)
Config Management
| Item | Location | In repo? |
|---|---|---|
| vercel.json | et-operational-platform/ repo root | Yes |
| Environment variables | Vercel dashboard + .env.local on local workstation | Dashboard (not in repo) |
| Project link | .vercel/ directory in repo | Yes |
| vercel-exporter config | monitoring/exporters/vercel-exporter/ | Yes |
Environment variables set in Vercel must be mirrored in .env.local. Do not rely solely on the Vercel dashboard — it has no export API for secret values.
Backup
| Data | Backup |
|---|---|
| Source code | Git repo (GitHub + local clones) |
| Build artifacts | Vercel stores last N deployments — available for instant rollback |
| Environment variables | .env.local on local workstation |
| Project config (vercel.json) | Git repo |
Restore
Scenario 1: Bad deployment — rollback in Vercel:
# Via CLI
vercel rollback [deployment-url]
# Via dashboard: Vercel → Project → Deployments → select target → Promote to Production
Scenario 2: Project accidentally deleted or Vercel account lost:
- Create new Vercel project, link to GitHub repo.
- Re-add all environment variables from .env.local.
- Push to main to trigger first deployment.
Recovery time: < 10 minutes from code if env vars are ready.
Healthcheck / Monitoring
vercel-exporter (port :9202) scrapes https://api.vercel.com/v6/deployments for the last 20 deployments every 5 minutes. Exposes:
- vercel_deployment_state: gauge by project + deployment URL
- vercel_deployments_total: count by project + state
Prometheus rule VercelDeploymentFailed alerts if any production deployment enters ERROR state.
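To show how a deployment list turns into the per-state count, here is a rough sketch. It is not the exporter's actual code: the `name` and `state` field names are assumptions about the API payload, and `deployments_total` and `render_metrics` are hypothetical helpers.

```python
from collections import Counter


def deployments_total(deployments: list) -> Counter:
    """Count deployments per (project, state)."""
    return Counter((d["name"], d["state"]) for d in deployments)


def render_metrics(deployments: list) -> str:
    """Render vercel_deployments_total in Prometheus text exposition format."""
    lines = ["# TYPE vercel_deployments_total gauge"]
    for (project, state), n in sorted(deployments_total(deployments).items()):
        lines.append(
            f'vercel_deployments_total{{project="{project}",state="{state}"}} {n}'
        )
    return "\n".join(lines) + "\n"
```

An alert such as VercelDeploymentFailed would then match on the series whose state label is ERROR.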
Blackbox probe to the production health endpoint:
curl -s https://et-operational-platform.vercel.app/api/health
# Expected: 200 OK
Password / Credential Rotation
| Credential | Tracked entry | Rotation frequency |
|---|---|---|
| Vercel account password | Password manager | 180d |
| VERCEL_TOKEN (API token) | dev_r_services — vercel | 90d |
Last rotated: 2026-05-08.
To rotate VERCEL_TOKEN:
- Vercel dashboard → Settings → Tokens → Create new token (full access or scoped as needed).
- Update in:
  - gh secret set VERCEL_TOKEN -b "<new>" -R radieu/p24-infra
  - gh secret set VERCEL_TOKEN -b "<new>" -R radieu/et-operational-platform
  - .env on vps-i1 (vercel-exporter reads this)
  - .env.local on local workstation
- Restart vercel-exporter: docker compose restart vercel-exporter on vps-i1.
- Delete old token in Vercel dashboard.
- Log rotation in docs/secrets-rotation-log.md.
Wasabi S3
Architecture
Wasabi S3 provides long-term object storage across two regions:
- Prometheus metrics: Thanos sidecar uploads 2h TSDB blocks from vps-i1 continuously to s3://ecotrans-monitoring/ (eu-central-1)
- Traccar DB backups: nightly mysqldump uploaded to s3://ecotrans-monitoring/traccar/ (eu-central-1)
- Grafana volume backups: nightly grafana_data tar uploaded to s3://p24-infra/grafana/ (eu-central-2, via backup-ionos GH Action)
- Supabase backup metrics: backup-exporter reads backups/supabase/metrics/backup-status.prom from s3://p24-infra/ (eu-central-2) to expose backup freshness to Prometheus
vps-i1
├── thanos-sidecar ──────────────────────────────► s3://ecotrans-monitoring/ (eu-central-1)
│ (continuous, 2h blocks, prometheus metrics)
├── backup script (nightly) ────────────────────► s3://ecotrans-monitoring/traccar/ (eu-central-1)
│ (Traccar mysqldump)
├── GH Action grafana-backup.yml (nightly) ─────► s3://p24-infra/grafana/ (eu-central-2)
│ (grafana_data volume tar.gz)
└── backup-exporter (reads) ─────────────────────► s3://p24-infra/backups/supabase/metrics/ (eu-central-2)
(backup-status.prom written by supabase-backup GHA workflow)
GitHub Actions (supabase-backup workflow)
└── writes backup-status.prom ──────────────────► s3://p24-infra/backups/supabase/metrics/ (eu-central-2)
Buckets:
| Bucket | Region | Endpoint | Purpose | IAM key used |
|---|---|---|---|---|
| ecotrans-monitoring | eu-central-1 | s3.eu-central-1.wasabisys.com | Production: Thanos metrics + Traccar backups | WASABI_ACCESS_KEY |
| ecotrans-monitoring-test | eu-central-1 | s3.eu-central-1.wasabisys.com | Testing only (never use for production data) | WASABI_ACCESS_KEY |
| p24-infra | eu-central-2 | s3.eu-central-2.wasabisys.com | Grafana backups + Supabase backup metrics | P24_INFRA_WASABI_ACCESS_KEY |
IAM users (Wasabi account 100000049371):
| IAM user | ARN | Keys stored in | Buckets accessed |
|---|---|---|---|
| p24-infra | arn:aws:iam::100000049371:user/p24-infra | P24_INFRA_WASABI_ACCESS_KEY/SECRET_KEY | p24-infra (eu-central-2) |
| (monitoring user) | — | WASABI_ACCESS_KEY/SECRET_KEY | ecotrans-monitoring (eu-central-1) |
There is no cross-region replication. Loss of eu-central-1 affects long-term Prometheus history. Loss of eu-central-2 affects Grafana backup restore capability and Supabase backup monitoring visibility.
Config Management
| File | In repo? | Purpose |
|---|---|---|
| monitoring/thanos/s3.yml | Yes (template) | Wasabi config for Thanos (eu-central-1); credentials injected from .env at runtime |
| monitoring/.env | No (.env.example only) | Contains WASABI_ACCESS_KEY, WASABI_SECRET_KEY, P24_INFRA_WASABI_ACCESS_KEY, P24_INFRA_WASABI_SECRET_KEY |
| GH Secrets | GH UI | WASABI_ACCESS_KEY, WASABI_SECRET_KEY, P24_INFRA_WASABI_ACCESS_KEY, P24_INFRA_WASABI_SECRET_KEY |
| .env.local (local) | No | P24_INFRA_WASABI_ACCESS_KEY, P24_INFRA_WASABI_SECRET_KEY (and monitoring keys) |
Do not commit credentials. The s3.yml template in the repo contains placeholders resolved at runtime.
Critical: The backup-exporter container uses P24_INFRA_WASABI_ACCESS_KEY / P24_INFRA_WASABI_SECRET_KEY (eu-central-2, bucket p24-infra). Do NOT use the general WASABI_ACCESS_KEY for it — different region, different bucket.
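One way to avoid mixing up the two key pairs is to derive region, endpoint, and credentials from the bucket name in a single place. A sketch under that assumption; `BUCKETS` and `s3_settings` are hypothetical names, and the values come from the Buckets table.

```python
import os

# Bucket -> (region, endpoint, env var prefix), per the Buckets table.
BUCKETS = {
    "ecotrans-monitoring": ("eu-central-1", "s3.eu-central-1.wasabisys.com", "WASABI"),
    "ecotrans-monitoring-test": ("eu-central-1", "s3.eu-central-1.wasabisys.com", "WASABI"),
    "p24-infra": ("eu-central-2", "s3.eu-central-2.wasabisys.com", "P24_INFRA_WASABI"),
}


def s3_settings(bucket: str, env=os.environ) -> dict:
    """Return connection settings for a bucket; raises KeyError if unknown."""
    region, endpoint, prefix = BUCKETS[bucket]
    return {
        "bucket": bucket,
        "region": region,
        "endpoint_url": f"https://{endpoint}",
        "access_key": env[f"{prefix}_ACCESS_KEY"],
        "secret_key": env[f"{prefix}_SECRET_KEY"],
    }
```

With this shape, a client for p24-infra can only ever be built with the P24_INFRA_WASABI_* pair, and an unknown bucket fails loudly.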
Backup
Wasabi is the backup target. The bucket itself is not backed up elsewhere. Acceptable risk: Wasabi eu-central-1 availability is the SLA boundary for long-term metrics. If the bucket is lost, Prometheus retains 15 days of local TSDB on vps-i1.
Restore
Restore Prometheus metrics from Wasabi:
# List blocks in the bucket
docker run --rm \
-v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
quay.io/thanos/thanos:latest \
tools bucket ls --objstore.config-file /s3.yml
# Download a specific block (for manual inspection)
s3cmd get s3://ecotrans-monitoring/<BLOCK_ULID>/ /tmp/block/ --recursive \
--host=s3.eu-central-1.wasabisys.com
Thanos Query reads directly from Wasabi in normal operation, so no restore is needed for query access. Restore is only necessary if rebuilding a local Prometheus from scratch.
Restore Traccar backup:
s3cmd get s3://ecotrans-monitoring/traccar/traccar-YYYY-MM-DD.sql.gz /tmp/ \
--host=s3.eu-central-1.wasabisys.com
gunzip /tmp/traccar-YYYY-MM-DD.sql.gz
mysql -u traccar -p traccar < /tmp/traccar-YYYY-MM-DD.sql
Restore Grafana backup: see docs/grafana-operations.md → Restore → Scenario 2.
Healthcheck / Monitoring
backup-exporter (port :9220) fetches backups/supabase/metrics/backup-status.prom from s3://p24-infra/ (eu-central-2) on each scrape and re-exposes the metrics to Prometheus. Prometheus rule BackupStale fires if the last backup is older than 26 hours. The exporter uses P24_INFRA_WASABI_ACCESS_KEY / P24_INFRA_WASABI_SECRET_KEY.
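The 26-hour threshold amounts to a nightly schedule plus two hours of slack. The check can be sketched as below, assuming the status file carries the last backup time as a Unix timestamp (the actual metric name is not documented here); `is_stale` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=26)  # threshold stated for BackupStale


def is_stale(last_backup_ts: float, now: datetime,
             max_age: timedelta = MAX_AGE) -> bool:
    """True if the last backup finished more than max_age before now."""
    last = datetime.fromtimestamp(last_backup_ts, tz=timezone.utc)
    return now - last > max_age
```

A backup that ran 25 hours ago is still considered fresh; one that ran 27 hours ago is stale.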
Manual freshness check for Thanos/ecotrans-monitoring:
s3cmd ls s3://ecotrans-monitoring/ --host=s3.eu-central-1.wasabisys.com | tail -5
# Verify recent timestamps
docker compose logs thanos-sidecar | tail -20
# Look for: "uploaded block" or errors
Manual freshness check for p24-infra bucket (Supabase backup metrics):
s3cmd ls s3://p24-infra/backups/supabase/metrics/ --host=s3.eu-central-2.wasabisys.com
# Should show a recently updated backup-status.prom
Password / Credential Rotation
| Credential | IAM user | Tracked entry | Rotation frequency | Last rotated |
|---|---|---|---|---|
| WASABI_ACCESS_KEY + WASABI_SECRET_KEY (monitoring bucket, eu-central-1) | monitoring user | dev_r_services — wasabi-s3 | 180d | 2026-05-13 |
| P24_INFRA_WASABI_ACCESS_KEY + P24_INFRA_WASABI_SECRET_KEY (p24-infra bucket, eu-central-2) | p24-infra | dev_r_services — wasabi-s3 | 90d | 2026-06-12 |
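The next due date follows directly from the "Last rotated" and "Rotation frequency" columns. A small sketch of that arithmetic; the function names are hypothetical.

```python
from datetime import date, timedelta


def next_rotation(last_rotated: date, frequency_days: int) -> date:
    """Date on which the credential is next due for rotation."""
    return last_rotated + timedelta(days=frequency_days)


def is_overdue(last_rotated: date, frequency_days: int, today: date) -> bool:
    """True once today is past the due date."""
    return today > next_rotation(last_rotated, frequency_days)
```

For the rows above, `next_rotation(date(2026, 5, 13), 180)` gives 2026-11-09 for the monitoring keys and `next_rotation(date(2026, 6, 12), 90)` gives 2026-09-10 for the p24-infra keys.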
Important: When rotating the p24-infra IAM key from the IONOS VPS, use the Wasabi IAM API rather than the console web UI. The Wasabi console at console.wasabisys.com has an SSL compatibility issue with some Windows clients; running the rotation from the VPS avoids that problem.
To rotate P24_INFRA_WASABI_ACCESS_KEY (p24-infra IAM user, eu-central-2):
# Step 1: SSH into vps-i1 (or run from local if SSL is fine)
ssh root@217.154.82.162
# Step 2: Create a new key via Wasabi IAM API
# (Requires admin-level Wasabi access key with IAM permissions)
# Use the Wasabi console → IAM → Users → p24-infra → Security credentials → Create access key
# OR via API:
# curl -s -X POST "https://iam.wasabisys.com/" \
# -H "Authorization: AWS4-HMAC-SHA256 ..." \
# --data "Action=CreateAccessKey&UserName=p24-infra"
# Step 3: Update /opt/p24-infra/monitoring/.env on vps-i1
# P24_INFRA_WASABI_ACCESS_KEY=<new_access_key>
# P24_INFRA_WASABI_SECRET_KEY=<new_secret_key>
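# Step 4: Restart backup-exporter (it reads credentials at startup)
# Note: "docker compose restart" reuses the existing container's environment. If the new
# keys are not picked up, recreate it: docker compose up -d --force-recreate backup-exporter
cd /opt/p24-infra/monitoring && docker compose restart backup-exporter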
# Step 5: Verify backup-exporter can read from Wasabi
docker compose logs --tail=20 backup-exporter
curl -s http://localhost:9220/metrics | grep backup_
# Step 6: Update GitHub Secrets
gh secret set P24_INFRA_WASABI_ACCESS_KEY -b "<new>" -R radieu/p24-infra
gh secret set P24_INFRA_WASABI_SECRET_KEY -b "<new>" -R radieu/p24-infra
# Step 7: Update .env.local on local workstation
# P24_INFRA_WASABI_ACCESS_KEY=<new_access_key>
# P24_INFRA_WASABI_SECRET_KEY=<new_secret_key>
# Step 8: Delete the old key from Wasabi console / IAM API
# Step 9: Log rotation
# Append to docs/secrets-rotation-log.md and update dev_r_services
To rotate WASABI_ACCESS_KEY (monitoring bucket, eu-central-1):
- Wasabi console → Access Keys → Create new key pair.
- Update .env on vps-i1 (edit /opt/p24-infra/monitoring/.env):
  WASABI_ACCESS_KEY=<new>
  WASABI_SECRET_KEY=<new>
- Restart Thanos sidecar (it reads credentials at startup):
  cd /opt/p24-infra/monitoring && docker compose restart thanos-sidecar
- Update GH Secrets:
  gh secret set WASABI_ACCESS_KEY -b "<new>" -R radieu/p24-infra
  gh secret set WASABI_SECRET_KEY -b "<new>" -R radieu/p24-infra
- Update .env.local on local workstation.
- Delete old access key in Wasabi console.
- Verify Thanos upload resumes:
  docker compose logs thanos-sidecar | grep -i upload
- Log rotation in docs/secrets-rotation-log.md.
Mailgun EU
Architecture
Mailgun EU is the SMTP relay used by Alertmanager to deliver alert emails to radieu@gmail.com. It is a stateless relay — no data is persisted here; it is not a backup target.
Alertmanager (vps-i1:9093)
└── SMTP → smtp.eu.mailgun.org:587 (STARTTLS) → radieu@gmail.com
SMTP config:
| Field | Value |
|---|---|
| Host | smtp.eu.mailgun.org |
| Port | 587 |
| Encryption | STARTTLS |
| Auth | Username + password |
| Sender domain | Configured in Mailgun EU account |
Config Management
| File | In repo? | Contains secrets? |
|---|---|---|
| monitoring/alertmanager/alertmanager.yml | Yes | No (credentials via env) |
| monitoring/.env | No | Yes: SMTP_HOST, SMTP_USER, SMTP_PASSWORD |
Alertmanager reads SMTP_USER and SMTP_PASSWORD from environment variables injected via .env at container start.
Backup
Not applicable. Mailgun is a stateless SMTP relay. There is no data to back up. Configuration (domain, sending limits) is managed in the Mailgun EU dashboard.
If Mailgun becomes unavailable, the fallback is to switch alertmanager.yml to another SMTP provider and restart alertmanager. No data is lost.
Restore
Alertmanager stops sending email:
- Check alertmanager logs:
  docker compose logs alertmanager | tail -30
- Verify SMTP credentials in .env: SMTP_USER, SMTP_PASSWORD
- Test SMTP connectivity from vps-i1:
  curl --url "smtp://smtp.eu.mailgun.org:587" \
    --ssl-reqd --mail-from sender@domain.com \
    --mail-rcpt radieu@gmail.com \
    --user "${SMTP_USER}:${SMTP_PASSWORD}" \
    -T /dev/null
- If credentials are correct but delivery fails, check Mailgun dashboard for account suspension or quota exhaustion.
- Hot-reload alertmanager after any config fix:
  curl -X POST http://localhost:9093/-/reload
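The credential check in the steps above can also be done without sending a message. The sketch below only performs STARTTLS and login using the standard-library smtplib; `check_smtp` and its `smtp_factory` parameter are hypothetical names, the latter added so the function can be exercised without network access.

```python
import smtplib


def check_smtp(user: str, password: str,
               host: str = "smtp.eu.mailgun.org", port: int = 587,
               smtp_factory=smtplib.SMTP) -> bool:
    """Return True if STARTTLS and login both succeed, False otherwise."""
    try:
        with smtp_factory(host, port, timeout=15) as client:
            client.starttls()            # upgrade the connection before auth
            client.login(user, password)
        return True
    except (smtplib.SMTPException, OSError):
        # Covers auth failures, TLS problems, timeouts, and refused connections.
        return False
```

On vps-i1 this would be called with the SMTP_USER and SMTP_PASSWORD values from monitoring/.env; a False result points at credentials or connectivity rather than at Alertmanager itself.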
Switching to a backup SMTP provider:
- Edit monitoring/alertmanager/alertmanager.yml: update smtp_smarthost, smtp_auth_username, smtp_auth_password.
- Update .env on vps-i1 with new credentials.
- Reload:
  curl -X POST http://localhost:9093/-/reload
Healthcheck / Monitoring
No dedicated Prometheus probe. Alertmanager itself is monitored via the /-/healthy Docker healthcheck and external blackbox probe.
To verify email delivery end-to-end, send a test alert:
# Fire a test alert via the Alertmanager v2 API (the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Manual test alert"}}]'
# Check radieu@gmail.com within 2 minutes
Check delivery rates and bounce reports in the Mailgun EU dashboard.
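The same test alert can be fired from a script, which is convenient after a credential rotation. A minimal sketch using only the standard library; `build_test_alert` and `post_alert` are hypothetical helpers, and the payload mirrors the JSON in the curl example.

```python
import json
import urllib.request


def build_test_alert(alertname: str = "TestAlert", severity: str = "warning",
                     summary: str = "Manual test alert") -> bytes:
    """Build the JSON body: a list with one alert (labels + annotations)."""
    payload = [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": summary},
    }]
    return json.dumps(payload).encode("utf-8")


def post_alert(url: str, body: bytes) -> int:
    """POST the alert body to an Alertmanager alerts endpoint; returns HTTP status."""
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Run on vps-i1 as `post_alert("http://localhost:9093/api/v2/alerts", build_test_alert())`; Alertmanager 0.27 and later no longer serve the /api/v1 path.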
Password / Credential Rotation
| Credential | Tracked entry | Rotation frequency |
|---|---|---|
| SMTP_USER + SMTP_PASSWORD | dev_r_services — mailgun-eu | 365d |
To rotate:
- Mailgun EU dashboard → Sending → Domain settings → SMTP credentials → Reset password (or create new credential and delete old).
- Update .env on vps-i1:
  # Edit /opt/p24-infra/monitoring/.env
  SMTP_USER=<new_user>
  SMTP_PASSWORD=<new_pass>
- Restart alertmanager so it picks up the new values. SIGHUP and /-/reload re-read alertmanager.yml only, not the container environment. If the restart does not pick up the change, recreate the container with docker compose up -d --force-recreate alertmanager.
  docker compose restart alertmanager
- Update GH Secrets:
  gh secret set SMTP_USER -b "<new_user>" -R radieu/p24-infra
  gh secret set SMTP_PASSWORD -b "<new_pass>" -R radieu/p24-infra
- Update .env.local on local workstation.
- Send test alert to verify delivery (see Healthcheck section).
- Log rotation in docs/secrets-rotation-log.md.