vps-i1 — Operations Workbook
IONOS VPS (vps-i1). Primary infrastructure host: monitoring stack, Traccar GPS, OpenClaw WhatsApp gateway, GitHub Actions runner.
Architecture
IONOS VPS (217.154.82.162) AlmaLinux 9.7
CPU: AMD EPYC-Milan, 6 vCPUs
RAM: 8 GB
Role: monitoring host, GPS tracking, WhatsApp gateway, GH Actions runner
Caddy (443/80) ─── TLS termination for all public endpoints
│
├── monitoring-prometheus-1 :9090 (127.0.0.1 only)
├── monitoring-thanos-sidecar :10901 (uploads blocks → Wasabi)
├── monitoring-thanos-query :10904 (unified PromQL)
├── monitoring-alertmanager-1 :9093 (127.0.0.1 only)
├── monitoring-loki-1 :3100 (127.0.0.1 only)
├── monitoring-promtail-1 (log shipper, no public port)
├── monitoring-grafana-1 :3000 (127.0.0.1 only)
├── monitoring-blackbox-exporter :9115
├── monitoring-queue-exporter :9200
├── monitoring-pg-stats-exporter :9201
├── monitoring-cost-exporter :9210
├── monitoring-vercel-exporter :9202
├── monitoring-backup-exporter :9220
├── node_exporter :9100 (host network)
├── openclaw-openclaw-gateway-1 :18789-18790 (proxied via Caddy)
├── traccar :8082 (web), 5027/UDP (GPS)
├── traccar-db (MySQL, internal)
└── status.vps-i1 (Uptime Kuma) (proxied via Caddy)
Compose files:
| Stack | File on server | File in repo |
|---|---|---|
| Monitoring | /opt/p24-infra/monitoring/docker-compose.yml | monitoring/docker-compose.yml |
| OpenClaw | /opt/p24-infra/openclaw/docker-compose.yml | openclaw/docker-compose.yml |
| Traccar | /opt/traccar/docker-compose.yml | not tracked |
Public URLs:
| Service | URL | Auth |
|---|---|---|
| Grafana | https://grafana.vps-i1.infra.zintegrowana.online | Grafana login |
| Prometheus | https://prometheus.vps-i1.infra.zintegrowana.online | basic_auth (admin / GRAFANA_ADMIN_PASSWORD) |
| Alertmanager | https://alertmanager.vps-i1.infra.zintegrowana.online | basic_auth (admin / GRAFANA_ADMIN_PASSWORD) |
| OpenClaw | https://openclaw.vps-i1.infra.zintegrowana.online | API key |
| Traccar | https://traccar.vps-i1.infra.zintegrowana.online | Traccar login |
| Status | https://status.vps-i1.infra.zintegrowana.online | Kuma login |
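All six public endpoints in the table share one hostname pattern, so the list can be derived rather than retyped in scripts. A minimal sketch, assuming the pattern holds for every service; `BASE_DOMAIN`, `SERVICES`, and `public_url` are illustrative names and not part of the repo.

```python
# Illustrative helper: derive the public URLs listed in the table.
BASE_DOMAIN = "vps-i1.infra.zintegrowana.online"
SERVICES = ["grafana", "prometheus", "alertmanager", "openclaw", "traccar", "status"]


def public_url(service: str) -> str:
    """Return the HTTPS URL that Caddy serves for a service subdomain."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    return f"https://{service}.{BASE_DOMAIN}"


if __name__ == "__main__":
    for name in SERVICES:
        print(public_url(name))
```

Rejecting unknown names keeps a typo from silently producing a URL that Caddy does not serve.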
SSH Access
| User | Key | Scope |
|---|---|---|
| root | C:\Users\konar\.ssh\id_ed25519 (local workstation) | Full admin |
| claude-admin | GH Secret VPS_SSH_PRIVATE_KEY (ed25519) | Passwordless sudo: docker, systemctl, mkdir, chown, cp, tee |
# Python paramiko: non-interactive SSH from Windows
import paramiko

client = paramiko.SSHClient()
# AutoAddPolicy accepts unknown host keys without verification
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("217.154.82.162", port=22, username="root",
               key_filename=r"C:\Users\konar\.ssh\id_ed25519", timeout=15)
stdin, stdout, stderr = client.exec_command("docker compose -f /opt/p24-infra/monitoring/docker-compose.yml ps")
print(stdout.read().decode())
client.close()  # release the SSH session

# Direct SSH
ssh -i C:\Users\konar\.ssh\id_ed25519 root@217.154.82.162
Container Overview
| Container | Purpose | Ports | Stack |
|---|---|---|---|
| monitoring-caddy-1 | TLS reverse proxy | 80, 443 | monitoring |
| monitoring-grafana-1 | Dashboards | 127.0.0.1:3000 | monitoring |
| monitoring-prometheus-1 | Metrics collection, 15d TSDB | 127.0.0.1:9090 | monitoring |
| monitoring-thanos-sidecar | Uploads TSDB blocks → Wasabi | 10901 | monitoring |
| monitoring-thanos-query | Unified PromQL (local + Wasabi) | 10904 | monitoring |
| monitoring-alertmanager-1 | Alert routing → email | 127.0.0.1:9093 | monitoring |
| monitoring-loki-1 | Log aggregation (14d retention) | 127.0.0.1:3100 | monitoring |
| monitoring-promtail-1 | Ships Docker logs → Loki | — | monitoring |
| monitoring-blackbox-exporter | HTTP/S probes | 9115 | monitoring |
| monitoring-queue-exporter | Supabase queue depths → Prometheus | 9200 | monitoring |
| monitoring-pg-stats-exporter | Supabase slow-query metrics → Prometheus | 9201 | monitoring |
| monitoring-cost-exporter | Vercel/Supabase/Wasabi spend → Prometheus | 9210 | monitoring |
| monitoring-vercel-exporter | Vercel deployment state → Prometheus | 9202 | monitoring |
| monitoring-backup-exporter | Backup freshness → Prometheus | 9220 | monitoring |
| node_exporter | Host OS metrics | 9100 (host net) | systemd |
| openclaw-openclaw-gateway-1 | WhatsApp gateway | 18789-18790 | openclaw |
| traccar | GPS tracking web UI | 8082 | traccar |
| traccar-db | MySQL for Traccar | internal | traccar |
Config Management
| File / Directory | In repo? | Notes |
|---|---|---|
| monitoring/docker-compose.yml | Yes | Full stack definition |
| monitoring/prometheus/ | Yes | Scrape config + alert rules |
| monitoring/alertmanager/alertmanager.yml | Yes | Alert routing |
| monitoring/Caddyfile | Yes | Reverse proxy + TLS |
| monitoring/.env | No (.env.example) | Secrets — on server only |
| openclaw/docker-compose.yml | Yes | OpenClaw stack |
| ansible/playbooks/provision-new-vps.yml | Yes | Full server provisioning |
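Which reload applies after a `git pull` depends on which tracked file changed. The sketch below maps changed repo paths to the reload commands used in this workbook. `RELOAD_RULES` and `reload_commands` are hypothetical names, and running `docker compose up -d` after a compose-file change is an assumption rather than a documented step. The Prometheus `/-/reload` endpoint only responds when Prometheus runs with `--web.enable-lifecycle`.

```python
# Hypothetical mapping from changed repo paths to reload commands.
# Commands are meant to be run from /opt/p24-infra on vps-i1.
RELOAD_RULES = [
    ("monitoring/prometheus/", "curl -X POST http://localhost:9090/-/reload"),
    ("monitoring/alertmanager/", "curl -X POST http://localhost:9093/-/reload"),
    ("monitoring/Caddyfile",
     "docker compose -f monitoring/docker-compose.yml exec caddy "
     "caddy reload --config /etc/caddy/Caddyfile"),
    # Assumption: a changed compose file is applied with `up -d`.
    ("monitoring/docker-compose.yml",
     "docker compose -f monitoring/docker-compose.yml up -d"),
    ("openclaw/docker-compose.yml",
     "docker compose -f openclaw/docker-compose.yml up -d"),
]


def reload_commands(changed_paths):
    """Return the reload commands, in rule order, that the changed paths require."""
    commands = []
    for prefix, command in RELOAD_RULES:
        if any(path.startswith(prefix) for path in changed_paths):
            commands.append(command)
    return commands


if __name__ == "__main__":
    # Example input, e.g. from `git diff --name-only HEAD@{1} HEAD` after a pull.
    for cmd in reload_commands(["monitoring/Caddyfile"]):
        print(cmd)
```

Printing the commands instead of executing them keeps the operator in control of what actually runs on the host.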
Apply config change
# On vps-i1 — pull latest and reload affected service
cd /opt/p24-infra
git pull
# Hot-reload Prometheus rules (no restart)
curl -X POST http://localhost:9090/-/reload
# Hot-reload Alertmanager
curl -X POST http://localhost:9093/-/reload
# Caddy config reload (no restart)
docker compose -f monitoring/docker-compose.yml exec caddy caddy reload --config /etc/caddy/Caddyfile
# Full restart of a service
cd /opt/p24-infra/monitoring
docker compose restart <service>
Backup
| Data | Method | Schedule | Destination |
|---|---|---|---|
| Prometheus TSDB | Thanos sidecar continuous upload | Every 2h (block upload) | s3://ecotrans-monitoring/ (Wasabi eu-central-1) |
| Config + rules | Git push | On every commit | GitHub radieu/p24-infra |
| SSH root key | GH Secret VPS_ROOT_SSH_KEY (base64-encoded) | Manual, on rotation | GitHub Secrets |
| Traccar DB | Backup script (if configured) | See traccar-operations.md | Wasabi |
| OS-level config | Not backed up — rebuild from Ansible | — | Ansible playbook in repo |
| Docker volumes (stateless) | N/A — ephemeral by design | — | — |
| Caddy TLS certs (caddy_data) | Not backed up — auto-renewed | — | Re-provisioned on restart |
| Alertmanager silences | Not backed up | — | Ephemeral — acceptable gap |
Server rebuild source of truth: ansible/playbooks/provision-new-vps.yml — provisions full OS baseline, installs Docker, sets up claude-admin, deploys systemd units.
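The backup table gives an expected cadence for Thanos block uploads (every 2h). The sketch below shows how a freshness check might flag a stalled upload, assuming the caller already knows the timestamp of the newest object in the bucket. The grace factor of 2 and the function name are assumptions; this is not the logic of monitoring-backup-exporter.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

UPLOAD_INTERVAL = timedelta(hours=2)  # cadence stated in the backup table
GRACE_FACTOR = 2                      # assumption: tolerate one missed upload


def is_backup_stale(last_upload: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when the newest upload is older than the allowed window."""
    now = now or datetime.now(timezone.utc)
    return now - last_upload > UPLOAD_INTERVAL * GRACE_FACTOR


if __name__ == "__main__":
    three_hours_ago = datetime.now(timezone.utc) - timedelta(hours=3)
    print(is_backup_stale(three_hours_ago))
```

Both timestamps should be timezone-aware; mixing naive and aware datetimes raises a `TypeError` in Python.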
Restore
Scenario 1: Service crash (server intact)
# Monitoring stack
cd /opt/p24-infra/monitoring
docker compose up -d
# OpenClaw
cd /opt/p24-infra/openclaw
docker compose up -d
Scenario 2: Full server rebuild
# 1. Provision new IONOS VPS (same or replacement IP)
# 2. Run Ansible playbook from local workstation
ansible-playbook ansible/playbooks/provision-new-vps.yml -i <new-ip>,
# 3. Clone repo on new server
ssh root@<new-ip> "git clone https://github.com/radieu/p24-infra /opt/p24-infra"
# 4. Restore .env files from local .env.local
scp -i C:\Users\konar\.ssh\id_ed25519 monitoring/.env root@<new-ip>:/opt/p24-infra/monitoring/.env
# 5. Start stacks
ssh root@<new-ip> "cd /opt/p24-infra/monitoring && docker compose up -d"
ssh root@<new-ip> "cd /opt/p24-infra/openclaw && docker compose up -d"
# 6. Restore Prometheus TSDB from Wasabi (if needed — see monitoring-stack-operations.md)
Estimated RTO: ~30 minutes for full service restore (Ansible ~10min, stacks up ~5min, Prometheus data pull optional).
Scenario 3: Prometheus data loss only
See docs/monitoring-stack-operations.md — Restore from Wasabi section.
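The Scenario 2 steps differ only in the target IP, so they can be templated to avoid retyping during an incident. A minimal sketch under stated assumptions: `rebuild_commands` is a hypothetical helper, it only prints commands and does not execute them, and it omits the `-i` key option used elsewhere in this workbook.

```python
import ipaddress
from typing import List

REPO_URL = "https://github.com/radieu/p24-infra"


def rebuild_commands(new_ip: str) -> List[str]:
    """Return the Scenario 2 commands, in order, for the given server IP."""
    ipaddress.ip_address(new_ip)  # raises ValueError on a malformed address
    target = f"root@{new_ip}"
    return [
        # The trailing comma makes Ansible treat the IP as an inline inventory.
        f"ansible-playbook ansible/playbooks/provision-new-vps.yml -i {new_ip},",
        f'ssh {target} "git clone {REPO_URL} /opt/p24-infra"',
        f"scp monitoring/.env {target}:/opt/p24-infra/monitoring/.env",
        f'ssh {target} "cd /opt/p24-infra/monitoring && docker compose up -d"',
        f'ssh {target} "cd /opt/p24-infra/openclaw && docker compose up -d"',
    ]


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-range placeholder, not a real host.
    for step, cmd in enumerate(rebuild_commands("203.0.113.10"), start=1):
        print(f"{step}. {cmd}")
```

Validating the address first avoids launching the playbook against a mistyped inventory.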
Healthcheck / Monitoring
| Check | Method | Alert |
|---|---|---|
| Host reachability | Prometheus job node scrapes 217.154.82.162:9100 | ServerDown fires after 5m |
| Container health | Docker healthcheck: directives on all monitoring containers | ContainerCrashLooping rule |
| Disk usage | node_exporter → LowDisk rule (< 10% free) | LowDisk fires |
| Memory usage | node_exporter → HighMemory rule (> 90%) | HighMemory fires |
| CPU usage | node_exporter → HighCPU rule (> 80% 5m avg) | HighCPU fires |
| SSH auth failures | /var/log/secure via node_exporter | SSHAuthFailures rule |
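The three resource thresholds in the table can be read as simple comparisons. The sketch below restates them in Python for quick reasoning about a set of readings. It is an illustration only: the real rules are PromQL expressions under monitoring/prometheus/ and may carry `for:` durations not modelled here, and `firing_alerts` is a hypothetical name.

```python
from typing import List


def firing_alerts(disk_free_pct: float, mem_used_pct: float, cpu_5m_avg_pct: float) -> List[str]:
    """Return the alert names whose table thresholds are exceeded."""
    alerts = []
    if disk_free_pct < 10:    # LowDisk: less than 10% free
        alerts.append("LowDisk")
    if mem_used_pct > 90:     # HighMemory: more than 90% used
        alerts.append("HighMemory")
    if cpu_5m_avg_pct > 80:   # HighCPU: more than 80% over a 5m average
        alerts.append("HighCPU")
    return alerts


if __name__ == "__main__":
    print(firing_alerts(disk_free_pct=5, mem_used_pct=95, cpu_5m_avg_pct=50))
```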
Manual check:
# Container status
ssh -i C:\Users\konar\.ssh\id_ed25519 root@217.154.82.162 \
"cd /opt/p24-infra/monitoring && docker compose ps"
# node_exporter up?
curl http://217.154.82.162:9100/metrics | head -5
Password Rotation
SSH key rotation (root + claude-admin)
Rotation frequency: 365 days. Last rotated: see docs/secrets-rotation-log.md.
# 1. Generate new key pair (on local workstation)
ssh-keygen -t ed25519 -f C:\Users\konar\.ssh\id_ed25519_new -C "vps-i1-root-$(date +%F)"
# 2. Add new public key to authorized_keys on vps-i1
ssh -i C:\Users\konar\.ssh\id_ed25519 root@217.154.82.162 \
"echo '<new-public-key>' >> /root/.ssh/authorized_keys"
# 3. Verify new key works
ssh -i C:\Users\konar\.ssh\id_ed25519_new root@217.154.82.162 "hostname"
# 4. Remove old key (authorized_keys stores public keys, not fingerprints, so match on the old key's comment)
ssh -i C:\Users\konar\.ssh\id_ed25519_new root@217.154.82.162 \
"sed -i '/<old-key-comment>/d' /root/.ssh/authorized_keys"
# 5. Replace local key
mv C:\Users\konar\.ssh\id_ed25519_new C:\Users\konar\.ssh\id_ed25519
mv C:\Users\konar\.ssh\id_ed25519_new.pub C:\Users\konar\.ssh\id_ed25519.pub
# 6. Update GH Secret VPS_ROOT_SSH_KEY (base64)
$key = [Convert]::ToBase64String([IO.File]::ReadAllBytes("C:\Users\konar\.ssh\id_ed25519"))
gh secret set VPS_ROOT_SSH_KEY -b $key -R radieu/p24-infra
# 7. Update VPS_SSH_PRIVATE_KEY for claude-admin separately if different key
# 8. Log the rotation in docs/secrets-rotation-log.md
Grafana admin / Prometheus basic_auth password
See docs/grafana-operations.md and docs/monitoring-stack-operations.md.
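The 365-day SSH key rotation interval lends itself to a simple due-date check. A minimal sketch, assuming the last rotation date is read manually from docs/secrets-rotation-log.md; the function names are hypothetical and no log parsing is attempted.

```python
from datetime import date, timedelta
from typing import Optional

ROTATION_PERIOD = timedelta(days=365)  # rotation frequency stated above


def days_until_rotation(last_rotated: date, today: Optional[date] = None) -> int:
    """Return days left before rotation is due (negative when overdue)."""
    today = today or date.today()
    return (last_rotated + ROTATION_PERIOD - today).days


def rotation_due(last_rotated: date, today: Optional[date] = None) -> bool:
    """Return True once the rotation period has fully elapsed."""
    return days_until_rotation(last_rotated, today) <= 0


if __name__ == "__main__":
    # Example date only; substitute the value from the rotation log.
    print(days_until_rotation(date(2025, 1, 1)))
```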
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| docker compose ps shows container Exit 1 | Bad config or missing env var | docker compose logs <container> |
| Prometheus targets all DOWN | Prometheus itself restarted | docker compose restart prometheus |
| Caddy 502 | Upstream container not running | docker compose up -d <service> |
| SSH connection refused | sshd crashed or firewall changed | Console login via IONOS panel → systemctl restart sshd |
| Disk full | Log accumulation or Prometheus TSDB | docker system prune -f; extend volume if needed |
| node_exporter unreachable | Systemd service stopped | systemctl restart node_exporter |
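For the first symptom in the table, exited containers can be picked out of machine-readable `docker compose ps` output instead of being read by eye. The sketch assumes the output of `docker compose ps --all --format json` and that each record carries `Name`, `State`, and `ExitCode` fields; depending on the Compose version the output is either one JSON array or one JSON object per line, so both are handled. `exited_containers` is a hypothetical helper.

```python
import json
from typing import List, Tuple


def exited_containers(ps_output: str) -> List[Tuple[str, int]]:
    """Return (name, exit_code) for every container whose State is 'exited'."""
    text = ps_output.strip()
    if not text:
        return []
    if text.startswith("["):
        records = json.loads(text)  # single JSON array
    else:
        # one JSON object per line
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return [(r.get("Name"), r.get("ExitCode")) for r in records if r.get("State") == "exited"]


if __name__ == "__main__":
    sample = (
        '{"Name": "monitoring-loki-1", "State": "exited", "ExitCode": 1}\n'
        '{"Name": "monitoring-grafana-1", "State": "running", "ExitCode": 0}'
    )
    for name, code in exited_containers(sample):
        print(f"{name} exited with code {code}: check docker compose logs")
```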