Monitoring Stack — Prometheus + Grafana on OVH VPS
Deployment & Configuration Plan
Relation to main plan: Extends design and plan 1.md — specifically Phase 2, Week 6 (Server F) and the Monitoring section. Read that document first for full system context.
Why Thanos instead of remote_write?
Block-based upload to object storage is more resilient for small infrastructure — no write amplification, native compaction/downsampling built in, and Thanos Query provides a unified PromQL layer over both local and archived data without a separate backend process.
Components
| Container | Image | Internal port | Purpose |
|---|---|---|---|
| prometheus | prom/prometheus:latest | 9090 | Metrics collection, local TSDB (15d) |
| thanos-sidecar | quay.io/thanos/thanos:latest | 10901/10902 | Uploads 2h TSDB blocks to Wasabi S3 |
| thanos-query | quay.io/thanos/thanos:latest | 9091 | Unified PromQL over local + S3 data |
| alertmanager | prom/alertmanager:latest | 9093 | Alert routing → email |
| grafana | grafana/grafana:latest | 3000 | Dashboards (dual data source) |
| queue-exporter | custom Python | 9200 | Supabase queue depths → Prometheus metrics |
| caddy | caddy:latest | 80, 443 | TLS termination, reverse proxy |
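The queue-exporter row can be illustrated with a minimal sketch of how queue depths might be rendered in the Prometheus text exposition format. The metric name supabase_queue_depth and its status label come from the daily-report queries later in this plan; the queue label, the function name, and the input shape are assumptions made for illustration, and the Supabase query and the HTTP server on port 9200 are left out.

```python
from typing import Dict, Tuple

# Metric name as used by the daily-report queries in this plan.
# The "queue" label is an assumption; only "status" appears in the plan.
METRIC_NAME = "supabase_queue_depth"


def render_metrics(depths: Dict[Tuple[str, str], int]) -> str:
    """Render queue depths in the Prometheus text exposition format.

    `depths` maps (queue, status) -> row count, for example the result of
    a GROUP BY status query against each queue table.
    """
    lines = [
        f"# HELP {METRIC_NAME} Number of rows per queue table and status.",
        f"# TYPE {METRIC_NAME} gauge",
    ]
    # Sort for a stable output order between scrapes.
    for (queue, status), count in sorted(depths.items()):
        lines.append(f'{METRIC_NAME}{{queue="{queue}",status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        ("pending_transcriptions", "pending"): 4,
        ("pending_pdf_processing", "failed"): 1,
    }
    print(render_metrics(sample), end="")
```

Exposing a gauge per (queue, status) pair keeps the exporter stateless: each scrape reports the current row counts and Prometheus stores the history.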
Exporters on each monitored server (A–E), installed as systemd services:
Network — public IPs + firewall (decided): Scraping uses the public IP of each server. On every monitored server, firewall rules allow ports 9100 (node_exporter) and 8080 (cAdvisor) from Server F’s IP only — all other sources blocked. No private network needed. In practice, replace the hostnames above with actual public IPs or /etc/hosts aliases on Server F.
Example UFW rule on Server A/B (run once per monitored server):
ufw allow from <SERVER_F_PUBLIC_IP> to any port 9100
ufw allow from <SERVER_F_PUBLIC_IP> to any port 8080
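When the same rules have to be applied on several monitored servers, the commands can be generated rather than typed by hand. The following is a small sketch under that assumption; the helper name ufw_rules and the sample address (from the documentation range 203.0.113.0/24) are illustrative and not part of the plan.

```python
import ipaddress
from typing import Iterable, List

# Ports from this plan: 9100 = node_exporter, 8080 = cAdvisor.
EXPORTER_PORTS = (9100, 8080)


def ufw_rules(server_f_ip: str, ports: Iterable[int] = EXPORTER_PORTS) -> List[str]:
    """Return the ufw commands that allow Server F to reach the exporter ports."""
    # Raises ValueError on a malformed address, so a typo never reaches ufw.
    ip = ipaddress.ip_address(server_f_ip)
    return [f"ufw allow from {ip} to any port {port}" for port in ports]


if __name__ == "__main__":
    for rule in ufw_rules("203.0.113.10"):
        print(rule)
```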
Thanos / Wasabi S3 Configuration
# thanos/s3.yml
type: S3
config:
  bucket: ecotrans-monitoring
  endpoint: s3.eu-central-1.wasabisys.com  # must be explicit (the AWS default does not apply)
  region: eu-central-1                     # must match Wasabi bucket region
  access_key: ${WASABI_ACCESS_KEY}
  secret_key: ${WASABI_SECRET_KEY}
  insecure: false
Wasabi gotchas (confirmed via Perplexity research):
The endpoint field is mandatory — Thanos defaults to AWS if omitted and will fail against Wasabi
region must match the region the Wasabi bucket was created in
Newly uploaded blocks are eventually consistent — allow 5–10 minutes before querying archived data via Thanos Query
If uploads work against AWS S3 but not Wasabi, the cause is almost always one of: endpoint, region, or bucket addressing style (virtual-hosted DNS vs. path-style)
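The gotchas above lend themselves to a pre-flight check before the sidecar is started. The sketch below is a heuristic only: it takes the already-parsed config mapping from thanos/s3.yml as a plain dict, and the function name, the messages, and the "region appears in the endpoint hostname" rule are assumptions for illustration, not Thanos behaviour.

```python
from typing import Dict, List


def check_wasabi_config(config: Dict[str, object]) -> List[str]:
    """Return a list of likely problems in a Thanos S3 `config` mapping."""
    problems: List[str] = []
    endpoint = str(config.get("endpoint") or "")
    region = str(config.get("region") or "")

    if not endpoint:
        problems.append("endpoint is missing")
    elif "wasabisys.com" not in endpoint:
        problems.append(f"endpoint {endpoint!r} does not look like a Wasabi host")

    if not region:
        problems.append("region is missing")
    elif endpoint and region not in endpoint:
        # Heuristic: regional Wasabi hostnames embed the region name.
        problems.append(f"region {region!r} does not appear in endpoint {endpoint!r}")

    if config.get("insecure") is True:
        problems.append("insecure is true; TLS would be disabled")
    return problems


if __name__ == "__main__":
    cfg = {
        "bucket": "ecotrans-monitoring",
        "endpoint": "s3.eu-central-1.wasabisys.com",
        "region": "eu-central-1",
        "insecure": False,
    }
    print(check_wasabi_config(cfg) or "config looks consistent")
```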
# grafana/provisioning/datasources/supabase.yml
apiVersion: 1
datasources:
  - name: Supabase
    type: postgres
    uid: supabase
    # Use direct connection :5432, NOT pooler :6543
    # Pooler (Supavisor) is for short-lived connections; Grafana holds persistent connections
    url: ${SUPABASE_DB_HOST}:5432
    database: postgres
    user: grafana_readonly
    secureJsonData:
      password: ${SUPABASE_GRAFANA_PASSWORD}
    jsonData:
      sslmode: require  # 'require' is sufficient; 'verify-full' needs CA cert file
      maxOpenConns: 5
      maxIdleConns: 2
      connMaxLifetime: 14400
      postgresVersion: 1500
      timescaledb: false
Supabase — create read-only role (SQL Editor):
CREATE ROLE grafana_readonly WITH LOGIN PASSWORD 'strong-password-here';
GRANT USAGE ON SCHEMA public TO grafana_readonly;
GRANT SELECT ON pending_transcriptions, pending_pdf_processing, incidents, fleet_positions, whatsapp_messages
TO grafana_readonly;
Common pitfall: Container must be recreated (docker compose up -d --force-recreate n8n), not just restarted — env vars don’t update on docker compose restart.
Dashboard Specifications
Dashboard 1: Infrastructure Overview
Stat row: All servers UP/DOWN (colored green/red)
Time series: CPU % per server (5m avg)
Time series: RAM used % per server
Gauge: Disk free % per server (threshold: warn <20%, critical <10%)
Time series: Network I/O (Mbps) per server
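The disk gauge thresholds above (warn below 20% free, critical below 10% free) can be written out as a small classification. This is a sketch only: the function names are illustrative, and the byte counts are assumed to come from node_exporter's node_filesystem_avail_bytes and node_filesystem_size_bytes metrics.

```python
WARN_BELOW_PCT = 20.0      # warn when less than 20% of the disk is free
CRITICAL_BELOW_PCT = 10.0  # critical when less than 10% is free


def free_percent(avail_bytes: float, size_bytes: float) -> float:
    """Percentage of the filesystem that is still available."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100.0 * avail_bytes / size_bytes


def disk_status(free_pct: float) -> str:
    """Map a free-space percentage to the gauge's threshold bands."""
    if free_pct < CRITICAL_BELOW_PCT:
        return "critical"
    if free_pct < WARN_BELOW_PCT:
        return "warn"
    return "ok"


if __name__ == "__main__":
    pct = free_percent(avail_bytes=15e9, size_bytes=100e9)
    print(f"{pct:.1f}% free -> {disk_status(pct)}")
```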
Dashboard 2: Docker Container Health
Table: All containers — name | status | restart count 1h | CPU% | RAM MB
Time series: Container restart events over 24h
Time series: Per-service CPU and RAM (n8n, flask-pdf, faster-whisper)
Dashboard 3: Processing Queues (dual source)
Big stat: Current pending transcriptions (Prometheus — queue-exporter)
Big stat: Current pending PDF jobs (Prometheus — queue-exporter)
Time series: Queue depth over time by status (pending / processing / failed)
Bar chart: Jobs completed per hour (Supabase SQL):
SELECT date_trunc('hour', updated_at) AS time, COUNT(*) AS completed
FROM pending_transcriptions
WHERE status = 'completed' AND updated_at > NOW() - INTERVAL '24 hours'
GROUP BY 1 ORDER BY 1
Stat: Stuck jobs (processing > 35 min)
Stat: Failed permanently last 24h
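The "stuck jobs" stat above boils down to a filter on status and age. The sketch below shows that rule on in-memory rows; it assumes each row carries the status and updated_at columns used in the completed-jobs query, and that updated_at marks when the job entered processing. The function name and the sample rows are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, List, Mapping

STUCK_AFTER = timedelta(minutes=35)  # threshold from the dashboard spec


def stuck_jobs(
    jobs: Iterable[Mapping[str, Any]],
    now: datetime,
    threshold: timedelta = STUCK_AFTER,
) -> List[Mapping[str, Any]]:
    """Return jobs that have been in 'processing' for longer than `threshold`."""
    return [
        job
        for job in jobs
        if job["status"] == "processing" and now - job["updated_at"] > threshold
    ]


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    rows = [
        {"id": 1, "status": "processing", "updated_at": now - timedelta(minutes=50)},
        {"id": 2, "status": "processing", "updated_at": now - timedelta(minutes=5)},
        {"id": 3, "status": "pending", "updated_at": now - timedelta(hours=3)},
    ]
    print([job["id"] for job in stuck_jobs(rows, now)])
```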
Dashboard 4: Business KPIs (Supabase SQL panels)
Stat: Incidents created today vs yesterday
Time series: Incidents per day (30 days rolling):
SELECT date_trunc('day', created_at) AS time, COUNT(*) AS incidents
FROM incidents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY 1 ORDER BY 1
Stat: GPS coverage — % of fleet updated in last 10 minutes:
# Run on Server F during first provisioning
adduser claude-admin --disabled-password
usermod -aG docker claude-admin

# Sudoers: scoped to docker and systemctl only
cat > /etc/sudoers.d/claude-admin <<EOF
claude-admin ALL=(ALL) NOPASSWD: /usr/bin/docker, /usr/local/bin/docker, /usr/bin/systemctl
EOF
chmod 440 /etc/sudoers.d/claude-admin

# Add Claude Code's public key
mkdir -p /home/claude-admin/.ssh
echo "ssh-ed25519 AAAA...KEY... claude-code-monitoring" >> /home/claude-admin/.ssh/authorized_keys
chmod 700 /home/claude-admin/.ssh && chmod 600 /home/claude-admin/.ssh/authorized_keys
# These commands run as root, so hand the directory to the user; sshd reads the key file as claude-admin
chown -R claude-admin:claude-admin /home/claude-admin/.ssh
The Claude Code SSH private key is stored as a session environment variable (VPS_MONITORING_SSH_KEY) — never committed to the repo.
Installing exporters on monitored servers (scripts)
# scripts/install-node-exporter.sh: run on Server A, B, etc.
NODE_EXPORTER_VERSION=1.8.0
wget -qO- https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz | tar xz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter 2>/dev/null || true
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=0.0.0.0:9100
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now node_exporter
echo "node_exporter running: $(systemctl is-active node_exporter)"
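# scripts/install-cadvisor.sh: run on Docker hosts (B, F)
# Note: 127.0.0.1:8080 is reachable only from the host itself, which suits Server F
# (local scrape). On Server B, which Server F scrapes over the public IP, publish on
# the public interface instead (e.g. --publish=8080:8080) and keep the firewall rule.
docker run -d \
  --restart=unless-stopped \
  --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=127.0.0.1:8080:8080 \
  gcr.io/cadvisor/cadvisor:latest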
Environment Variables
# .env.example: copy to .env and fill on server; NEVER commit .env
GRAFANA_ADMIN_PASSWORD=

# SMTP: Mailgun EU region (smtp.eu.mailgun.org for EU-hosted accounts)
SMTP_HOST=smtp.eu.mailgun.org
SMTP_USER=
SMTP_PASSWORD=

# Server F public IP, used in firewall rules on monitored servers
SERVER_F_PUBLIC_IP=

# Wasabi S3 (scoped IAM key for ecotrans-monitoring bucket only)
WASABI_ACCESS_KEY=
WASABI_SECRET_KEY=

# Supabase
SUPABASE_URL=https://<project-ref>.supabase.co
SUPABASE_SERVICE_KEY=
SUPABASE_DB_HOST=db.<project-ref>.supabase.co
SUPABASE_GRAFANA_PASSWORD=
Security Model
| Layer | Rule |
|---|---|
| OVH/VPS firewall | Open globally: 22, 80, 443. Open to Server F IP only: 9100, 8080 on monitored servers |
| All containers | Bind to 127.0.0.1; no direct external access |
| Caddy | TLS for all external traffic; auto-renew via Let’s Encrypt |
| Grafana external | https://monitoring.ecotrans.eu, login required |
| Prometheus / Alertmanager | Not exposed externally; Grafana queries internally |
| SSH access | Key-based only; PasswordAuthentication no; PermitRootLogin no |
| Supabase role | grafana_readonly: SELECT only on 5 specific tables |
| Wasabi IAM | Scoped key with access to ecotrans-monitoring bucket only |
| Secrets | All in .env on server; never committed to git |
Daily Health Report (n8n Workflow)
Grafana OSS doesn’t support scheduled email reports — this is handled by an n8n workflow running at 08:00 daily.
Workflow: monitoring-daily-report
Cron (08:00 daily)
→ HTTP POST Grafana snapshot API (create snapshot for each dashboard)
→ HTTP GET screenshot via Grafana image renderer (if installed)
→ Build HTML email with:
- Infrastructure status summary (query Prometheus API)
- Queue depths (query Prometheus API)
- Incident count last 24h (query Supabase)
- Any active alerts (query Alertmanager API)
→ Send email via Mailgun
Prometheus API queries for the email (HTTP Request nodes):
# All targets status
GET http://<server-f-ip>:9090/api/v1/query?query=up
# Active alerts
GET http://<server-f-ip>:9093/api/v2/alerts
# Queue depths
GET http://<server-f-ip>:9090/api/v1/query?query=supabase_queue_depth{status="pending"}
These internal calls go from n8n (Server B) to Server F’s ports 9090 and 9093. Add firewall rule: allow Server B IP → port 9090, 9093 on Server F.
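Two details of these calls are easy to get wrong: the PromQL selector must be URL-encoded (braces and quotes are not valid in a raw query string), and the `up` result has to be reduced to a list of failing targets. The sketch below shows both using only the standard library. The function names and the sample base URL are illustrative; the response shape follows the Prometheus HTTP API instant-query format.

```python
from typing import Any, List, Mapping
from urllib.parse import urlencode


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response: Mapping[str, Any]) -> List[str]:
    """Return the instances whose `up` sample is 0 in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return sorted(
        sample["metric"].get("instance", "unknown")
        for sample in response["data"]["result"]
        # Sample values arrive as [timestamp, "<value as string>"].
        if float(sample["value"][1]) == 0.0
    )


if __name__ == "__main__":
    # Hypothetical base URL; substitute Server F's address.
    print(query_url("http://203.0.113.10:9090", 'supabase_queue_depth{status="pending"}'))
    sample_response = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"instance": "server-a:9100", "job": "node"}, "value": [1700000000, "1"]},
                {"metric": {"instance": "server-b:9100", "job": "node"}, "value": [1700000000, "0"]},
            ],
        },
    }
    print(down_targets(sample_response))
```

In the n8n workflow the same logic would sit in the HTTP Request and Function nodes; the Python version only spells out the encoding and the filtering.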
Alternative (simpler): Install grafana-image-renderer plugin into the Grafana container. Then use Grafana’s /render/d/<dashboard-id> endpoint to generate PNG screenshots, attach to email. No custom queries needed.
For initial setup and testing, use the existing VPS defined in .env.local instead of provisioning OVH Server F.
The .env.local file (in this repo, not committed) contains:
TEST_VPS_HOST=      # IP or hostname of the test VPS
TEST_VPS_USER=      # SSH user (claude-admin or root for initial setup)
TEST_VPS_SSH_KEY=   # path to SSH private key
What changes for the test environment:
Same docker-compose.yml and configs — no test-specific variants
Skip DNS (monitoring.ecotrans.eu) — access Grafana via http://<test-vps-ip>:3000 directly (or via Caddy with a self-signed cert)
Wasabi bucket: use a separate ecotrans-monitoring-test bucket to avoid polluting production TSDB blocks
Monitored targets: only the test VPS itself (self-monitoring) + Server B if accessible; skip IONOS server A
The test VPS is treated as a throwaway — recreate freely
Switching target in Claude Code sessions:
# Test VPS
ssh "$TEST_VPS_USER@$TEST_VPS_HOST" "cd /opt/monitoring && docker compose ps"
# Production (OVH Server F, later)
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose ps"
Implementation Checklist
Phase 0 — Test environment (on existing VPS from .env.local)
SSH into test VPS, install Docker + Docker Compose v2
Create Wasabi bucket ecotrans-monitoring-test
Deploy full stack (docker compose up -d)
Install node_exporter + cAdvisor on the test VPS (self-monitoring)
Verify Prometheus scrapes self → Grafana shows data
Verify Thanos uploads to ecotrans-monitoring-test bucket
Build and validate all 5 dashboards
Test alert pipeline: stop node_exporter → email received
Test n8n daily report workflow
Phase 1 — Production server provisioning (OVH Server F)
Provision OVH VPS2 Server F (Ubuntu 24.04 LTS, 6 cores, 12GB RAM)