Monitoring Stack — Prometheus + Grafana on OVH VPS

Deployment & Configuration Plan

Relation to main plan: Extends design and plan 1.md — specifically Phase 2, Week 6 (Server F) and the Monitoring section. Read that document first for full system context.


Architecture Overview

Monitored Servers (A–E)                    Monitoring Server F (OVH VPS2)
┌──────────────────────────┐               ┌──────────────────────────────────────────┐
│  node_exporter   :9100   │──scrape───────►  Prometheus :9090                        │
│  cAdvisor        :8080   │──scrape───────►    └─ Thanos Sidecar ──► Wasabi S3       │
│  n8n metrics     :5678   │──scrape───────►  Thanos Query :9091                      │
│  queue_exporter  :9200   │──scrape───────►  Alertmanager :9093                      │
└──────────────────────────┘               │  Grafana :3000                           │
                                           │  Caddy :80/:443 (TLS + reverse proxy)    │
                                           └──────────────────────────────────────────┘
                                                        │               │
                                           ┌────────────┘               └──────────────┐
                                           ▼                                            ▼
                                      Wasabi S3                          Supabase PostgreSQL
                                 (long-term TSDB blocks,             (business metrics — queues,
                                  downsampled by Thanos)              incidents, GPS, PDFs)

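Why Thanos instead of remote_write? Block-based upload to object storage is more resilient for small infrastructure: there is no write amplification, compaction and downsampling are built in, and Thanos Query provides a unified PromQL layer over both local and archived data. Note that Thanos Query reads archived blocks through a Thanos Store Gateway; the compose file below defines only the sidecar and the querier, so a Store Gateway must be added before data older than the local 15-day retention can be queried.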


Components

| Container | Image | Internal port | Purpose |
|---|---|---|---|
| prometheus | prom/prometheus:latest | 9090 | Metrics collection, local TSDB (15d) |
| thanos-sidecar | quay.io/thanos/thanos:latest | 10901/10902 | Uploads 2h TSDB blocks to Wasabi S3 |
| thanos-query | quay.io/thanos/thanos:latest | 9091 | Unified PromQL over local + S3 data |
| alertmanager | prom/alertmanager:latest | 9093 | Alert routing → email |
| grafana | grafana/grafana:latest | 3000 | Dashboards, dual data source |
| queue-exporter | custom Python | 9200 | Supabase queue depths → Prometheus metrics |
| caddy | caddy:latest | 80, 443 | TLS termination, reverse proxy |

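Exporters on each monitored server (A–E). node_exporter is installed as a systemd service; cAdvisor runs as a Docker container on the Docker hosts:

| Exporter | Port | Purpose |
|---|---|---|
| node_exporter | 9100 | CPU, RAM, disk, network per VPS |
| cAdvisor | 8080 | Per-container CPU, RAM, restart counts |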

Repository Structure

monitoring/
├── docker-compose.yml
├── .env                              # secrets — never committed
├── .env.example
├── Caddyfile
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── infrastructure.yml
│       └── queues.yml
├── thanos/
│   └── s3.yml                        # Wasabi object store config
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   ├── prometheus.yml
│       │   └── supabase.yml
│       └── dashboards/
│           ├── dashboards.yml        # provider config
│           ├── infrastructure.json
│           ├── docker-health.json
│           ├── queues.json
│           ├── business-kpis.json
│           └── n8n-workflows.json
├── exporters/
│   └── queue-exporter/
│       ├── Dockerfile
│       ├── requirements.txt
│       └── app.py
└── scripts/
    ├── install-node-exporter.sh      # run on each monitored server
    ├── install-cadvisor.sh
    └── add-scrape-target.sh

Docker Compose

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
      - --web.enable-lifecycle
    ports:
      - "127.0.0.1:9090:9090"
 
  thanos-sidecar:
    image: quay.io/thanos/thanos:latest
    restart: unless-stopped
    depends_on: [prometheus]
    command:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://prometheus:9090
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      - --objstore.config-file=/etc/thanos/s3.yml
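    volumes:
      # Read-write: the sidecar's shipper writes thanos.shipper.json into the TSDB directory
      - prometheus_data:/prometheus
      - ./thanos:/etc/thanos:ro
    ports:
      # Published on loopback so the sidecar health check (curl localhost:10902) works from the host
      - "127.0.0.1:10902:10902"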
 
  thanos-query:
    image: quay.io/thanos/thanos:latest
    restart: unless-stopped
    depends_on: [thanos-sidecar]
    command:
      - query
      - --http-address=0.0.0.0:9091
      - --grpc-address=0.0.0.0:10903
      - --endpoint=thanos-sidecar:10901   # --store is deprecated in favour of --endpoint in recent Thanos releases
    ports:
      - "127.0.0.1:9091:9091"
 
  alertmanager:
    image: prom/alertmanager:latest
    restart: unless-stopped
    volumes:
      - ./alertmanager:/etc/alertmanager:ro
      - alertmanager_data:/alertmanager
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --storage.path=/alertmanager
    ports:
      - "127.0.0.1:9093:9093"
 
  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_SERVER_ROOT_URL=https://monitoring.ecotrans.eu
      - GF_SERVER_DOMAIN=monitoring.ecotrans.eu
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=${SMTP_HOST}:587   # Grafana expects host:port; 587 matches the Alertmanager SMTP setting
      - GF_SMTP_USER=${SMTP_USER}
      - GF_SMTP_PASSWORD=${SMTP_PASSWORD}
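      - GF_SMTP_FROM_ADDRESS=monitoring@ecotrans.eu
      # Referenced as ${...} in grafana/provisioning/datasources/supabase.yml
      - SUPABASE_DB_HOST=${SUPABASE_DB_HOST}
      - SUPABASE_GRAFANA_PASSWORD=${SUPABASE_GRAFANA_PASSWORD}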
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "127.0.0.1:3000:3000"
    depends_on: [thanos-query]
 
  queue-exporter:
    build: ./exporters/queue-exporter
    restart: unless-stopped
    environment:
      - SUPABASE_URL=${SUPABASE_URL}
      - SUPABASE_SERVICE_KEY=${SUPABASE_SERVICE_KEY}
    ports:
      - "127.0.0.1:9200:9200"
 
  caddy:
    image: caddy:latest
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data
      - caddy_config:/config
 
volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:
  caddy_data:
  caddy_config:

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: ecotrans-fleet
    env: production
 
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
 
rule_files:
  - /etc/prometheus/rules/*.yml
 
scrape_configs:
  # System metrics — all VPS servers
  - job_name: node
    static_configs:
      - targets:
          - server-a.ecotrans.internal:9100   # IONOS — OpenClaw
          - server-b.ecotrans.internal:9100   # OVH — n8n + Flask + Whisper
          - server-f.ecotrans.internal:9100   # OVH — this monitoring server (self)
 
  # Docker container metrics (cAdvisor)
  - job_name: cadvisor
    static_configs:
      - targets:
          - server-b.ecotrans.internal:8080
          - server-f.ecotrans.internal:8080
 
  # n8n workflow metrics — requires N8N_METRICS=true env on n8n container
  - job_name: n8n
    scrape_interval: 60s
    metrics_path: /metrics
    static_configs:
      - targets:
          - server-b.ecotrans.internal:5678
 
  # Supabase queue depths (custom Python exporter)
  - job_name: queue_depths
    scrape_interval: 60s
    static_configs:
      - targets:
          - queue-exporter:9200   # compose service name; localhost would be the Prometheus container itself
 
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: thanos
    static_configs:
      - targets: ['thanos-sidecar:10902', 'thanos-query:9091']

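Network: public IPs + firewall (decided). Scraping uses the public IP of each server. On every monitored server, firewall rules allow ports 9100 (node_exporter) and 8080 (cAdvisor) from Server F's IP only; all other sources are blocked. Server B additionally needs port 5678 open to Server F for the n8n scrape job. No private network is needed. In practice, replace the hostnames above with the actual public IPs, or map them with extra_hosts on the prometheus service; entries in Server F's own /etc/hosts are generally not visible inside the Prometheus container.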

Example UFW rule on Server A/B (run once per monitored server):

ufw allow from <SERVER_F_PUBLIC_IP> to any port 9100
ufw allow from <SERVER_F_PUBLIC_IP> to any port 8080
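
The per-server rules can be generated rather than typed by hand. The following is a minimal sketch, not part of the repository: the EXPORTER_PORTS inventory and the ufw_rules helper are hypothetical names, the port numbers are taken from the scrape config above, and 203.0.113.10 is a documentation-range placeholder for Server F's real IP.

```python
import ipaddress

# Hypothetical inventory: which exporter ports each monitored server exposes.
# Port numbers come from the scrape config; the dict layout is an assumption.
EXPORTER_PORTS = {
    "server-a": [9100],              # node_exporter only
    "server-b": [9100, 8080, 5678],  # node_exporter, cAdvisor, n8n metrics
}


def ufw_rules(server_f_ip: str, ports: list[int]) -> list[str]:
    """Return one `ufw allow` command per port, restricted to Server F's IP."""
    ipaddress.ip_address(server_f_ip)  # raises ValueError on a malformed address
    return [
        f"ufw allow from {server_f_ip} to any port {port} proto tcp"
        for port in ports
    ]


if __name__ == "__main__":
    for server, ports in EXPORTER_PORTS.items():
        print(f"# run on {server}")
        for rule in ufw_rules("203.0.113.10", ports):
            print(rule)
```

Validating the address first keeps a typo in the IP from producing a rule that silently matches nothing.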

Thanos / Wasabi S3 Configuration

# thanos/s3.yml
type: S3
config:
  bucket: ecotrans-monitoring
  endpoint: s3.eu-central-1.wasabisys.com   # must be explicit — no AWS default
  region: eu-central-1                       # must match Wasabi bucket region
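  # Thanos reads this file literally and does not expand ${...} placeholders.
  # Write the real keys into the copy on the server and keep that copy out of git.
  access_key: ${WASABI_ACCESS_KEY}
  secret_key: ${WASABI_SECRET_KEY}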
  insecure: false

Wasabi gotchas (confirmed via Perplexity research):

  • The endpoint field is mandatory: without an explicit Wasabi endpoint, Thanos cannot reach the bucket
  • region must match the region the Wasabi bucket was created in
  • Newly uploaded blocks are eventually consistent; allow 5–10 minutes before querying archived data via Thanos Query
  • If uploads work against AWS S3 but not Wasabi, the cause is most likely one of: endpoint, region, or DNS addressing style
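
The ${WASABI_ACCESS_KEY} and ${WASABI_SECRET_KEY} placeholders in s3.yml have to become real values before Thanos starts. The sketch below shows one way to do that with the standard library, on the assumption that Thanos does not substitute environment variables itself (verify against the version you deploy). The render_config helper is a hypothetical name, not an existing script in the repo.

```python
import os
from string import Template


def render_config(template_text: str, env: dict[str, str]) -> str:
    """Replace ${VAR} placeholders with values from env.

    Template.substitute raises KeyError when a referenced variable is
    missing, so an incomplete .env fails loudly instead of producing a
    config with empty credentials.
    """
    return Template(template_text).substitute(env)


if __name__ == "__main__":
    sample = (
        "access_key: ${WASABI_ACCESS_KEY}\n"
        "secret_key: ${WASABI_SECRET_KEY}\n"
    )
    # In real use, pass dict(os.environ) after loading .env into the shell.
    print(render_config(sample, {"WASABI_ACCESS_KEY": "AK", "WASABI_SECRET_KEY": "SK"}))
```

The same helper would work for any other config file in the stack that carries ${...} placeholders but is read by a tool that does not expand them.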

Data retention tiers:

| Tier | Location | Resolution | Retention |
|---|---|---|---|
| Raw | Local Prometheus TSDB | 15s | 15 days |
| Raw blocks | Wasabi S3 (uploaded by sidecar every 2h) | 15s | 30 days |
| Downsampled 5m | Wasabi S3 (Thanos compact) | 5 min | 90 days |
| Downsampled 1h | Wasabi S3 (Thanos compact) | 1 hour | 1 year |

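Estimated Wasabi volume: ~3–8 GB of compressed metrics per month, which by volume alone costs < 1€/month. Check Wasabi's minimum monthly storage charge before relying on this figure, since a minimum billable volume can exceed actual usage at this size.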
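
The volume estimate can be sanity-checked with simple arithmetic. This is a rough sketch under stated assumptions: the active series count is a hypothetical input (read the real one from prometheus_tsdb_head_series), and 1.5 bytes per sample is the midpoint of the 1–2 bytes per sample that Prometheus cites for compressed storage. It ignores index overhead and the slower 60s jobs.

```python
def monthly_bytes(
    active_series: int,
    scrape_interval_s: float,
    bytes_per_sample: float = 1.5,
    days: int = 30,
) -> float:
    """Estimate compressed sample bytes written over `days`."""
    samples_per_series = days * 86_400 / scrape_interval_s
    return active_series * samples_per_series * bytes_per_sample


if __name__ == "__main__":
    # 10,000 series is an illustrative figure, not a measurement of this fleet.
    estimate = monthly_bytes(active_series=10_000, scrape_interval_s=15)
    print(f"{estimate / 1e9:.2f} GB per 30 days")
```

With these inputs the estimate is about 2.6 GB per 30 days, so the 3–8 GB range corresponds to roughly 12,000 to 31,000 active series at a 15s interval.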

Thanos compaction (run weekly as a one-off, not as a persistent container, to save RAM). The --wait flag is deliberately omitted: it keeps the compactor running and waiting for new blocks, which defeats a one-off run.

docker run --rm \
  -v $(pwd)/thanos:/etc/thanos \
  quay.io/thanos/thanos:latest \
  compact \
  --objstore.config-file=/etc/thanos/s3.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d

Alerting Rules

# prometheus/rules/infrastructure.yml
groups:
  - name: servers
    rules:
      - alert: ServerDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Server {{ $labels.instance }} is unreachable"
          description: "node_exporter on {{ $labels.instance }} down for 2+ minutes"
 
      - alert: HighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}: {{ $value | printf \"%.0f\" }}%"
 
      - alert: LowDisk
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk on {{ $labels.instance }}: {{ $value | printf \"%.0f\" }}% free"
 
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }}: {{ $value | printf \"%.0f\" }}% used"
 
  - name: containers
    rules:
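      # Assumption: cAdvisor exposes no restart counter, so restarts are inferred
      # from changes in container_start_time_seconds for the same container.
      - alert: ContainerHighRestarts
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3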
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} restarted {{ $value | printf \"%.0f\" }}× in 1h"
 
      - alert: ContainerHighRestartsCritical
        expr: changes(container_start_time_seconds{name!=""}[30m]) > 5   # restarts inferred from start-time changes
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} crash-looping"

# prometheus/rules/queues.yml
groups:
  - name: processing_queues
    rules:
      - alert: TranscriptionQueueWarning
        expr: supabase_queue_depth{queue="pending_transcriptions",status="pending"} > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Transcription queue: {{ $value | printf \"%.0f\" }} pending jobs"
 
      - alert: TranscriptionQueueCritical
        expr: supabase_queue_depth{queue="pending_transcriptions",status="pending"} > 200
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Transcription queue critical: {{ $value | printf \"%.0f\" }} pending"
 
      - alert: PDFQueueWarning
        expr: supabase_queue_depth{queue="pending_pdf_processing",status="pending"} > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PDF queue: {{ $value | printf \"%.0f\" }} pending jobs"
 
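      # Fires when the 'processing' count stays above 0 for 35m without a break.
      # The gauge counts rows, not individual jobs, so a busy queue with healthy
      # turnover can also trigger this alert.
      - alert: StuckProcessingJobs
        expr: supabase_queue_depth{status="processing"} > 0
        for: 35m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.queue }} has had jobs in 'processing' for >35m; jobs may be stuck or timeout logic may have failed"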
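
How the `for:` clause interacts with a gauge like supabase_queue_depth is easy to misread, so here is a small simulation. It is a simplified model of the pending → firing transition, not Prometheus code: firing_at is a hypothetical helper, and the samples stand in for one rule evaluation each.

```python
def firing_at(
    samples: list[tuple[int, float]],
    threshold: float,
    for_seconds: int,
) -> list[int]:
    """Return the timestamps at which the alert would be firing.

    samples: (unix_seconds, value) pairs in time order, one per evaluation.
    The condition must hold at every evaluation for at least for_seconds
    before the alert moves from pending to firing; a single evaluation
    where it does not hold resets the timer.
    """
    firing: list[int] = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_seconds:
                firing.append(ts)
        else:
            pending_since = None
    return firing


if __name__ == "__main__":
    # One evaluation every 5 minutes for 40 minutes, always 2 jobs in 'processing'.
    steady = [(t, 2.0) for t in range(0, 2401, 300)]
    print(firing_at(steady, threshold=0, for_seconds=35 * 60))
```

In the steady case the alert fires from minute 35 onward even if the two jobs are different ones each time, because the rule sees only the count. One evaluation at zero resets the timer.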

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
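  # Alertmanager does not expand ${...} placeholders in this file; substitute
  # the real SMTP values into the copy on the server before starting the container.
  smtp_smarthost: '${SMTP_HOST}:587'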
  smtp_from: 'monitoring@ecotrans.eu'
  smtp_auth_username: '${SMTP_USER}'
  smtp_auth_password: '${SMTP_PASSWORD}'
  smtp_require_tls: true
 
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: email-default
 
  routes:
    - matchers:
        - severity = critical
      receiver: email-critical
      repeat_interval: 1h
 
    - matchers:
        - severity = warning
      receiver: email-warning-digest
      group_wait: 4h
      group_interval: 12h
      repeat_interval: 24h
 
receivers:
  - name: email-critical
    email_configs:
      - to: 'radieu@gmail.com'
        subject: '[CRITICAL] {{ .GroupLabels.alertname }} — Ecotrans Fleet'
        send_resolved: true
 
  - name: email-default
    email_configs:
      - to: 'radieu@gmail.com'
        subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        send_resolved: true
 
  - name: email-warning-digest
    email_configs:
      - to: 'radieu@gmail.com'
        subject: '[WARNING Digest] Ecotrans Fleet Monitoring'
        send_resolved: false
 
inhibit_rules:
  # Critical on same instance suppresses its own warnings
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [instance]
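
The routing tree and the inhibit rule above can be read as two small decisions per alert. The sketch below models them in plain Python as an aid to reasoning about which receiver an alert reaches; it is a simplification, not Alertmanager's implementation. The names pick_receiver and is_inhibited are hypothetical, and grouping and timing are left out.

```python
# Child routes in file order. The first match wins here because
# neither route sets `continue: true`.
ROUTES = [
    ({"severity": "critical"}, "email-critical"),
    ({"severity": "warning"}, "email-warning-digest"),
]
DEFAULT_RECEIVER = "email-default"


def pick_receiver(labels: dict[str, str]) -> str:
    """Return the receiver of the first child route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


def is_inhibited(alert: dict[str, str], firing: list[dict[str, str]]) -> bool:
    """A warning is muted while a critical alert with the same instance fires.

    Alerts that both lack the instance label also count as equal.
    """
    if alert.get("severity") != "warning":
        return False
    return any(
        other.get("severity") == "critical"
        and other.get("instance") == alert.get("instance")
        for other in firing
    )


if __name__ == "__main__":
    warning = {"alertname": "HighCPU", "severity": "warning", "instance": "server-b:9100"}
    critical = {"alertname": "ServerDown", "severity": "critical", "instance": "server-b:9100"}
    print(pick_receiver(warning), is_inhibited(warning, [critical]))
```

One consequence worth noting: alerts without a severity label fall through to email-default, which is the only path that uses the 4h repeat interval from the top-level route.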

Grafana Provisioning

Data Sources

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://thanos-query:9091
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: 15s

# grafana/provisioning/datasources/supabase.yml
apiVersion: 1
datasources:
  - name: Supabase
    type: postgres
    uid: supabase
    # Use direct connection :5432, NOT pooler :6543
    # Pooler (Supavisor) is for short-lived connections; Grafana holds persistent connections
    url: ${SUPABASE_DB_HOST}:5432
    database: postgres
    user: grafana_readonly
    secureJsonData:
      password: ${SUPABASE_GRAFANA_PASSWORD}
    jsonData:
      sslmode: require       # 'require' is sufficient; 'verify-full' needs CA cert file
      maxOpenConns: 5
      maxIdleConns: 2
      connMaxLifetime: 14400
      postgresVersion: 1500
      timescaledb: false

Supabase — create read-only role (SQL Editor):

CREATE ROLE grafana_readonly WITH LOGIN PASSWORD 'strong-password-here';
GRANT USAGE ON SCHEMA public TO grafana_readonly;
GRANT SELECT ON
  pending_transcriptions,
  pending_pdf_processing,
  incidents,
  fleet_positions,
  whatsapp_messages
TO grafana_readonly;

Caddy Reverse Proxy

# Caddyfile
monitoring.ecotrans.eu {
    reverse_proxy grafana:3000
    encode gzip
}

Caddy auto-provisions TLS via Let’s Encrypt. No manual certificate management needed.


Custom Queue Depth Exporter

Lightweight Python service that queries Supabase every 60s and exposes metrics in Prometheus format.

# exporters/queue-exporter/app.py
"""Expose Supabase queue depths as Prometheus gauges on :9200."""
import logging
import os
import time

from prometheus_client import Gauge, start_http_server
from supabase import create_client

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("queue-exporter")

QUEUE_DEPTH = Gauge(
    'supabase_queue_depth',
    'Number of jobs by queue table and status',
    ['queue', 'status']
)

QUEUES = ['pending_transcriptions', 'pending_pdf_processing']
STATUSES = ['pending', 'processing', 'failed', 'failed_permanently', 'completed']
POLL_SECONDS = 60


def collect(client):
    """Set one gauge per (queue, status) pair from an exact row count."""
    for table in QUEUES:
        for status in STATUSES:
            try:
                result = (
                    client.table(table)
                    .select('id', count='exact')
                    .eq('status', status)
                    .execute()
                )
                QUEUE_DEPTH.labels(queue=table, status=status).set(result.count or 0)
            except Exception:
                # On failure the gauge keeps its previous value; the error is only logged.
                log.exception("Error collecting %s/%s", table, status)


if __name__ == '__main__':
    # Create the client once instead of on every poll.
    client = create_client(os.environ['SUPABASE_URL'], os.environ['SUPABASE_SERVICE_KEY'])
    start_http_server(9200)
    log.info("Queue exporter running on :9200")
    while True:
        collect(client)
        time.sleep(POLL_SECONDS)

# exporters/queue-exporter/Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]

# exporters/queue-exporter/requirements.txt
prometheus_client==0.20.0
supabase==2.4.0
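
To check the exporter's output without eyeballing it, the /metrics text can be parsed back into a dictionary. This is a standard-library sketch under stated assumptions: parse_queue_depths is a hypothetical helper, and the sample text is illustrative output in the Prometheus exposition format, not captured from a running exporter.

```python
import re

# Matches sample lines such as:
#   supabase_queue_depth{queue="pending_transcriptions",status="pending"} 12.0
SAMPLE_LINE = re.compile(r'^supabase_queue_depth\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_queue_depths(text: str) -> dict[tuple[str, str], float]:
    """Map (queue, status) to the gauge value found in exposition text."""
    depths: dict[tuple[str, str], float] = {}
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue  # skips HELP/TYPE comments and unrelated metrics
        labels = dict(LABEL.findall(match.group("labels")))
        depths[(labels["queue"], labels["status"])] = float(match.group("value"))
    return depths


if __name__ == "__main__":
    sample = """\
# HELP supabase_queue_depth Number of jobs by queue table and status
# TYPE supabase_queue_depth gauge
supabase_queue_depth{queue="pending_transcriptions",status="pending"} 12.0
supabase_queue_depth{queue="pending_pdf_processing",status="failed"} 0.0
"""
    print(parse_queue_depths(sample))
```

A check like this can confirm that all ten queue/status combinations are present after the first poll, which catches a misspelled table name that would otherwise only show up in the logs.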

n8n Metrics Setup

On Server B, add these env vars to the n8n container and recreate (not just restart):

# server-b docker-compose.yml — n8n service additions
services:
  n8n:
    environment:
      - N8N_METRICS=true
      - N8N_METRICS_INCLUDE_QUEUE_METRICS=true
      - N8N_METRICS_QUEUE_METRICS_INTERVAL=5000

Verify with: curl http://server-b:5678/metrics | grep n8n_

Common pitfall: Container must be recreated (docker compose up -d --force-recreate n8n), not just restarted — env vars don’t update on docker compose restart.


Dashboard Specifications

Dashboard 1: Infrastructure Overview

  • Stat row: All servers UP/DOWN (colored green/red)
  • Time series: CPU % per server (5m avg)
  • Time series: RAM used % per server
  • Gauge: Disk free % per server (threshold: warn <20%, critical <10%)
  • Time series: Network I/O (Mbps) per server

Dashboard 2: Docker Container Health

  • Table: All containers — name | status | restart count 1h | CPU% | RAM MB
  • Time series: Container restart events over 24h
  • Time series: Per-service CPU and RAM (n8n, flask-pdf, faster-whisper)

Dashboard 3: Processing Queues (dual source)

  • Big stat: Current pending transcriptions (Prometheus — queue-exporter)
  • Big stat: Current pending PDF jobs (Prometheus — queue-exporter)
  • Time series: Queue depth over time by status (pending / processing / failed)
  • Bar chart: Jobs completed per hour (Supabase SQL):
    SELECT
      date_trunc('hour', updated_at) AS time,
      COUNT(*) AS completed
    FROM pending_transcriptions
    WHERE status = 'completed' AND updated_at > NOW() - INTERVAL '24 hours'
    GROUP BY 1 ORDER BY 1
  • Stat: Stuck jobs (processing > 35 min)
  • Stat: Failed permanently last 24h

Dashboard 4: Business KPIs (Supabase SQL panels)

  • Stat: Incidents created today vs yesterday
  • Time series: Incidents per day (30 days rolling):
    SELECT date_trunc('day', created_at) AS time, COUNT(*) AS incidents
    FROM incidents
    WHERE created_at > NOW() - INTERVAL '30 days'
    GROUP BY 1 ORDER BY 1
  • Stat: GPS coverage — % of fleet updated in last 10 minutes:
    SELECT ROUND(
      100.0 * COUNT(*) FILTER (WHERE last_updated > NOW() - INTERVAL '10 minutes')
      / NULLIF(COUNT(*), 0), 1
    ) AS coverage_pct
    FROM fleet_positions
  • Time series: WhatsApp messages received per hour
  • Time series: PDFs processed per day

Dashboard 5: n8n Workflow Health (Prometheus)

  • Stat: Workflow success rate last 24h
  • Time series: Execution counts by workflow name
  • Time series: Execution duration p50/p95
  • Table: Failed workflows with error type and count

Claude Code SSH Access Model

Initial server setup

# Run on Server F during first provisioning
adduser claude-admin --disabled-password
usermod -aG docker claude-admin
 
# Sudoers — scoped to docker and systemctl only
cat > /etc/sudoers.d/claude-admin <<EOF
claude-admin ALL=(ALL) NOPASSWD: /usr/bin/docker, /usr/local/bin/docker, /usr/bin/systemctl
EOF
 
# Add Claude Code's public key
mkdir -p /home/claude-admin/.ssh
echo "ssh-ed25519 AAAA...KEY... claude-code-monitoring" >> /home/claude-admin/.ssh/authorized_keys
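chmod 700 /home/claude-admin/.ssh && chmod 600 /home/claude-admin/.ssh/authorized_keys
# The directory was created as root above; sshd reads it as claude-admin, so hand it over
chown -R claude-admin:claude-admin /home/claude-admin/.ssh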

The Claude Code SSH private key is stored as a session environment variable (VPS_MONITORING_SSH_KEY) — never committed to the repo.
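
Because ssh expects a key file rather than an environment variable, a session needs a step that materialises the key with safe permissions. The following is a minimal sketch of that step. It assumes VPS_MONITORING_SSH_KEY holds the private key text itself (not a path), and write_key_file is a hypothetical helper name.

```python
import os
import stat
import tempfile


def write_key_file(env_var: str = "VPS_MONITORING_SSH_KEY") -> str:
    """Write the key from the environment to a private temp file; return its path.

    Raises KeyError if the variable is not set. The caller is responsible
    for deleting the file when the session ends.
    """
    key = os.environ[env_var]
    fd, path = tempfile.mkstemp(prefix="monitoring-key-")
    with os.fdopen(fd, "w") as handle:
        # OpenSSH rejects key files that lack a trailing newline.
        handle.write(key if key.endswith("\n") else key + "\n")
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # 0600, as ssh requires
    return path


if __name__ == "__main__":
    os.environ.setdefault("VPS_MONITORING_SSH_KEY", "placeholder-key-material")
    key_path = write_key_file()
    print(f"ssh -i {key_path} claude-admin@<server-f-ip>")
    os.remove(key_path)
```

Writing to a temp file and removing it afterwards keeps the key out of the working tree, which matches the rule that it is never committed.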

Common admin operations

# Check stack health
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose ps"
 
# View live logs
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose logs --tail=100 grafana"
 
# Reload Prometheus config (hot reload — no downtime)
ssh claude-admin@<server-f-ip> "curl -X POST http://localhost:9090/-/reload"
 
# Restart single service
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose restart alertmanager"
 
# Pull latest images and redeploy
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose pull && docker compose up -d"
 
# Push updated config file (example: alert rules)
scp ./prometheus/rules/queues.yml claude-admin@<server-f-ip>:/opt/monitoring/prometheus/rules/queues.yml
ssh claude-admin@<server-f-ip> "curl -X POST http://localhost:9090/-/reload"
 
# Check Thanos Wasabi upload status
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose logs thanos-sidecar | grep -E 'uploaded|error|failed'"
 
# Run Thanos compaction on-demand
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker run --rm \
  -v \$(pwd)/thanos:/etc/thanos \
  quay.io/thanos/thanos:latest \
  compact --objstore.config-file=/etc/thanos/s3.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d"

Installing exporters on monitored servers (scripts)

#!/usr/bin/env bash
# scripts/install-node-exporter.sh: run on Server A, B, etc.
set -euo pipefail   # stop on the first failed step (e.g. a failed download)
NODE_EXPORTER_VERSION=1.8.0
wget -qO- https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz | tar xz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter 2>/dev/null || true
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=0.0.0.0:9100
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now node_exporter
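echo "node_exporter running: $(systemctl is-active node_exporter)"

# scripts/install-cadvisor.sh: run on Docker hosts (B, F)
# Published on all interfaces so Server F can scrape it; a 127.0.0.1 binding
# would be unreachable from Prometheus on another host. Docker-published ports
# bypass UFW, so restrict 8080 to Server F's IP in the provider firewall or
# the DOCKER-USER iptables chain.
docker run -d \
  --restart=unless-stopped \
  --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  gcr.io/cadvisor/cadvisor:latest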

Environment Variables

# .env.example — copy to .env and fill on server; NEVER commit .env
GRAFANA_ADMIN_PASSWORD=
 
# SMTP — Mailgun EU region (smtp.eu.mailgun.org for EU-hosted accounts)
SMTP_HOST=smtp.eu.mailgun.org
SMTP_USER=
SMTP_PASSWORD=
 
# Server F public IP — used in firewall rules on monitored servers
SERVER_F_PUBLIC_IP=
 
# Wasabi S3 (scoped IAM key for ecotrans-monitoring bucket only)
WASABI_ACCESS_KEY=
WASABI_SECRET_KEY=
 
# Supabase
SUPABASE_URL=https://<project-ref>.supabase.co
SUPABASE_SERVICE_KEY=
SUPABASE_DB_HOST=db.<project-ref>.supabase.co
SUPABASE_GRAFANA_PASSWORD=

Security Model

| Layer | Rule |
|---|---|
| OVH/VPS firewall | Open globally: 22, 80, 443. Open to Server F IP only: 9100, 8080 on monitored servers |
| Containers on Server F | Published ports bind to 127.0.0.1, so there is no direct external access; Caddy is the exception and listens on 80/443 |
| Caddy | TLS for all external traffic; auto-renew via Let's Encrypt |
| Grafana external | https://monitoring.ecotrans.eu, login required |
| Prometheus / Alertmanager | Not exposed externally; Grafana queries internally |
| SSH access | Key-based only; PasswordAuthentication no; PermitRootLogin no |
| Supabase role | grafana_readonly, SELECT only on 5 specific tables |
| Wasabi IAM | Scoped key with access to ecotrans-monitoring bucket only |
| Secrets | All in .env on server, never committed to git |

Daily Health Report (n8n Workflow)

Grafana OSS doesn’t support scheduled email reports — this is handled by an n8n workflow running at 08:00 daily.

Workflow: monitoring-daily-report

Cron (08:00 daily)
  → HTTP POST Grafana snapshot API (create snapshot for each dashboard)
  → HTTP GET screenshot via Grafana image renderer (if installed)
  → Build HTML email with:
      - Infrastructure status summary (query Prometheus API)
      - Queue depths (query Prometheus API)
      - Incident count last 24h (query Supabase)
      - Any active alerts (query Alertmanager API)
  → Send email via Mailgun

Prometheus API queries for the email (HTTP Request nodes):

# All targets status
GET http://<server-f-ip>:9090/api/v1/query?query=up

# Active alerts
GET http://<server-f-ip>:9093/api/v2/alerts

# Queue depths
GET http://<server-f-ip>:9090/api/v1/query?query=supabase_queue_depth{status="pending"}

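These calls go from n8n (Server B) to Server F's ports 9090 and 9093. As written, the compose file binds both ports to 127.0.0.1, so they are not reachable from Server B. To use this approach, either publish them on Server F's public interface and restrict them to Server B's IP (Docker-published ports bypass UFW, so enforce this in the OVH firewall or the DOCKER-USER chain), or route them through Caddy with authentication. The PromQL in the query parameter must also be URL-encoded.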
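
The queue-depth query contains braces and quotes, which must be percent-encoded in the request URL. The sketch below builds the URL and picks out down targets from an `up` response, using only the standard library. The helper names instant_query_url and down_targets are hypothetical; the JSON shape is the standard Prometheus instant-query response, and the sample payload is illustrative.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a /api/v1/query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response_body: str) -> list[str]:
    """Return the instance labels whose `up` sample is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [
        series["metric"].get("instance", "?")
        for series in payload["data"]["result"]
        if float(series["value"][1]) == 0.0  # value is [timestamp, "string"]
    ]


if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", 'supabase_queue_depth{status="pending"}'))
    sample = json.dumps({
        "status": "success",
        "data": {"resultType": "vector", "result": [
            {"metric": {"instance": "server-a:9100", "job": "node"}, "value": [0, "1"]},
            {"metric": {"instance": "server-b:9100", "job": "node"}, "value": [0, "0"]},
        ]},
    })
    print(down_targets(sample))
```

In an n8n HTTP Request node the same effect comes from passing the expression as a query parameter rather than concatenating it into the URL string.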

Alternative (simpler): Run grafana-image-renderer as a separate container next to Grafana (compose additions below). Then use Grafana's /render/d/<dashboard-id> endpoint to generate PNG screenshots and attach them to the email. No custom queries needed.

# grafana service addition in docker-compose.yml
  grafana:
    environment:
      - GF_RENDERING_SERVER_URL=http://renderer:8081/render
      - GF_RENDERING_CALLBACK_URL=http://grafana:3000/
    depends_on: [renderer]
 
  renderer:
    image: grafana/grafana-image-renderer:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8081:8081"

Testing Environment

For initial setup and testing, use the existing VPS defined in .env.local instead of provisioning OVH Server F.

The .env.local file (in this repo, not committed) contains:

TEST_VPS_HOST=        # IP or hostname of the test VPS
TEST_VPS_USER=        # SSH user (claude-admin or root for initial setup)
TEST_VPS_SSH_KEY=     # path to SSH private key

What changes for the test environment:

  • Same docker-compose.yml and configs — no test-specific variants
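  • Skip DNS (monitoring.ecotrans.eu): Grafana is bound to 127.0.0.1:3000, so reach it through an SSH tunnel (ssh -L 3000:localhost:3000 $TEST_VPS_USER@$TEST_VPS_HOST, then open http://localhost:3000), or via Caddy with a self-signed cert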
  • Wasabi bucket: use a separate ecotrans-monitoring-test bucket to avoid polluting production TSDB blocks
  • Monitored targets: only the test VPS itself (self-monitoring) + Server B if accessible; skip IONOS server A
  • The test VPS is treated as a throwaway — recreate freely

Switching target in Claude Code sessions:

# Test VPS
ssh $TEST_VPS_USER@$TEST_VPS_HOST "cd /opt/monitoring && docker compose ps"
 
# Production (OVH Server F, later)
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose ps"

Implementation Checklist

Phase 0 — Test environment (on existing VPS from .env.local)

  • SSH into test VPS, install Docker + Docker Compose v2
  • Create Wasabi bucket ecotrans-monitoring-test
  • Deploy full stack (docker compose up -d)
  • Install node_exporter + cAdvisor on the test VPS (self-monitoring)
  • Verify Prometheus scrapes self → Grafana shows data
  • Verify Thanos uploads to ecotrans-monitoring-test bucket
  • Build and validate all 5 dashboards
  • Test alert pipeline: stop node_exporter → email received
  • Test n8n daily report workflow

Phase 1 — Production server provisioning (OVH Server F)

  • Provision OVH VPS2 Server F (Ubuntu 24.04 LTS, 6 cores, 12GB RAM)
  • Create claude-admin user, add SSH key, configure sudoers
  • Install Docker + Docker Compose v2
  • OVH firewall: allow 22, 80, 443
  • Create Wasabi bucket ecotrans-monitoring in eu-central-1
  • Create Wasabi IAM user scoped to that bucket; copy keys to .env
  • Create Supabase grafana_readonly role (SQL above)
  • DNS: monitoring.ecotrans.eu → Server F IP

Phase 2 — Core stack

  • Clone monitoring repo to /opt/monitoring on Server F
  • Populate .env with all secrets
  • Validate Prometheus config (the image's entrypoint is prometheus, so promtool must be selected explicitly): docker run --rm --entrypoint promtool -v $(pwd)/prometheus:/etc/prometheus prom/prometheus:latest check config /etc/prometheus/prometheus.yml
  • docker compose up -d
  • Verify Prometheus UI: curl http://localhost:9090/-/healthy
  • Verify Thanos sidecar: curl http://localhost:10902/-/healthy
  • Verify Grafana: https://monitoring.ecotrans.eu
  • After ~2h: confirm first block uploaded to Wasabi

Phase 3 — Exporters on monitored servers

  • Run install-node-exporter.sh on Server A (IONOS)
  • Run install-node-exporter.sh on Server B (OVH)
  • Run install-cadvisor.sh on Server B
  • Run install-cadvisor.sh on Server F (self)
  • Enable n8n metrics on Server B: add env vars, recreate container
  • Verify all targets green: Prometheus → Status → Targets
  • Verify queue-exporter: curl http://localhost:9200/metrics | grep supabase_queue

Phase 4 — Grafana dashboards

  • Confirm Prometheus data source: test query up
  • Confirm Supabase data source: test query SELECT 1
  • Build Dashboard 1 (Infrastructure Overview)
  • Build Dashboard 2 (Docker Health)
  • Build Dashboard 3 (Processing Queues)
  • Build Dashboard 4 (Business KPIs)
  • Build Dashboard 5 (n8n Workflows)
  • Export all dashboard JSON → commit to grafana/provisioning/dashboards/

Phase 5 — Alerting

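  • Validate rules inside the running container: docker compose exec prometheus promtool check rules /etc/prometheus/rules/infrastructure.yml /etc/prometheus/rules/queues.yml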
  • Test critical path: manually stop node_exporter → confirm email within 3 min
  • Test warning path: trigger a warning-severity alert → confirm digest batching works (the first email arrives after the 4h group_wait)
  • Set up weekly Thanos compaction cron on Server F

Phase 6 — Hardening

  • Add Docker log rotation to compose (json-file driver, max 50MB × 3)
  • Document runbook: “container down”, “disk full”, “Wasabi upload stalled”
  • Verify OVH VPS snapshots enabled (monthly)

Cost Summary

| Component | Cost |
|---|---|
| OVH VPS2 Server F (6 cores, 12GB RAM) | 7€/month |
| Wasabi S3 ecotrans-monitoring | ~1€/month |
| monitoring.ecotrans.eu subdomain | included in domain |
| TLS certificate (Caddy + Let's Encrypt) | free |
| Total monitoring stack | ~8€/month |

Architectural Decisions

| # | Decision | Rationale |
|---|---|---|
| 1 | Network: public IPs + UFW firewall rules | Simplest: no private network provisioning; firewall on each server allows ports 9100/8080 from Server F IP only |
| 2 | Daily report: n8n workflow | Grafana OSS has no scheduled reports; n8n already in stack (see Daily Health Report section above) |
| 3 | SMTP: Mailgun EU | Already in stack; EU region host is smtp.eu.mailgun.org |
| 4 | Testing: existing VPS from .env.local | Avoids early OVH cost; full stack tested before production provisioning (see Testing Environment section) |
| 5 | Thanos compact: weekly one-off Docker run | 12GB VPS can't spare ~2GB RAM for persistent compactor; cron job is sufficient for this data volume |