Monitoring Stack — Prometheus + Grafana on OVH VPS

Deployment & Configuration Plan

Relation to main plan: Extends design and plan 1.md — specifically Phase 2, Week 6 (Server F) and the Monitoring section. Read that document first for full system context.

Architecture Overview

Monitored Servers (A–E)                    Monitoring Server F (OVH VPS2)
┌──────────────────────────┐               ┌──────────────────────────────────────────┐
│  node_exporter   :9100   │──scrape───────►  Prometheus :9090                        │
│  cAdvisor        :8080   │──scrape───────►    └─ Thanos Sidecar ──► Wasabi S3       │
│  n8n metrics     :5678   │──scrape───────►  Thanos Query :9091                      │
│  queue_exporter  :9200   │──scrape───────►  Alertmanager :9093                      │
└──────────────────────────┘               │  Grafana :3000                           │
                                           │  Caddy :80/:443 (TLS + reverse proxy)    │
                                           └──────────────────────────────────────────┘
                                                        │               │
                                           ┌────────────┘               └──────────────┐
                                           ▼                                            ▼
                                      Wasabi S3                          Supabase PostgreSQL
                                 (long-term TSDB blocks,             (business metrics — queues,
                                  downsampled by Thanos)              incidents, GPS, PDFs)

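Why Thanos instead of remote_write? Block-based upload to object storage is more resilient for small infrastructure: the sidecar ships finished 2h TSDB blocks instead of streaming every sample, compaction and downsampling are built in, and Thanos Query provides a single PromQL endpoint over both local and archived data. Note that Thanos Query reaches archived blocks through a Thanos Store Gateway (the thanos store component); the sidecar alone only serves data still held in the local TSDB.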

Components

Container	Image	Internal port	Purpose
`prometheus`	`prom/prometheus:latest`	9090	Metrics collection, local TSDB (15d)
`thanos-sidecar`	`quay.io/thanos/thanos:latest`	10901/10902	Uploads 2h TSDB blocks to Wasabi S3
`thanos-query`	`quay.io/thanos/thanos:latest`	9091	Unified PromQL over local + S3 data
`alertmanager`	`prom/alertmanager:latest`	9093	Alert routing → email
`grafana`	`grafana/grafana:latest`	3000	Dashboards — dual data source
`queue-exporter`	custom Python	9200	Supabase queue depths → Prometheus metrics
`caddy`	`caddy:latest`	80, 443	TLS termination, reverse proxy

Exporters on each monitored server (A–E). node_exporter is installed as a systemd service; cAdvisor runs as a Docker container on Docker hosts:

Exporter	Port	Purpose
`node_exporter`	9100	CPU, RAM, disk, network per VPS
`cAdvisor`	8080	Per-container CPU, RAM, restart counts

Repository Structure

monitoring/
├── docker-compose.yml
├── .env                              # secrets — never committed
├── .env.example
├── Caddyfile
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── infrastructure.yml
│       └── queues.yml
├── thanos/
│   └── s3.yml                        # Wasabi object store config
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   ├── prometheus.yml
│       │   └── supabase.yml
│       └── dashboards/
│           ├── dashboards.yml        # provider config
│           ├── infrastructure.json
│           ├── docker-health.json
│           ├── queues.json
│           ├── business-kpis.json
│           └── n8n-workflows.json
├── exporters/
│   └── queue-exporter/
│       ├── Dockerfile
│       ├── requirements.txt
│       └── app.py
└── scripts/
    ├── install-node-exporter.sh      # run on each monitored server
    ├── install-cadvisor.sh
    └── add-scrape-target.sh

Docker Compose

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
      - --web.enable-lifecycle
    ports:
      - "127.0.0.1:9090:9090"
 
  thanos-sidecar:
    image: quay.io/thanos/thanos:latest
    restart: unless-stopped
    depends_on: [prometheus]
    command:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://prometheus:9090
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      - --objstore.config-file=/etc/thanos/s3.yml
    volumes:
      # Read-write: the sidecar's shipper records upload state (thanos.shipper.json) in the TSDB directory
      - prometheus_data:/prometheus
      - ./thanos:/etc/thanos:ro
 
  thanos-query:
    image: quay.io/thanos/thanos:latest
    restart: unless-stopped
    depends_on: [thanos-sidecar]
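    command:
      - query
      - --http-address=0.0.0.0:9091
      - --grpc-address=0.0.0.0:10903
      # --endpoint replaces the deprecated --store flag.
      # The sidecar serves only data still in the local TSDB; reading blocks already in Wasabi
      # additionally needs a Thanos Store Gateway (thanos store) registered as a second --endpoint.
      - --endpoint=thanos-sidecar:10901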
    ports:
      - "127.0.0.1:9091:9091"
 
  alertmanager:
    image: prom/alertmanager:latest
    restart: unless-stopped
    volumes:
      - ./alertmanager:/etc/alertmanager:ro
      - alertmanager_data:/alertmanager
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --storage.path=/alertmanager
    ports:
      - "127.0.0.1:9093:9093"
 
  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_SERVER_ROOT_URL=https://monitoring.ecotrans.eu
      - GF_SERVER_DOMAIN=monitoring.ecotrans.eu
      - GF_SMTP_ENABLED=true
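      # Grafana expects host:port for the SMTP host
      - GF_SMTP_HOST=${SMTP_HOST}:587
      - GF_SMTP_USER=${SMTP_USER}
      - GF_SMTP_PASSWORD=${SMTP_PASSWORD}
      - GF_SMTP_FROM_ADDRESS=monitoring@ecotrans.eu
      # Needed so the ${...} placeholders in provisioning/datasources/supabase.yml resolve
      - SUPABASE_DB_HOST=${SUPABASE_DB_HOST}
      - SUPABASE_GRAFANA_PASSWORD=${SUPABASE_GRAFANA_PASSWORD}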
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "127.0.0.1:3000:3000"
    depends_on: [thanos-query]
 
  queue-exporter:
    build: ./exporters/queue-exporter
    restart: unless-stopped
    environment:
      - SUPABASE_URL=${SUPABASE_URL}
      - SUPABASE_SERVICE_KEY=${SUPABASE_SERVICE_KEY}
    ports:
      - "127.0.0.1:9200:9200"
 
  caddy:
    image: caddy:latest
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data
      - caddy_config:/config
 
volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:
  caddy_data:
  caddy_config:

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: ecotrans-fleet
    env: production
 
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
 
rule_files:
  - /etc/prometheus/rules/*.yml
 
scrape_configs:
  # System metrics — all VPS servers
  - job_name: node
    static_configs:
      - targets:
          - server-a.ecotrans.internal:9100   # IONOS — OpenClaw
          - server-b.ecotrans.internal:9100   # OVH — n8n + Flask + Whisper
          - server-f.ecotrans.internal:9100   # OVH — this monitoring server (self)
 
  # Docker container metrics (cAdvisor)
  - job_name: cadvisor
    static_configs:
      - targets:
          - server-b.ecotrans.internal:8080
          - server-f.ecotrans.internal:8080
 
  # n8n workflow metrics — requires N8N_METRICS=true env on n8n container
  - job_name: n8n
    scrape_interval: 60s
    metrics_path: /metrics
    static_configs:
      - targets:
          - server-b.ecotrans.internal:5678
 
  # Supabase queue depths (custom Python exporter)
  - job_name: queue_depths
    scrape_interval: 60s
    static_configs:
      - targets:
          # Compose service name; localhost would resolve to the Prometheus container itself
          - queue-exporter:9200
 
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: thanos
    static_configs:
      - targets: ['thanos-sidecar:10902', 'thanos-query:9091']

Network (decided: public IPs + firewall): Scraping uses the public IP of each server. On every monitored server, firewall rules allow ports 9100 (node_exporter) and 8080 (cAdvisor) from Server F’s IP only; all other sources are blocked. No private network is needed. In practice, replace the hostnames above with the actual public IPs, or map them with extra_hosts on the prometheus service. /etc/hosts aliases on the Server F host are not visible inside the Prometheus container.

Example UFW rule on Server A/B (run once per monitored server):
ufw allow from <SERVER_F_PUBLIC_IP> to any port 9100
ufw allow from <SERVER_F_PUBLIC_IP> to any port 8080
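Once the rules are in place, it can help to confirm from Server F that each exporter port actually accepts connections. The following is a minimal sketch using only the standard library; the helper names (port_open, check_targets) and the example host list are illustrative assumptions, not part of the plan.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


def check_targets(targets: dict[str, list[int]]) -> dict[str, bool]:
    """Map 'host:port' to reachability for every listed exporter port."""
    return {
        f"{host}:{port}": port_open(host, port)
        for host, ports in targets.items()
        for port in ports
    }


# Example call (hypothetical hosts; substitute the real public IPs):
# check_targets({"server-a.ecotrans.internal": [9100],
#                "server-b.ecotrans.internal": [9100, 8080]})
```

Running the same check from any machine other than Server F should report the ports as unreachable if the firewall rules are effective.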

Thanos / Wasabi S3 Configuration

# thanos/s3.yml
type: S3
config:
  bucket: ecotrans-monitoring
  endpoint: s3.eu-central-1.wasabisys.com   # must be explicit — no AWS default
  region: eu-central-1                       # must match Wasabi bucket region
  access_key: ${WASABI_ACCESS_KEY}
  secret_key: ${WASABI_SECRET_KEY}
  insecure: false

Wasabi gotchas (confirmed via Perplexity research):

The endpoint field is mandatory: Thanos defaults to AWS if it is omitted and will fail against Wasabi

region must match the region the Wasabi bucket was created in

Newly uploaded blocks are not queryable immediately; allow 5–10 minutes before querying archived data via Thanos Query

If uploads work against AWS S3 but not Wasabi, the cause is usually one of: endpoint, region, or bucket addressing style (path-style vs. virtual-hosted)

The ${VAR} placeholders in s3.yml are not expanded by Thanos itself; render the file from .env before starting the containers (for example with envsubst)

Data retention tiers:

Tier	Location	Resolution	Retention
Raw	Local Prometheus TSDB	15s	15 days
Raw blocks	Wasabi S3 (uploaded by sidecar every 2h)	15s	30 days
Downsampled 5m	Wasabi S3 (Thanos compact)	5 min	90 days
Downsampled 1h	Wasabi S3 (Thanos compact)	1 hour	1 year

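Estimated Wasabi cost: ~3–8 GB compressed metrics/month → < 1€/month at per-GB rates. Verify this against Wasabi’s minimum monthly storage charge, which may apply to small buckets.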
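The volume estimate can be sanity-checked with a rough calculation. The sketch below assumes about 1.5 bytes per compressed sample (Prometheus typically lands between 1 and 2) and a hypothetical active series count; both numbers are assumptions to replace with values measured on the real stack.

```python
def monthly_tsdb_gb(active_series: int,
                    scrape_interval_s: float = 15.0,
                    bytes_per_sample: float = 1.5,
                    days: int = 30) -> float:
    """Approximate compressed TSDB volume in GB produced over `days`."""
    samples_per_series = days * 86_400 / scrape_interval_s
    return active_series * samples_per_series * bytes_per_sample / 1e9


# Assumed series counts, for illustration only
for series in (10_000, 30_000):
    print(f"{series} series -> {monthly_tsdb_gb(series):.1f} GB/month")
```

Under these assumptions, 10,000 to 30,000 active series at a 15s scrape interval give roughly 2.6 to 7.8 GB per month, which is in line with the 3–8 GB estimate.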

Thanos compaction (run weekly as a one-off rather than a persistent container, which saves RAM). The --wait flag is left out so the process exits once compaction is done:

docker run --rm \
  -v "$(pwd)/thanos:/etc/thanos:ro" \
  quay.io/thanos/thanos:latest \
  compact \
  --objstore.config-file=/etc/thanos/s3.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d

Alerting Rules

# prometheus/rules/infrastructure.yml
groups:
  - name: servers
    rules:
      - alert: ServerDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Server {{ $labels.instance }} is unreachable"
          description: "node_exporter on {{ $labels.instance }} down for 2+ minutes"
 
      - alert: HighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}: {{ $value | printf \"%.0f\" }}%"
 
      - alert: LowDisk
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk on {{ $labels.instance }}: {{ $value | printf \"%.0f\" }}% free"
 
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }}: {{ $value | printf \"%.0f\" }}% used"
 
  - name: containers
    rules:
      # cAdvisor exposes no restart counter; count changes of the container start time instead
      - alert: ContainerHighRestarts
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} restarted {{ $value | printf \"%.0f\" }}× in 1h"
 
      - alert: ContainerHighRestartsCritical
        expr: changes(container_start_time_seconds{name!=""}[30m]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} crash-looping"
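To make the HighCPU expression above easier to reason about, here is a small worked example of the same arithmetic in Python. It is a simplified sketch: the function name and the sample numbers are illustrative, and the real rate() function also handles counter resets and extrapolation, which this ignores.

```python
def cpu_busy_percent(idle_start: list[float], idle_end: list[float],
                     window_s: float) -> float:
    """Mirror 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[window])) * 100.

    idle_start / idle_end hold the idle-seconds counter of each core at the
    beginning and end of the window.
    """
    idle_rates = [(end - start) / window_s
                  for start, end in zip(idle_start, idle_end)]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100.0 - avg_idle * 100.0


# Two cores over a 5-minute (300s) window: idle for 60s and 30s respectively
busy = cpu_busy_percent([0.0, 0.0], [60.0, 30.0], 300.0)
print(f"{busy:.0f}% busy, alert condition met: {busy > 80}")
```

With these numbers the cores are idle 20% and 10% of the time, so the instance is about 85% busy and the condition holds; the alert only fires if that stays true for the full 5m.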

# prometheus/rules/queues.yml
groups:
  - name: processing_queues
    rules:
      - alert: TranscriptionQueueWarning
        expr: supabase_queue_depth{queue="pending_transcriptions",status="pending"} > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Transcription queue: {{ $value | printf \"%.0f\" }} pending jobs"
 
      - alert: TranscriptionQueueCritical
        expr: supabase_queue_depth{queue="pending_transcriptions",status="pending"} > 200
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Transcription queue critical: {{ $value | printf \"%.0f\" }} pending"
 
      - alert: PDFQueueWarning
        expr: supabase_queue_depth{queue="pending_pdf_processing",status="pending"} > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PDF queue: {{ $value | printf \"%.0f\" }} pending jobs"
 
      - alert: StuckProcessingJobs
        expr: supabase_queue_depth{status="processing"} > 0
        for: 35m
        labels:
          severity: warning
        annotations:
          summary: "Jobs stuck in 'processing' for >35m on {{ $labels.queue }} — timeout logic may have failed"
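The for clause in these rules means the expression must be true at every evaluation for the whole duration before the alert fires. The sketch below simulates that behaviour for StuckProcessingJobs; the function name and sample data are illustrative assumptions. It also shows a caveat: the rule looks at the count of processing jobs, so a queue that is continuously busy with different jobs would satisfy it just as a single stuck job would.

```python
def firing_minutes(samples: list[tuple[int, float]], threshold: float,
                   for_minutes: int) -> list[int]:
    """Return the evaluation times (in minutes) at which the alert is firing.

    samples: (minute, value) pairs, one per rule evaluation, in time order.
    """
    firing = []
    active_since = None  # when the expression last became true
    for minute, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = minute
            if minute - active_since >= for_minutes:
                firing.append(minute)
        else:
            # Any evaluation below the threshold resets the pending timer
            active_since = None
    return firing


# One job (possibly a different one each time) in 'processing' for 40 minutes
busy = [(m, 1.0) for m in range(0, 41, 5)]
print(firing_minutes(busy, threshold=0, for_minutes=35))

# Same load, but the queue drains once at minute 20
drained = [(m, 0.0 if m == 20 else 1.0) for m in range(0, 41, 5)]
print(firing_minutes(drained, threshold=0, for_minutes=35))
```

If distinguishing a truly stuck job matters, the exporter would need to expose something like the age of the oldest processing job; that metric does not exist in the current exporter.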

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
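  # Alertmanager does not expand ${VAR} placeholders; render this file from .env
  # (for example with envsubst) before starting the container
  smtp_smarthost: '${SMTP_HOST}:587'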
  smtp_from: 'monitoring@ecotrans.eu'
  smtp_auth_username: '${SMTP_USER}'
  smtp_auth_password: '${SMTP_PASSWORD}'
  smtp_require_tls: true
 
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: email-default
 
  routes:
    - matchers:
        - severity = critical
      receiver: email-critical
      repeat_interval: 1h
 
    - matchers:
        - severity = warning
      receiver: email-warning-digest
      group_wait: 4h
      group_interval: 12h
      repeat_interval: 24h
 
receivers:
  - name: email-critical
    email_configs:
      - to: 'radieu@gmail.com'
        subject: '[CRITICAL] {{ .GroupLabels.alertname }} — Ecotrans Fleet'
        send_resolved: true
 
  - name: email-default
    email_configs:
      - to: 'radieu@gmail.com'
        subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        send_resolved: true
 
  - name: email-warning-digest
    email_configs:
      - to: 'radieu@gmail.com'
        subject: '[WARNING Digest] Ecotrans Fleet Monitoring'
        send_resolved: false
 
inhibit_rules:
  # Critical on same instance suppresses its own warnings
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [instance]
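One way to exercise this routing tree without breaking a real server is to post a synthetic alert to Alertmanager's v2 API. The sketch below builds such a payload and mirrors the route matching above; the function names, the RoutingTest alert name, and the default base URL are assumptions for illustration.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request


def build_test_alert(severity: str, instance: str = "test-instance",
                     minutes: int = 5) -> list[dict]:
    """Build a payload for POST /api/v2/alerts with one short-lived alert."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": "RoutingTest", "severity": severity,
                   "instance": instance},
        "annotations": {"summary": f"Synthetic {severity} alert for routing test"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def expected_receiver(severity: str) -> str:
    """Mirror the route tree above: first matching child route, else the default."""
    routes = {"critical": "email-critical", "warning": "email-warning-digest"}
    return routes.get(severity, "email-default")


def send_alert(alerts: list[dict], base_url: str = "http://localhost:9093") -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return resp.status


# Example (run on Server F, where port 9093 is bound to localhost):
# send_alert(build_test_alert("critical"))
```

A critical test alert should produce an email after the 30s group_wait, while a warning one is held for the 4h group_wait of the digest route.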

Grafana Provisioning

Data Sources

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://thanos-query:9091
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: 15s

# grafana/provisioning/datasources/supabase.yml
apiVersion: 1
datasources:
  - name: Supabase
    type: postgres
    uid: supabase
    # Use direct connection :5432, NOT pooler :6543
    # Pooler (Supavisor) is for short-lived connections; Grafana holds persistent connections
    url: ${SUPABASE_DB_HOST}:5432
    database: postgres
    user: grafana_readonly
    secureJsonData:
      password: ${SUPABASE_GRAFANA_PASSWORD}
    jsonData:
      sslmode: require       # 'require' is sufficient; 'verify-full' needs CA cert file
      maxOpenConns: 5
      maxIdleConns: 2
      connMaxLifetime: 14400
      postgresVersion: 1500
      timescaledb: false

Supabase — create read-only role (SQL Editor):

CREATE ROLE grafana_readonly WITH LOGIN PASSWORD 'strong-password-here';
GRANT USAGE ON SCHEMA public TO grafana_readonly;
GRANT SELECT ON
  pending_transcriptions,
  pending_pdf_processing,
  incidents,
  fleet_positions,
  whatsapp_messages
TO grafana_readonly;

Caddy Reverse Proxy

# Caddyfile
monitoring.ecotrans.eu {
    reverse_proxy grafana:3000
    encode gzip
}

Caddy auto-provisions TLS via Let’s Encrypt. No manual certificate management needed.

Custom Queue Depth Exporter

Lightweight Python service that queries Supabase every 60s and exposes metrics in Prometheus format.

# exporters/queue-exporter/app.py
import os
import time
from prometheus_client import start_http_server, Gauge
from supabase import create_client
 
QUEUE_DEPTH = Gauge(
    'supabase_queue_depth',
    'Number of jobs by queue table and status',
    ['queue', 'status']
)
 
QUEUES = ['pending_transcriptions', 'pending_pdf_processing']
STATUSES = ['pending', 'processing', 'failed', 'failed_permanently', 'completed']
 
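def collect(client):
    """Refresh the gauge for every queue/status pair; on error the previous value is kept."""
    for table in QUEUES:
        for status in STATUSES:
            try:
                # count='exact' makes the API return the total number of matching rows
                result = client.table(table).select('id', count='exact').eq('status', status).execute()
                QUEUE_DEPTH.labels(queue=table, status=status).set(result.count or 0)
            except Exception as e:
                # flush=True so the message shows up in `docker compose logs` immediately
                print(f"Error collecting {table}/{status}: {e}", flush=True)
 
if __name__ == '__main__':
    # Create the client once instead of on every collection cycle
    client = create_client(os.environ['SUPABASE_URL'], os.environ['SUPABASE_SERVICE_KEY'])
    start_http_server(9200)
    print("Queue exporter running on :9200", flush=True)
    while True:
        collect(client)
        time.sleep(60)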

# exporters/queue-exporter/Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]

# exporters/queue-exporter/requirements.txt
prometheus_client==0.20.0
supabase==2.4.0
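To check the exporter's output without going through Prometheus, the text served on /metrics can be parsed directly. This is a minimal standard-library sketch; the parser name and the sample text are illustrative, and it only understands the simple label format this exporter produces.

```python
import re

# Matches lines such as: supabase_queue_depth{queue="x",status="pending"} 12.0
SAMPLE_RE = re.compile(r'^supabase_queue_depth\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_queue_depths(metrics_text: str) -> dict[tuple[str, str], float]:
    """Return {(queue, status): value} from Prometheus text exposition output."""
    depths = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # skips # HELP / # TYPE lines and unrelated metrics
        labels = dict(LABEL_RE.findall(match.group(1)))
        depths[(labels["queue"], labels["status"])] = float(match.group(2))
    return depths


sample = """# HELP supabase_queue_depth Number of jobs by queue table and status
# TYPE supabase_queue_depth gauge
supabase_queue_depth{queue="pending_transcriptions",status="pending"} 12.0
supabase_queue_depth{queue="pending_pdf_processing",status="failed"} 3.0
"""
print(parse_queue_depths(sample))
```

In practice the input would be the body returned by curl http://localhost:9200/metrics on Server F.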

n8n Metrics Setup

On Server B, add these env vars to the n8n container and recreate (not just restart):

# server-b docker-compose.yml — n8n service additions
services:
  n8n:
    environment:
      - N8N_METRICS=true
      - N8N_METRICS_INCLUDE_QUEUE_METRICS=true
      - N8N_METRICS_QUEUE_METRICS_INTERVAL=5000

Verify with: curl http://server-b:5678/metrics | grep n8n_

Common pitfall: Container must be recreated (docker compose up -d --force-recreate n8n), not just restarted — env vars don’t update on docker compose restart.

Dashboard Specifications

Dashboard 1: Infrastructure Overview

Stat row: All servers UP/DOWN (colored green/red)
Time series: CPU % per server (5m avg)
Time series: RAM used % per server
Gauge: Disk free % per server (threshold: warn <20%, critical <10%)
Time series: Network I/O (Mbps) per server

Dashboard 2: Docker Container Health

Table: All containers — name | status | restart count 1h | CPU% | RAM MB
Time series: Container restart events over 24h
Time series: Per-service CPU and RAM (n8n, flask-pdf, faster-whisper)

Dashboard 3: Processing Queues (dual source)

Big stat: Current pending transcriptions (Prometheus — queue-exporter)
Big stat: Current pending PDF jobs (Prometheus — queue-exporter)
Time series: Queue depth over time by status (pending / processing / failed)

Bar chart: Jobs completed per hour (Supabase SQL):

SELECT
  date_trunc('hour', updated_at) AS time,
  COUNT(*) AS completed
FROM pending_transcriptions
WHERE status = 'completed' AND updated_at > NOW() - INTERVAL '24 hours'
GROUP BY 1 ORDER BY 1

Stat: Stuck jobs (processing > 35 min)
Stat: Failed permanently last 24h

Dashboard 4: Business KPIs (Supabase SQL panels)

Stat: Incidents created today vs yesterday

Time series: Incidents per day (30 days rolling):

SELECT date_trunc('day', created_at) AS time, COUNT(*) AS incidents
FROM incidents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY 1 ORDER BY 1

Stat: GPS coverage — % of fleet updated in last 10 minutes:

SELECT ROUND(
  100.0 * COUNT(*) FILTER (WHERE last_updated > NOW() - INTERVAL '10 minutes')
  / NULLIF(COUNT(*), 0), 1
) AS coverage_pct
FROM fleet_positions
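The coverage query above is a filtered count divided by the total, with NULLIF guarding against an empty table. The same arithmetic, written out in Python as a sketch (the function name and sample timestamps are illustrative):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def coverage_pct(last_updated: list[datetime], now: datetime,
                 window: timedelta = timedelta(minutes=10)) -> Optional[float]:
    """Percentage of vehicles whose position was updated within the window."""
    if not last_updated:
        # Mirrors NULLIF(COUNT(*), 0): NULL instead of a division by zero
        return None
    fresh = sum(1 for ts in last_updated if ts > now - window)
    return round(100.0 * fresh / len(last_updated), 1)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
fleet = [now - timedelta(minutes=2),    # fresh
         now - timedelta(minutes=9),    # fresh
         now - timedelta(minutes=45)]   # stale
print(coverage_pct(fleet, now))
```

Two of three vehicles are fresh here, so the panel would show 66.7; with an empty fleet_positions table it shows no value rather than an error.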

Time series: WhatsApp messages received per hour
Time series: PDFs processed per day

Dashboard 5: n8n Workflow Health (Prometheus)

Stat: Workflow success rate last 24h
Time series: Execution counts by workflow name
Time series: Execution duration p50/p95
Table: Failed workflows with error type and count

Claude Code SSH Access Model

Initial server setup

# Run on Server F during first provisioning (as root)
adduser --disabled-password --gecos "" claude-admin
# Note: docker group membership is effectively root-equivalent on this host
usermod -aG docker claude-admin
 
# Sudoers: limited to docker and systemctl
cat > /etc/sudoers.d/claude-admin <<EOF
claude-admin ALL=(ALL) NOPASSWD: /usr/bin/docker, /usr/local/bin/docker, /usr/bin/systemctl
EOF
chmod 440 /etc/sudoers.d/claude-admin
 
# Add Claude Code's public key
mkdir -p /home/claude-admin/.ssh
echo "ssh-ed25519 AAAA...KEY... claude-code-monitoring" >> /home/claude-admin/.ssh/authorized_keys
chmod 700 /home/claude-admin/.ssh && chmod 600 /home/claude-admin/.ssh/authorized_keys
# sshd rejects keys in files the login user does not own
chown -R claude-admin:claude-admin /home/claude-admin/.ssh

The Claude Code SSH private key is stored as a session environment variable (VPS_MONITORING_SSH_KEY) — never committed to the repo.

Common admin operations

# Check stack health
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose ps"
 
# View live logs
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose logs --tail=100 grafana"
 
# Reload Prometheus config (hot reload — no downtime)
ssh claude-admin@<server-f-ip> "curl -X POST http://localhost:9090/-/reload"
 
# Restart single service
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose restart alertmanager"
 
# Pull latest images and redeploy
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose pull && docker compose up -d"
 
# Push updated config file (example: alert rules)
scp ./prometheus/rules/queues.yml claude-admin@<ip>:/opt/monitoring/prometheus/rules/queues.yml
ssh claude-admin@<ip> "curl -X POST http://localhost:9090/-/reload"
 
# Check Thanos Wasabi upload status
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose logs thanos-sidecar | grep -E 'uploaded|error|failed'"
 
# Run Thanos compaction on-demand (no --wait, so the container exits when compaction finishes)
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker run --rm \
  -v \$(pwd)/thanos:/etc/thanos \
  quay.io/thanos/thanos:latest \
  compact --objstore.config-file=/etc/thanos/s3.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d"

Installing exporters on monitored servers (scripts)

#!/usr/bin/env bash
# scripts/install-node-exporter.sh: run on Server A, B, etc.
set -euo pipefail   # stop on the first failed step (e.g. a failed download)
NODE_EXPORTER_VERSION=1.8.0
wget -qO- https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz | tar xz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter 2>/dev/null || true
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=0.0.0.0:9100
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now node_exporter
echo "node_exporter running: $(systemctl is-active node_exporter)"

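# scripts/install-cadvisor.sh: run on Docker hosts (B, F)
# Published on all interfaces so Server F can scrape it; a 127.0.0.1 bind is only reachable locally.
# Docker-published ports bypass UFW, so restrict 8080 to Server F in the provider firewall
# or in the DOCKER-USER iptables chain.
docker run -d \
  --restart=unless-stopped \
  --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  gcr.io/cadvisor/cadvisor:latest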

Environment Variables

# .env.example — copy to .env and fill on server; NEVER commit .env
GRAFANA_ADMIN_PASSWORD=
 
# SMTP — Mailgun EU region (smtp.eu.mailgun.org for EU-hosted accounts)
SMTP_HOST=smtp.eu.mailgun.org
SMTP_USER=
SMTP_PASSWORD=
 
# Server F public IP — used in firewall rules on monitored servers
SERVER_F_PUBLIC_IP=
 
# Wasabi S3 (scoped IAM key for ecotrans-monitoring bucket only)
WASABI_ACCESS_KEY=
WASABI_SECRET_KEY=
 
# Supabase
SUPABASE_URL=https://<project-ref>.supabase.co
SUPABASE_SERVICE_KEY=
SUPABASE_DB_HOST=db.<project-ref>.supabase.co
SUPABASE_GRAFANA_PASSWORD=

Security Model

Layer	Rule
OVH/VPS firewall	Open globally: 22, 80, 443. Open to Server F IP only: 9100, 8080 on monitored servers
All containers except Caddy	Published ports bind to `127.0.0.1`; no direct external access
Caddy	TLS for all external traffic; auto-renew via Let’s Encrypt
Grafana external	`https://monitoring.ecotrans.eu` — login required
Prometheus / Alertmanager	Not exposed externally — Grafana queries internally
SSH access	Key-based only; `PasswordAuthentication no`; `PermitRootLogin no`
Supabase role	`grafana_readonly` — SELECT only on 5 specific tables
Wasabi IAM	Scoped key with access to `ecotrans-monitoring` bucket only
Secrets	All in `.env` on server — never committed to git
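The loopback-binding rule in the table is easy to break by accident when a service is added. As a rough guard, a script can scan docker-compose.yml for port mappings that lack a 127.0.0.1 prefix. The sketch below uses a regular expression rather than a YAML parser to stay dependency-free; the function name and the allowed-port set are assumptions, and it only recognises the short "host:container" syntax used in this plan.

```python
import re

# Matches list items such as  - "127.0.0.1:9090:9090"  or  - "80:80"
PORT_RE = re.compile(r'^\s*-\s*"?((?:\d{1,3}\.){3}\d{1,3}:)?(\d+):(\d+)"?\s*$')


def exposed_ports(compose_text: str,
                  allowed: frozenset = frozenset({"80", "443"})) -> list[str]:
    """Return host ports published on all interfaces that are not in `allowed`."""
    exposed = []
    for line in compose_text.splitlines():
        match = PORT_RE.match(line)
        if match and match.group(1) is None and match.group(2) not in allowed:
            exposed.append(match.group(2))
    return exposed


example = '''
    ports:
      - "127.0.0.1:9090:9090"
      - "80:80"
      - "9093:9093"
'''
print(exposed_ports(example))
```

An empty result for the real compose file means only Caddy's ports 80 and 443 are published on public interfaces.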

Daily Health Report (n8n Workflow)

Grafana OSS doesn’t support scheduled email reports — this is handled by an n8n workflow running at 08:00 daily.

Workflow: monitoring-daily-report

Cron (08:00 daily)
  → HTTP POST to the Grafana snapshot API (create a snapshot for each dashboard)
  → HTTP GET screenshot via Grafana image renderer (if installed)
  → Build HTML email with:
      - Infrastructure status summary (query Prometheus API)
      - Queue depths (query Prometheus API)
      - Incident count last 24h (query Supabase)
      - Any active alerts (query Alertmanager API)
  → Send email via Mailgun

Prometheus API queries for the email (HTTP Request nodes):

# All targets status
GET http://<server-f-ip>:9090/api/v1/query?query=up

# Active alerts
GET http://<server-f-ip>:9093/api/v2/alerts

# Queue depths
GET http://<server-f-ip>:9090/api/v1/query?query=supabase_queue_depth{status="pending"}

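These calls go from n8n (Server B) to ports 9090 and 9093 on Server F. The compose file binds both ports to 127.0.0.1, so a firewall rule alone is not enough: either publish them on an address reachable from Server B and allow only Server B’s IP, or proxy them through Caddy with authentication. Label selectors such as {status="pending"} must be URL-encoded in the query string.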
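For reference, this is roughly how the instant-query URLs and the response handling look outside of n8n. It is a sketch under assumptions: the function names and the base URL are illustrative, and the response shape follows the standard Prometheus HTTP API for instant vector queries.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an /api/v1/query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def summarize_up(response_body: str) -> dict[str, bool]:
    """Turn the JSON response of query=up into {'job/instance': is_up}."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    return {
        f'{r["metric"].get("job", "?")}/{r["metric"].get("instance", "?")}':
            r["value"][1] == "1"  # sample values arrive as strings
        for r in payload["data"]["result"]
    }


print(instant_query_url("http://server-f:9090",
                        'supabase_queue_depth{status="pending"}'))
```

The braces, equals sign, and quotes in the selector come out percent-encoded, which is what the HTTP Request node needs to send as well.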

Alternative (simpler): Run grafana-image-renderer as a separate container next to Grafana, then use Grafana’s /render/d/<dashboard-id> endpoint to generate PNG screenshots and attach them to the email. No custom queries needed.

# grafana service addition in docker-compose.yml
  grafana:
    environment:
      - GF_RENDERING_SERVER_URL=http://renderer:8081/render
      - GF_RENDERING_CALLBACK_URL=http://grafana:3000/
    depends_on: [renderer]
 
  renderer:
    image: grafana/grafana-image-renderer:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8081:8081"

Testing Environment

For initial setup and testing, use the existing VPS defined in .env.local instead of provisioning OVH Server F.

The .env.local file (in this repo, not committed) contains:

TEST_VPS_HOST=        # IP or hostname of the test VPS
TEST_VPS_USER=        # SSH user (claude-admin or root for initial setup)
TEST_VPS_SSH_KEY=     # path to SSH private key

What changes for the test environment:

Same docker-compose.yml and configs — no test-specific variants
Skip DNS (monitoring.ecotrans.eu) — access Grafana via http://<test-vps-ip>:3000 directly (or via Caddy with a self-signed cert)
Wasabi bucket: use a separate ecotrans-monitoring-test bucket to avoid polluting production TSDB blocks
Monitored targets: only the test VPS itself (self-monitoring) + Server B if accessible; skip IONOS server A
The test VPS is treated as a throwaway — recreate freely

Switching target in Claude Code sessions:

# Test VPS
ssh $TEST_VPS_USER@$TEST_VPS_HOST "cd /opt/monitoring && docker compose ps"
 
# Production (OVH Server F, later)
ssh claude-admin@<server-f-ip> "cd /opt/monitoring && docker compose ps"

Implementation Checklist

Phase 0 — Test environment (on existing VPS from .env.local)

SSH into test VPS, install Docker + Docker Compose v2
Create Wasabi bucket ecotrans-monitoring-test
Deploy full stack (docker compose up -d)
Install node_exporter + cAdvisor on the test VPS (self-monitoring)
Verify Prometheus scrapes self → Grafana shows data
Verify Thanos uploads to ecotrans-monitoring-test bucket
Build and validate all 5 dashboards
Test alert pipeline: stop node_exporter → email received
Test n8n daily report workflow

Phase 1 — Production server provisioning (OVH Server F)

Provision OVH VPS2 Server F (Ubuntu 24.04 LTS, 6 cores, 12GB RAM)
Create claude-admin user, add SSH key, configure sudoers
Install Docker + Docker Compose v2
OVH firewall: allow 22, 80, 443
Create Wasabi bucket ecotrans-monitoring in eu-central-1
Create Wasabi IAM user scoped to that bucket; copy keys to .env
Create Supabase grafana_readonly role (SQL above)
DNS: monitoring.ecotrans.eu → Server F IP

Phase 2 — Core stack

Clone monitoring repo to /opt/monitoring on Server F
Populate .env with all secrets
Validate Prometheus config: docker run --rm --entrypoint promtool -v "$(pwd)/prometheus:/etc/prometheus:ro" prom/prometheus:latest check config /etc/prometheus/prometheus.yml
docker compose up -d
Verify Prometheus UI: curl http://localhost:9090/-/healthy
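Verify Thanos sidecar: port 10902 is not published to the host, so check that the thanos job targets are up in Prometheus → Status → Targets (or run the health check from inside the Compose network)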
Verify Grafana: https://monitoring.ecotrans.eu
After ~2h: confirm first block uploaded to Wasabi

Phase 3 — Exporters on monitored servers

Run install-node-exporter.sh on Server A (IONOS)
Run install-node-exporter.sh on Server B (OVH)
Run install-cadvisor.sh on Server B
Run install-cadvisor.sh on Server F (self)
Enable n8n metrics on Server B: add env vars, recreate container
Verify all targets green: Prometheus → Status → Targets
Verify queue-exporter: curl http://localhost:9200/metrics | grep supabase_queue

Phase 4 — Grafana dashboards

Confirm Prometheus data source: test query up
Confirm Supabase data source: test query SELECT 1
Build Dashboard 1 (Infrastructure Overview)
Build Dashboard 2 (Docker Health)
Build Dashboard 3 (Processing Queues)
Build Dashboard 4 (Business KPIs)
Build Dashboard 5 (n8n Workflows)
Export all dashboard JSON → commit to grafana/provisioning/dashboards/

Phase 5 — Alerting

Validate rules: promtool check rules prometheus/rules/*.yml
Test P1: manually stop node_exporter → confirm email within 3 min
Test P2: simulate warning → confirm digest batching works
Set up weekly Thanos compaction cron on Server F

Phase 6 — Hardening

Add Docker log rotation to compose (json-file driver, max 50MB × 3)
Document runbook: “container down”, “disk full”, “Wasabi upload stalled”
Verify OVH VPS snapshots enabled (monthly)

Cost Summary

Component	Cost
OVH VPS2 Server F (6 cores, 12GB RAM)	7€/month
Wasabi S3 `ecotrans-monitoring`	~1€/month
`monitoring.ecotrans.eu` subdomain	included in domain
TLS certificate (Caddy + Let’s Encrypt)	free
Total monitoring stack	~8€/month

Architectural Decisions

#	Decision	Rationale
1	Network: public IPs + UFW firewall rules	Simplest — no private network provisioning; firewall on each server allows ports 9100/8080 from Server F IP only
2	Daily report: n8n workflow	Grafana OSS has no scheduled reports; n8n already in stack — see Daily Health Report section above
3	SMTP: Mailgun EU	Already in stack; EU region host is `smtp.eu.mailgun.org`
4	Testing: existing VPS from `.env.local`	Avoids early OVH cost; full stack tested before production provisioning — see Testing Environment section
5	Thanos compact: weekly one-off Docker run	12GB VPS can’t spare ~2GB RAM for persistent compactor; cron job is sufficient for this data volume
