Monitoring Stack — Operations Workbook

Covers: Prometheus, Thanos (sidecar + query), Alertmanager, Loki, Promtail, Blackbox Exporter, Caddy — all running on IONOS VPS (vps-i1).


Architecture

IONOS VPS (217.154.82.162)  ─── Caddy (443 TLS) ─── public HTTPS endpoints
│
├── prometheus:9090          Metrics collection, 15d local TSDB
│   └── thanos-sidecar:10901 Uploads 2h TSDB blocks → Wasabi ecotrans-monitoring
├── thanos-query:10904       Unified PromQL: local + Wasabi long-term
├── alertmanager:9093        Alert routing → email (Mailgun EU)
├── loki:3100                Log aggregation (14-day retention)
├── promtail                 Ships Docker logs → Loki
├── blackbox-exporter:9115   HTTP/HTTPS probes (synthetic checks)
└── caddy:80/443             TLS termination for all above
 
Long-term storage: Wasabi s3://ecotrans-monitoring (eu-central-1)

Compose file: /opt/p24-infra/monitoring/docker-compose.yml
Public URLs:

Service | URL
Prometheus | https://prometheus.vps-i1.infra.zintegrowana.online
Alertmanager | https://alertmanager.vps-i1.infra.zintegrowana.online

Both are protected by Caddy basic_auth (username: admin; password: the value of GRAFANA_ADMIN_PASSWORD in .env).


Config Management

File | In repo? | Purpose
prometheus/prometheus.yml |  | Scrape targets
prometheus/rules/*.yml |  | Alert rules
prometheus/blackbox.yml |  | Blackbox probe config
alertmanager/alertmanager.yml |  | Alert routing + receivers
loki/loki-config.yml |  | Loki storage + retention
promtail/config-vps-i1.yml |  | Log scrape config
Caddyfile |  | Reverse proxy + TLS
thanos/s3.yml | ✅ template | Wasabi S3 config (from template + .env)
.env | ❌ (.env.example) | Secrets

Updating alert rules (hot reload)

# Edit monitoring/prometheus/rules/*.yml → commit → on vps-i1:
git pull
curl -X POST http://localhost:9090/-/reload
# No restart needed

Updating Alertmanager config

# Edit monitoring/alertmanager/alertmanager.yml → commit → on vps-i1:
git pull
curl -X POST http://localhost:9093/-/reload
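
Both reloads above apply whatever is on disk, so a config error is only discovered when the reload call fails. A minimal sketch of validating first, assuming promtool and amtool are available inside the official images, that the in-container paths below match the compose volume mounts (both paths are assumptions), and that Prometheus runs with --web.enable-lifecycle, which its /-/reload endpoint requires:

```shell
# Validate a config inside its container, then hot-reload the service over HTTP.
# The in-container paths below are assumptions; match them to the compose volume mounts.
RULES_GLOB="${RULES_GLOB:-/etc/prometheus/rules/*.yml}"
AM_CONFIG="${AM_CONFIG:-/etc/alertmanager/alertmanager.yml}"

reload_prometheus_rules() {
  # sh -c lets the glob expand inside the container rather than on the host.
  docker compose exec -T prometheus sh -c "promtool check rules $RULES_GLOB" || {
    echo "rule validation failed; not reloading" >&2
    return 1
  }
  # -f turns an HTTP error (e.g. lifecycle API disabled) into a non-zero exit code.
  curl -fsS -X POST http://localhost:9090/-/reload && echo "prometheus reloaded"
}

reload_alertmanager() {
  docker compose exec -T alertmanager amtool check-config "$AM_CONFIG" || {
    echo "config validation failed; not reloading" >&2
    return 1
  }
  curl -fsS -X POST http://localhost:9093/-/reload && echo "alertmanager reloaded"
}
```

Run reload_prometheus_rules or reload_alertmanager from /opt/p24-infra/monitoring after git pull.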

Deployment

Full stack bring-up

cd /opt/p24-infra/monitoring
docker compose up -d

Restart individual service

docker compose restart prometheus
docker compose restart alertmanager
docker compose restart caddy

Check stack health

cd /opt/p24-infra/monitoring
docker compose ps
docker compose logs --tail=30 prometheus
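
To spot failing healthchecks without reading the full docker compose ps output, the Docker CLI can filter on health state. A small sketch; the function names are illustrative, and the filter covers every container on the host, not only this compose project:

```shell
# Print the names of containers whose Docker healthcheck currently reports "unhealthy".
unhealthy_containers() {
  docker ps --filter health=unhealthy --format '{{.Names}}'
}

# Return non-zero when anything is unhealthy, so the check can be used from cron.
stack_is_healthy() {
  bad=$(unhealthy_containers)
  if [ -n "$bad" ]; then
    echo "unhealthy: $bad" >&2
    return 1
  fi
  echo "no unhealthy containers"
}
```

Containers still in the "starting" state are not reported, so a clean result right after docker compose up -d is worth re-checking a minute later.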

Backup

What needs backing up

Data | Backup method | Schedule | Destination
Prometheus TSDB | Thanos sidecar → Wasabi | Continuous (2h blocks) | s3://ecotrans-monitoring/
Alertmanager silences | Not backed up |  | Gap: silences are ephemeral
Caddy TLS certs | caddy_data volume, not backed up |  | Gap: auto-renewed via ACME
Prometheus config + rules | Git repo | On push | GitHub

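Caddy certs note: If caddy_data is lost, Caddy re-requests Let's Encrypt certificates automatically on restart. Expect brief downtime (~1 min) while the certificates are reissued. This is not a data-loss risk, although repeated reissuance can run into Let's Encrypt rate limits.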

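Manual Prometheus backup (emergency)

# The sidecar uploads each completed 2h block on its own; check its logs to confirm uploads.
# `thanos compact` only compacts blocks already in the bucket and does not upload pending
# local blocks. Run a single compactor per bucket; without --wait it exits when finished.
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  quay.io/thanos/thanos:v0.36.1 \
  compact --objstore.config-file /s3.yml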

Restore

Prometheus — Restore from Wasabi

# 1. List available blocks
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  quay.io/thanos/thanos:v0.36.1 \
  tools bucket ls --objstore.config-file /s3.yml
 
# 2. Stop Prometheus and Thanos sidecar
cd /opt/p24-infra/monitoring
docker compose stop thanos-sidecar prometheus
 
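# 3. Restore a specific block by copying its directory out of the bucket.
#    `tools bucket rewrite` modifies blocks inside the bucket and has no download mode,
#    so a plain S3 copy is used. Assumptions: Wasabi keys are exported as
#    AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, blocks sit at the bucket root, and the
#    volume is named prometheus_data (Compose may prefix it; check `docker volume ls`).
docker run --rm \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  -v prometheus_data:/prometheus \
  amazon/aws-cli s3 cp --recursive \
  --endpoint-url https://s3.eu-central-1.wasabisys.com \
  s3://ecotrans-monitoring/<BLOCK_ULID> /prometheus/<BLOCK_ULID>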
 
# 4. Start Prometheus (Thanos will continue uploading)
docker compose up -d prometheus thanos-sidecar
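
Before step 4 it is worth confirming that the restored block directory is complete, since Prometheus cannot load a block with missing files. A sketch that checks for the standard TSDB block layout (meta.json, index, a non-empty chunks/ directory); the function name is hypothetical:

```shell
# Return 0 if DIR looks like a complete TSDB block: meta.json, index, non-empty chunks/.
block_is_complete() {
  dir=$1
  [ -f "$dir/meta.json" ] || { echo "missing meta.json in $dir" >&2; return 1; }
  [ -f "$dir/index" ] || { echo "missing index in $dir" >&2; return 1; }
  [ -d "$dir/chunks" ] && [ -n "$(ls -A "$dir/chunks")" ] || {
    echo "missing chunks in $dir" >&2
    return 1
  }
  echo "block $(basename "$dir") looks complete"
}
```

For example, as root: block_is_complete "$(docker volume inspect -f '{{ .Mountpoint }}' prometheus_data)/<BLOCK_ULID>" (the volume name is an assumption; confirm it with docker volume ls).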

Caddy — Fresh cert after volume loss

# Just restart — Caddy auto-renews
docker compose up -d caddy
# Monitor logs during first start
docker compose logs -f caddy
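
Once Caddy is back up, the certificate it serves can be checked from any host with openssl. A minimal sketch; the function name is illustrative:

```shell
# Print the expiry (notAfter) date of the certificate served for HOST on port 443.
cert_not_after() {
  host=$1
  # echo closes stdin so s_client exits after the handshake instead of waiting for input.
  echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -enddate \
    | sed 's/^notAfter=//'
}
```

Usage: cert_not_after prometheus.vps-i1.infra.zintegrowana.online. A freshly issued certificate should show an expiry date well in the future.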

Loki — Data loss is acceptable

Loki stores logs with 14-day retention. On a fresh start, log history is empty — only new logs will appear. This is acceptable by design.


Healthchecks

All services have Docker healthcheck: directives (added 2026-05-14):

Service | Check endpoint | Interval
prometheus | /-/healthy | 30s
thanos-sidecar | /-/healthy (port 10902) | 30s
thanos-query | /-/healthy (port 10904) | 30s
alertmanager | /-/healthy | 30s
loki | /ready | 30s
promtail | /ready (port 9080) | 30s
blackbox-exporter | /health | 30s
caddy | /config/ (admin port 2019) | 30s
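
The endpoints in the table can also be probed by hand from the VPS. A sketch, assuming each port is published on localhost (inside the compose network the service name would replace localhost); paths and ports are copied from the table and the architecture overview, and the names below are illustrative:

```shell
# name=URL pairs copied from the healthcheck table; localhost port publishing is assumed.
HEALTH_ENDPOINTS="
prometheus=http://localhost:9090/-/healthy
thanos-sidecar=http://localhost:10902/-/healthy
thanos-query=http://localhost:10904/-/healthy
alertmanager=http://localhost:9093/-/healthy
loki=http://localhost:3100/ready
promtail=http://localhost:9080/ready
blackbox-exporter=http://localhost:9115/health
caddy=http://localhost:2019/config/
"

# Probe every endpoint; print one OK/FAIL line each and return non-zero if any failed.
probe_all() {
  failed=0
  for entry in $HEALTH_ENDPOINTS; do
    name=${entry%%=*}   # text before the first "="
    url=${entry#*=}     # text after the first "="
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return $failed
}
```

A FAIL line here with a healthy Docker status usually means the port is not published to the host, not that the service is down.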

External probes: Prometheus infrastructure.yml rules fire ServerDown within 2 min.


Alert Rules Reference

Rule file | Key alerts
infrastructure.yml | ServerDown, ContainerCrashLooping, LowDisk, HighMemory, HighCPU
backups.yml | BackupStale (>26h), BackupSizeRegression
synthetic.yml | EndpointDown, EndpointSlow
security.yml | SSHAuthFailures
costs.yml | VercelApproachingFreeTier, SupabaseDbSizeApproachingPro
queues.yml | TranscriptionQueueCritical
loki.yml | LokiIngestionStopped
n8n.yml | N8nWorkflowFailed, N8nSnapshotStale

Password Rotation

basic_auth (Prometheus + Alertmanager public URLs)

Caddy basic_auth uses GRAFANA_ADMIN_PASSWORD (same as Grafana admin). Rotate via:

# Generate bcrypt hash for Caddyfile
docker run --rm caddy:2.8-alpine caddy hash-password --plaintext "${NEW_PASS}"
# Update Caddyfile with new hash, then:
docker compose restart caddy

Also update .env GRAFANA_ADMIN_PASSWORD — see docs/grafana-operations.md.
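
After restarting Caddy, the rotation can be verified from outside the VPS. A sketch, assuming the admin username stated earlier; the function names are illustrative:

```shell
# Print the HTTP status code for URL, optionally authenticating with "user:pass".
http_status() {
  url=$1
  creds=${2:-}
  if [ -n "$creds" ]; then
    curl -s -o /dev/null -w '%{http_code}' -u "$creds" "$url"
  else
    curl -s -o /dev/null -w '%{http_code}' "$url"
  fi
}

# Expect 401 without credentials and 200 with the new password.
verify_basic_auth() {
  target=$1
  pass=$2
  [ "$(http_status "$target")" = "401" ] || { echo "endpoint is not protected" >&2; return 1; }
  [ "$(http_status "$target" "admin:$pass")" = "200" ] || { echo "new password rejected" >&2; return 1; }
  echo "basic_auth OK"
}
```

Usage: verify_basic_auth https://prometheus.vps-i1.infra.zintegrowana.online/-/healthy "$NEW_PASS".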


Troubleshooting

Symptom | Cause | Fix
Prometheus targets show DOWN | Exporter container restarted | docker compose restart <exporter>
Thanos upload stalled | Wasabi connectivity issue | Check docker compose logs thanos-sidecar; verify S3 creds in .env
Alertmanager not sending email | SMTP config wrong | curl -X POST http://localhost:9093/-/reload; check Mailgun dashboard
Caddy TLS renewal failed | Rate limit or DNS not resolving | Check Caddy logs; verify DNS wildcard record
Loki not receiving logs | Promtail cannot reach Loki | docker compose restart promtail; check loki_data volume space
blackbox-exporter probe fails | Target unreachable | Verify URL + Caddy config for the target service
queue-exporter scrapes 0 queues | All rows active=false in registry | SELECT * FROM dev_r_exporters_queues; then set active=true for desired rows
queue-exporter query error | RLS policy on target table blocks grafana_readonly | Grant SELECT on the table; check for RLS policies referencing other tables
pg-stats-exporter connection error | IPv6 DNS / wrong DB host | Ensure SUPABASE_DB_HOST is the session pooler (aws-1-eu-central-1.pooler.supabase.com), not db.*.supabase.co
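
For the first symptom, the set of down targets can be listed from the command line instead of the web UI. A sketch using promtool query instant inside the prometheus container; the function name is illustrative:

```shell
# List scrape targets that Prometheus currently reports as down (up == 0).
down_targets() {
  docker compose exec -T prometheus promtool query instant http://localhost:9090 'up == 0'
}
```

Each output line carries the job and instance labels of a failing target; empty output means every target is up.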

Custom Exporters

All custom exporters run as Docker containers on vps-i1 (IONOS), built from monitoring/exporters/. Each exposes a /metrics endpoint scraped by Prometheus every 60s.

Exporter | Port | Source | What it publishes
queue-exporter | :9200 | Supabase DB (psycopg2, grafana_readonly) | Queue depths by status for tables in dev_r_exporters_queues
pg-stats-exporter | :9201 | Supabase DB (psycopg2, grafana_readonly) | Top-200 slowest queries from extensions.pg_stat_statements
cost-exporter | :9210 | Vercel API + Supabase mgmt API + Wasabi S3 | Monthly spend / usage per service (daily refresh)
vercel-exporter | :9202 | Vercel API | Deployment state per project (every 5m)
backup-exporter | :9220 | /opt/backups/backup-status.prom | Backup age and size freshness

Rebuild after code change

cd /opt/p24-infra/monitoring
git pull
docker compose up -d --no-deps --build queue-exporter   # or whichever exporter
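
After a rebuild, a quick check that the exporter is serving samples again avoids waiting for the next scrape. A sketch that counts sample lines in the Prometheus text exposition format; the function name is illustrative, and the usage line assumes the exporter port is published on localhost:

```shell
# Count sample lines (neither comments nor blank) in text exposition read from stdin.
count_samples() {
  grep -cv -e '^#' -e '^[[:space:]]*$'
}
```

Usage: curl -fsS http://localhost:9200/metrics | count_samples. A count of 0 right after a rebuild suggests the exporter started but has not produced data yet.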

Queue Exporter — Managing Monitored Tables

The queue-exporter does not have a hardcoded list of tables. It reads dev_r_exporters_queues from Supabase on every poll cycle (60s). Changing which tables are monitored requires only a SQL row change — no code change, no redeploy.

Table schema

SELECT id, table_name, schema_name, label, status_column, active, notes
FROM dev_r_exporters_queues
ORDER BY label;

Column | Purpose
table_name | Postgres table to GROUP BY status_column
schema_name | Schema, default public
label | Prometheus label value for the queue dimension
status_column | Column to group by, default status
active | true = scrape each cycle; false = skip
notes | Free text: why it is there or why it is paused

Add a new queue

INSERT INTO dev_r_exporters_queues (table_name, schema_name, label, status_column, active, notes)
VALUES ('my_jobs', 'public', 'my_jobs', 'status', true, 'Job processing queue added YYYY-MM-DD');

Also grant grafana_readonly SELECT on the table:

GRANT SELECT ON public.my_jobs TO grafana_readonly;

The exporter picks it up within 60s — no restart needed.

Pause a queue (keep row, stop scraping)

UPDATE dev_r_exporters_queues SET active = false WHERE table_name = 'my_jobs';

Remove a queue permanently

DELETE FROM dev_r_exporters_queues WHERE table_name = 'my_jobs';

Current registered queues

SELECT table_name, label, active, notes FROM dev_r_exporters_queues ORDER BY active DESC, label;

Permissions note

The exporter connects as grafana_readonly via the Supabase session pooler. If a table has RLS policies that reference other tables (e.g. profiles), the query fails with permission denied. Fix: either grant SELECT on the referenced table too, or query through a view owned by a role that can read both tables. A view runs with its owner's privileges unless security_invoker is set, which is what Supabase calls a SECURITY DEFINER view.
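
Both conditions can be inspected from SQL before changing any grants. A sketch of a shell helper that prints the diagnostic queries; has_table_privilege and pg_policies are standard PostgreSQL, while the helper name and the DATABASE_URL variable in the usage line are assumptions:

```shell
# Emit SQL that checks SELECT privilege for grafana_readonly and lists RLS policies on a table.
# Schema and table names are interpolated as-is, so pass trusted identifiers only.
rls_check_sql() {
  schema=$1
  table=$2
  cat <<EOF
SELECT has_table_privilege('grafana_readonly', '${schema}.${table}', 'SELECT') AS can_select;
SELECT policyname, roles, cmd, qual
FROM pg_policies
WHERE schemaname = '${schema}' AND tablename = '${table}';
EOF
}
```

Usage: rls_check_sql public my_jobs | psql "$DATABASE_URL". The qual column shows each policy expression, which is where references to other tables appear.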