Monitoring Stack — Operations Workbook

Covers: Prometheus, Thanos (sidecar + query), Alertmanager, Loki, Promtail, Blackbox Exporter, Caddy — all running on IONOS VPS (vps-i1).

Architecture

IONOS VPS (217.154.82.162)  ─── Caddy (443 TLS) ─── public HTTPS endpoints
│
├── prometheus:9090          Metrics collection, 15d local TSDB
│   └── thanos-sidecar:10901 Uploads 2h TSDB blocks → Wasabi ecotrans-monitoring
│
├── thanos-query:10904       Unified PromQL: local + Wasabi long-term
├── alertmanager:9093        Alert routing → email (Mailgun EU)
├── loki:3100                Log aggregation (14-day retention)
├── promtail                 Ships Docker logs → Loki
├── blackbox-exporter:9115   HTTP/HTTPS probes (synthetic checks)
└── caddy:80/443             TLS termination for all above
 
Long-term storage: Wasabi s3://ecotrans-monitoring (eu-central-1)

Compose file: /opt/p24-infra/monitoring/docker-compose.yml
Public URLs:

Service	URL
Prometheus	`https://prometheus.vps-i1.infra.zintegrowana.online`
Alertmanager	`https://alertmanager.vps-i1.infra.zintegrowana.online`

Both are protected by Caddy basic_auth (username: admin; password: the value of GRAFANA_ADMIN_PASSWORD in .env).

Config Management

File	In repo?	Purpose
`prometheus/prometheus.yml`	✅	Scrape targets
`prometheus/rules/*.yml`	✅	Alert rules
`prometheus/blackbox.yml`	✅	Blackbox probe config
`alertmanager/alertmanager.yml`	✅	Alert routing + receivers
`loki/loki-config.yml`	✅	Loki storage + retention
`promtail/config-vps-i1.yml`	✅	Log scrape config
`Caddyfile`	✅	Reverse proxy + TLS
`thanos/s3.yml`	✅ template	Wasabi S3 config (from template + .env)
`.env`	❌ (`.env.example`)	Secrets

Updating alert rules (hot reload)

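# Edit monitoring/prometheus/rules/*.yml → commit → on vps-i1:
cd /opt/p24-infra/monitoring
git pull
# The reload endpoint only works if Prometheus runs with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload
# No restart needed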

Updating Alertmanager config

# Edit monitoring/alertmanager/alertmanager.yml → commit → on vps-i1:
cd /opt/p24-infra/monitoring
git pull
curl -X POST http://localhost:9093/-/reload
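
After a reload it can be worth confirming that the new config was actually accepted. Prometheus and Alertmanager each expose a gauge for this on /metrics (prometheus_config_last_reload_successful and alertmanager_config_last_reload_successful). The following is a minimal Python sketch, assuming the two ports above are reachable on localhost from wherever it runs; the function names are illustrative and not part of this repo.

```python
import urllib.request

# Base URLs come from the curl commands above; the gauge names are the ones
# each service publishes on /metrics.
SERVICES = {
    "prometheus": ("http://localhost:9090", "prometheus_config_last_reload_successful"),
    "alertmanager": ("http://localhost:9093", "alertmanager_config_last_reload_successful"),
}


def reload_and_fetch_metrics(base_url: str, timeout: float = 10.0) -> str:
    """POST to /-/reload, then return the /metrics body as text."""
    request = urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout):
        pass  # a non-2xx answer raises urllib.error.HTTPError
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as response:
        return response.read().decode("utf-8")


def last_reload_successful(metrics_text: str, gauge: str) -> bool:
    """Return True if the named gauge equals 1 in a text-format metrics dump."""
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if fields[0] == gauge:
            return float(fields[1]) == 1.0
    raise KeyError(f"{gauge} not found in metrics output")


def reload_service(name: str) -> bool:
    """Reload one entry of SERVICES and report whether its config loaded."""
    base_url, gauge = SERVICES[name]
    return last_reload_successful(reload_and_fetch_metrics(base_url), gauge)
```

On vps-i1, `reload_service("alertmanager")` would perform the same POST as the curl command and then read the gauge. A rejected config normally surfaces first as an HTTP error from the reload call itself, so the gauge check acts as a second confirmation. It can also be run on its own against a saved /metrics dump.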

Deployment

Full stack bring-up

cd /opt/p24-infra/monitoring
docker compose up -d

Restart individual service

docker compose restart prometheus
docker compose restart alertmanager
docker compose restart caddy

Check stack health

cd /opt/p24-infra/monitoring
docker compose ps
docker compose logs --tail=30 prometheus

Backup

What needs backing up

Data	Backup method	Schedule	Destination
Prometheus TSDB	Thanos sidecar → Wasabi	Continuous (2h blocks)	`s3://ecotrans-monitoring/`
Alertmanager silences	Not backed up	—	Gap — silences are ephemeral
Caddy TLS certs	`caddy_data` volume — not backed up	—	Gap — auto-renewed via ACME
Prometheus config + rules	Git repo	On push	GitHub

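Caddy certs note: If caddy_data is lost, Caddy re-requests Let’s Encrypt certificates automatically on the next start. Expect brief downtime (~1 min) while the certificates are reissued. This is not a data-loss risk, but repeated reissues can run into Let’s Encrypt rate limits.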

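Manual Prometheus backup (emergency check of Thanos uploads)

# thanos compact only processes blocks already in the bucket; local blocks are
# uploaded by thanos-sidecar once Prometheus cuts them (every 2h). Check it first:
cd /opt/p24-infra/monitoring && docker compose logs --tail=30 thanos-sidecar
# Then run a one-off compaction pass over the bucket (exits when done):
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  quay.io/thanos/thanos:v0.36.1 \
  compact --objstore.config-file /s3.yml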

Restore

Prometheus — Restore from Wasabi

# 1. List available blocks
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  quay.io/thanos/thanos:v0.36.1 \
  tools bucket ls --objstore.config-file /s3.yml
 
# 2. Stop Prometheus and Thanos sidecar
cd /opt/p24-infra/monitoring
docker compose stop thanos-sidecar prometheus
 
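# 3. Restore specific block
#    Check the volume name with `docker volume ls` first: Compose normally prefixes
#    it with the project name (e.g. monitoring_prometheus_data).
#    Confirm the subcommand and flags with `thanos tools bucket rewrite --help`;
#    in the upstream docs, rewrite modifies blocks inside the bucket.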
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  -v prometheus_data:/prometheus \
  quay.io/thanos/thanos:v0.36.1 \
  tools bucket rewrite --objstore.config-file /s3.yml \
  --id <BLOCK_ULID> --output-dir /prometheus
 
# 4. Start Prometheus (Thanos will continue uploading)
docker compose up -d prometheus thanos-sidecar
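
When picking a `<BLOCK_ULID>` from the listing in step 1, it can help to know roughly when each block was written. A ULID encodes a millisecond timestamp in its first 10 characters (Crockford base32). The sketch below decodes it using only the standard library. Note the assumption: this is the time the block ID was generated, not the time range of the samples inside it, and compacted blocks get fresh ULIDs.

```python
from datetime import datetime, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid: str) -> int:
    """Return the millisecond Unix timestamp stored in a ULID's first 10 chars."""
    if len(ulid) != 26:
        raise ValueError("a ULID has exactly 26 characters")
    millis = 0
    for char in ulid[:10].upper():
        # str.index raises ValueError for characters outside the alphabet.
        millis = millis * 32 + CROCKFORD.index(char)
    return millis


def ulid_created_at(ulid: str) -> datetime:
    """Return the ULID's embedded timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ulid_timestamp_ms(ulid) / 1000, tz=timezone.utc)
```

For example, `ulid_created_at("01ARYZ6S41TSV4RRFFQ69G5FAV")` (the sample ID from the ULID spec) falls in July 2016. Block IDs from the bucket listing can be passed in the same way.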

Caddy — Fresh cert after volume loss

# Start Caddy; it requests fresh certificates automatically via ACME
docker compose up -d caddy
# Monitor logs during first start
docker compose logs -f caddy

Loki — Data loss is acceptable

Loki stores logs with 14-day retention. On a fresh start, log history is empty — only new logs will appear. This is acceptable by design.

Healthchecks

All services have Docker healthcheck: directives (added 2026-05-14):

Service	Check endpoint	Interval
prometheus	`/-/healthy`	30s
thanos-sidecar	`/-/healthy` (port 10902)	30s
thanos-query	`/-/healthy` (port 10904)	30s
alertmanager	`/-/healthy`	30s
loki	`/ready`	30s
promtail	`/ready` (port 9080)	30s
blackbox-exporter	`/health`	30s
caddy	`/config/` (admin port 2019)	30s

External probes: the rules in Prometheus infrastructure.yml fire ServerDown within 2 min of a target going down.
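
The table above can also be polled from a script, for example after a full stack bring-up. This is a minimal sketch under stated assumptions: the ports and paths are copied from the table and the architecture diagram, but whether each port is published on the host is not confirmed here (inside the Compose network, pass the service name as `host` instead). The helper names are illustrative.

```python
import urllib.request

# (port, path) per service, copied from the healthcheck table and the
# architecture diagram in this document.
HEALTH_ENDPOINTS = {
    "prometheus": (9090, "/-/healthy"),
    "thanos-sidecar": (10902, "/-/healthy"),
    "thanos-query": (10904, "/-/healthy"),
    "alertmanager": (9093, "/-/healthy"),
    "loki": (3100, "/ready"),
    "promtail": (9080, "/ready"),
    "blackbox-exporter": (9115, "/health"),
    "caddy": (2019, "/config/"),
}


def health_url(service: str, host: str = "localhost") -> str:
    """Build the health URL for one service from the table."""
    port, path = HEALTH_ENDPOINTS[service]
    return f"http://{host}:{port}{path}"


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True on a 2xx answer; False on HTTP errors, refusals, timeouts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError, HTTPError and socket timeouts are all OSError
        return False


def unhealthy_services(host: str = "localhost") -> list[str]:
    """Return the names of services whose health endpoint did not answer 2xx."""
    return [name for name in HEALTH_ENDPOINTS if not is_healthy(health_url(name, host))]
```

An empty list from `unhealthy_services()` would mean every endpoint answered. This complements the Docker healthchecks rather than replacing them, since `docker compose ps` already shows the per-container health state.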

Alert Rules Reference

Rule file	Key alerts
`infrastructure.yml`	ServerDown, ContainerCrashLooping, LowDisk, HighMemory, HighCPU
`backups.yml`	BackupStale (>26h), BackupSizeRegression
`synthetic.yml`	EndpointDown, EndpointSlow
`security.yml`	SSHAuthFailures
`costs.yml`	VercelApproachingFreeTier, SupabaseDbSizeApproachingPro
`queues.yml`	TranscriptionQueueCritical
`loki.yml`	LokiIngestionStopped
`n8n.yml`	N8nWorkflowFailed, N8nSnapshotStale

Password Rotation

basic_auth (Prometheus + Alertmanager public URLs)

Caddy basic_auth uses GRAFANA_ADMIN_PASSWORD (same as Grafana admin). Rotate via:

# Generate bcrypt hash for Caddyfile
docker run --rm caddy:2.8-alpine caddy hash-password --plaintext "${NEW_PASS}"
# Update Caddyfile with new hash, then:
docker compose restart caddy

Also update .env GRAFANA_ADMIN_PASSWORD — see docs/grafana-operations.md.

Troubleshooting

Symptom	Cause	Fix
Prometheus targets show `DOWN`	Exporter container restarted	`docker compose restart <exporter>`
Thanos upload stalled	Wasabi connectivity issue	Check `docker compose logs thanos-sidecar`; verify S3 creds in `.env`
Alertmanager not sending email	SMTP config wrong	`curl -X POST http://localhost:9093/-/reload`; check Mailgun dashboard
Caddy TLS renewal failed	Rate limit or DNS not resolving	Check Caddy logs; verify DNS wildcard record
Loki not receiving logs	Promtail cannot reach Loki	`docker compose restart promtail`; check `loki_data` volume space
blackbox-exporter probe fails	Target unreachable	Verify URL + Caddy config for the target service
queue-exporter scrapes 0 queues	All rows `active=false` in registry	`SELECT * FROM dev_r_exporters_queues;` — set `active=true` for desired rows
queue-exporter query error	RLS policy on target table blocks `grafana_readonly`	Grant SELECT on the table; check for RLS policies referencing other tables
pg-stats-exporter connection error	IPv6 DNS / wrong DB host	Ensure `SUPABASE_DB_HOST` is the session pooler (`aws-1-eu-central-1.pooler.supabase.com`), not `db.*.supabase.co`
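
For the pg-stats-exporter row above, a quick way to tell the two hosts apart is to check whether the configured name resolves to any IPv4 address at all. The sketch below reads `SUPABASE_DB_HOST` from the environment (the variable name comes from the table; everything else is illustrative) and uses only the standard resolver.

```python
import os
import socket


def ipv4_addresses(host: str) -> list[str]:
    """Return the IPv4 addresses a hostname resolves to (empty if none)."""
    try:
        infos = socket.getaddrinfo(
            host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []  # no A record, or the name does not resolve at all
    return sorted({info[4][0] for info in infos})


def check_db_host(env_var: str = "SUPABASE_DB_HOST") -> str:
    """Summarise whether the configured DB host is reachable over IPv4."""
    host = os.environ.get(env_var, "")
    if not host:
        return f"{env_var} is not set"
    addresses = ipv4_addresses(host)
    if addresses:
        return f"{host} resolves to IPv4: {', '.join(addresses)}"
    return f"{host} has no IPv4 address; use the session pooler host instead"
```

Running `print(check_db_host())` inside the exporter container (where the variable is set) shows which case applies. An empty result can also mean a general DNS failure, so it is a hint rather than proof of an IPv6-only host.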

Custom Exporters

All custom exporters run as Docker containers on vps-i1 (IONOS), built from monitoring/exporters/. Each exposes a /metrics endpoint scraped by Prometheus every 60s.

Exporter	Port	Source	What it publishes
`queue-exporter`	`:9200`	Supabase DB (psycopg2, `grafana_readonly`)	Queue depths by status for tables in `dev_r_exporters_queues`
`pg-stats-exporter`	`:9201`	Supabase DB (psycopg2, `grafana_readonly`)	Top-200 slowest queries from `extensions.pg_stat_statements`
`cost-exporter`	`:9210`	Vercel API + Supabase mgmt API + Wasabi S3	Monthly spend / usage per service (daily refresh)
`vercel-exporter`	`:9202`	Vercel API	Deployment state per project (every 5m)
`backup-exporter`	`:9220`	`/opt/backups/backup-status.prom`	Backup age and size freshness
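
To illustrate how backup freshness relates to the BackupStale (>26h) rule, the sketch below parses a Prometheus text-format file such as /opt/backups/backup-status.prom and lists series whose timestamp is older than 26 hours. The metric name `backup_last_success_timestamp_seconds` is hypothetical; the actual contents of the file are not shown in this document, so adjust it to match.

```python
import time
from pathlib import Path

# Hypothetical metric name; use whatever backup-status.prom really contains.
TIMESTAMP_METRIC = "backup_last_success_timestamp_seconds"
MAX_AGE_SECONDS = 26 * 3600  # mirrors the BackupStale (>26h) threshold


def stale_backups(prom_text, now=None, max_age=MAX_AGE_SECONDS, metric=TIMESTAMP_METRIC):
    """Return the series (name plus labels) whose timestamp value is too old."""
    now = time.time() if now is None else now
    stale = []
    for line in prom_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] != metric:
            continue  # some other metric, e.g. a size gauge
        if now - float(value) > max_age:
            stale.append(series)
    return stale


def stale_backups_in_file(path="/opt/backups/backup-status.prom"):
    """Convenience wrapper that reads the textfile from disk."""
    return stale_backups(Path(path).read_text(encoding="utf-8"))
```

Calling `stale_backups_in_file()` on vps-i1 would return an empty list when every backup is fresh. The parser assumes one sample per line without a trailing timestamp.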

Rebuild after code change

cd /opt/p24-infra/monitoring
git pull
docker compose up -d --no-deps --build queue-exporter   # or whichever exporter

Queue Exporter — Managing Monitored Tables

The queue-exporter does not have a hardcoded list of tables. It reads dev_r_exporters_queues from Supabase on every poll cycle (60s). Changing which tables are monitored requires only a SQL row change — no code change, no redeploy.

Table schema

SELECT id, table_name, schema_name, label, status_column, active, notes
FROM dev_r_exporters_queues
ORDER BY label;

Column	Purpose
`table_name`	Postgres table whose rows are counted per `status_column` value
`schema_name`	Schema, default `public`
`label`	Prometheus label value for the `queue` dimension
`status_column`	Column to group by, default `status`
`active`	`true` = scrape each cycle; `false` = skip
`notes`	Free text — why it’s there or why it’s paused
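
The columns above map naturally onto one count query per active row. The following is a sketch of how a registry row might be turned into SQL, not the exporter's actual code. The function names and the `depth` alias are assumptions. Identifiers are quoted because they come from table data and cannot be passed as bind parameters.

```python
def quote_ident(name: str) -> str:
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_queue_query(row: dict) -> str:
    """Build the per-status count query for one registry row."""
    schema = quote_ident(row.get("schema_name") or "public")
    table = quote_ident(row["table_name"])
    status = quote_ident(row.get("status_column") or "status")
    return (
        f"SELECT {status} AS status, COUNT(*) AS depth "
        f"FROM {schema}.{table} GROUP BY {status}"
    )


def queries_for_cycle(rows: list[dict]) -> dict[str, str]:
    """Map each active row's label to its query; inactive rows are skipped."""
    return {row["label"]: build_queue_query(row) for row in rows if row.get("active")}
```

Each resulting query would be executed as grafana_readonly, with `label` used as the value of the `queue` dimension. Quoting makes identifiers case-sensitive, so the registry values need to match the real table and column names exactly.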

Add a new queue

INSERT INTO dev_r_exporters_queues (table_name, schema_name, label, status_column, active, notes)
VALUES ('my_jobs', 'public', 'my_jobs', 'status', true, 'Job processing queue added YYYY-MM-DD');

Also grant grafana_readonly SELECT on the table:

GRANT SELECT ON public.my_jobs TO grafana_readonly;

The exporter picks it up within 60s — no restart needed.

Pause a queue (keep row, stop scraping)

UPDATE dev_r_exporters_queues SET active = false WHERE table_name = 'my_jobs';

Remove a queue permanently

DELETE FROM dev_r_exporters_queues WHERE table_name = 'my_jobs';

Current registered queues

SELECT table_name, label, active, notes FROM dev_r_exporters_queues ORDER BY active DESC, label;

Permissions note

The exporter connects as grafana_readonly via the Supabase session pooler. If a table has RLS policies that reference other tables (e.g. profiles), the query fails with permission denied. Fix: either grant SELECT on the referenced table too, or expose the data through a view owned by a role that can read both tables (a view runs with its owner's privileges unless security_invoker is set) or through a SECURITY DEFINER function, and query that instead.
