Monitoring Stack — Operations Workbook
Covers: Prometheus, Thanos (sidecar + query), Alertmanager, Loki, Promtail, Blackbox Exporter, Caddy — all running on IONOS VPS (vps-i1).
Architecture
IONOS VPS (217.154.82.162) ─── Caddy (443 TLS) ─── public HTTPS endpoints
│
├── prometheus:9090 Metrics collection, 15d local TSDB
│ └── thanos-sidecar:10901 Uploads 2h TSDB blocks → Wasabi ecotrans-monitoring
│
├── thanos-query:10904 Unified PromQL: local + Wasabi long-term
├── alertmanager:9093 Alert routing → email (Mailgun EU)
├── loki:3100 Log aggregation (14-day retention)
├── promtail Ships Docker logs → Loki
├── blackbox-exporter:9115 HTTP/HTTPS probes (synthetic checks)
└── caddy:80/443 TLS termination for all above
Long-term storage: Wasabi s3://ecotrans-monitoring (eu-central-1)
Compose file: /opt/p24-infra/monitoring/docker-compose.yml
Public URLs:
| Service | URL |
|---|---|
| Prometheus | https://prometheus.vps-i1.infra.zintegrowana.online |
| Alertmanager | https://alertmanager.vps-i1.infra.zintegrowana.online |
Both are protected by Caddy basic_auth (username: admin, password: the value of GRAFANA_ADMIN_PASSWORD).
Config Management
| File | In repo? | Purpose |
|---|---|---|
| prometheus/prometheus.yml | ✅ | Scrape targets |
| prometheus/rules/*.yml | ✅ | Alert rules |
| prometheus/blackbox.yml | ✅ | Blackbox probe config |
| alertmanager/alertmanager.yml | ✅ | Alert routing + receivers |
| loki/loki-config.yml | ✅ | Loki storage + retention |
| promtail/config-vps-i1.yml | ✅ | Log scrape config |
| Caddyfile | ✅ | Reverse proxy + TLS |
| thanos/s3.yml | ✅ template | Wasabi S3 config (from template + .env) |
| .env | ❌ (.env.example) | Secrets |
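The table says thanos/s3.yml is produced from a template plus .env, but the rendering step itself is not shown. The sketch below is one possible way to do it, not necessarily how this repo does it. The `type: S3` / `config:` layout follows the Thanos object storage configuration format; the variable names `WASABI_ACCESS_KEY` and `WASABI_SECRET_KEY` and the endpoint value in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: print a Thanos objstore config for an S3-compatible bucket.
# Arguments: bucket, endpoint, access key, secret key.
render_s3_config() {
  cat <<EOF
type: S3
config:
  bucket: "$1"
  endpoint: "$2"
  access_key: "$3"
  secret_key: "$4"
EOF
}

# Example usage (variable names and endpoint are hypothetical):
#   set -a; . ./.env; set +a      # export everything defined in .env
#   umask 077                     # keep the rendered file private
#   render_s3_config ecotrans-monitoring s3.eu-central-1.wasabisys.com \
#     "$WASABI_ACCESS_KEY" "$WASABI_SECRET_KEY" > thanos/s3.yml
```

Keeping the rendered file out of Git and readable only by its owner matches the intent of keeping secrets in .env.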
Updating alert rules (hot reload)
# Edit monitoring/prometheus/rules/*.yml → commit → on vps-i1:
git pull
curl -X POST http://localhost:9090/-/reload
# No restart needed
Updating Alertmanager config
# Edit monitoring/alertmanager/alertmanager.yml → commit → on vps-i1:
git pull
curl -X POST http://localhost:9093/-/reload
Deployment
Full stack bring-up
cd /opt/p24-infra/monitoring
docker compose up -d
Restart individual service
docker compose restart prometheus
docker compose restart alertmanager
docker compose restart caddy
Check stack health
cd /opt/p24-infra/monitoring
docker compose ps
docker compose logs --tail=30 prometheus
Backup
What needs backing up
| Data | Backup method | Schedule | Destination |
|---|---|---|---|
| Prometheus TSDB | Thanos sidecar → Wasabi | Continuous (2h blocks) | s3://ecotrans-monitoring/ |
| Alertmanager silences | Not backed up | — | Gap — silences are ephemeral |
| Caddy TLS certs | caddy_data volume — not backed up | — | Gap — auto-renewed via ACME |
| Prometheus config + rules | Git repo | On push | GitHub |
Caddy certs note: If caddy_data is lost, Caddy will request new Let's Encrypt certificates automatically on restart. Expect brief downtime (~1 min) while they are issued. Not a data-loss risk.
Manual Prometheus backup (emergency — force Thanos upload)
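# Run the Thanos compactor against the bucket.
# Note: compact only processes blocks that are already in Wasabi; pending local
# blocks are uploaded by thanos-sidecar, not by this command.
# With --wait the command keeps running until it is stopped (Ctrl-C).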
docker run --rm \
-v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
quay.io/thanos/thanos:v0.36.1 \
compact --objstore.config-file /s3.yml --wait
Restore
Prometheus — Restore from Wasabi
# 1. List available blocks
docker run --rm \
-v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
quay.io/thanos/thanos:v0.36.1 \
tools bucket ls --objstore.config-file /s3.yml
# 2. Stop Prometheus and Thanos sidecar
cd /opt/p24-infra/monitoring
docker compose stop thanos-sidecar prometheus
# 3. Restore specific block
docker run --rm \
-v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
-v prometheus_data:/prometheus \
quay.io/thanos/thanos:v0.36.1 \
tools bucket rewrite --objstore.config-file /s3.yml \
--id <BLOCK_ULID> --output-dir /prometheus
# 4. Start Prometheus (Thanos will continue uploading)
docker compose up -d prometheus thanos-sidecar
Caddy — Fresh cert after volume loss
# Just restart; Caddy requests new certificates automatically
docker compose up -d caddy
# Monitor logs during first start
docker compose logs -f caddy
Loki — Data loss is acceptable
Loki stores logs with 14-day retention. On a fresh start, log history is empty — only new logs will appear. This is acceptable by design.
Healthchecks
All services have Docker healthcheck: directives (added 2026-05-14):
| Service | Check endpoint | Interval |
|---|---|---|
| prometheus | /-/healthy | 30s |
| thanos-sidecar | /-/healthy (port 10902) | 30s |
| thanos-query | /-/healthy (port 10904) | 30s |
| alertmanager | /-/healthy | 30s |
| loki | /ready | 30s |
| promtail | /ready (port 9080) | 30s |
| blackbox-exporter | /health | 30s |
| caddy | /config/ (admin port 2019) | 30s |
External probes: Prometheus infrastructure.yml rules fire ServerDown within 2 min.
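For a quick manual pass over the same endpoints the Docker healthchecks use, something like the following sketch could be run on vps-i1. The paths and ports come from the table above; that each port is reachable on localhost from the host is an assumption, since the checks may only be reachable from inside the containers.

```shell
#!/bin/sh
# Sketch: poll each service's health endpoint once and report ok/FAIL.
SERVICES="prometheus thanos-sidecar thanos-query alertmanager loki promtail blackbox-exporter caddy"

# Map a service name to its health URL (localhost reachability is assumed).
health_url() {
  case "$1" in
    prometheus)        echo "http://localhost:9090/-/healthy" ;;
    thanos-sidecar)    echo "http://localhost:10902/-/healthy" ;;
    thanos-query)      echo "http://localhost:10904/-/healthy" ;;
    alertmanager)      echo "http://localhost:9093/-/healthy" ;;
    loki)              echo "http://localhost:3100/ready" ;;
    promtail)          echo "http://localhost:9080/ready" ;;
    blackbox-exporter) echo "http://localhost:9115/health" ;;
    caddy)             echo "http://localhost:2019/config/" ;;
    *) return 1 ;;
  esac
}

# Returns non-zero if any endpoint fails or times out.
check_all() {
  failed=0
  for svc in $SERVICES; do
    if curl -fsS -o /dev/null --max-time 5 "$(health_url "$svc")" 2>/dev/null; then
      echo "ok   $svc"
    else
      echo "FAIL $svc"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_all` prints one line per service. If ports are not published to the host, `docker compose ps` remains the more reliable view of health status.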
Alert Rules Reference
| Rule file | Key alerts |
|---|---|
| infrastructure.yml | ServerDown, ContainerCrashLooping, LowDisk, HighMemory, HighCPU |
| backups.yml | BackupStale (>26h), BackupSizeRegression |
| synthetic.yml | EndpointDown, EndpointSlow |
| security.yml | SSHAuthFailures |
| costs.yml | VercelApproachingFreeTier, SupabaseDbSizeApproachingPro |
| queues.yml | TranscriptionQueueCritical |
| loki.yml | LokiIngestionStopped |
| n8n.yml | N8nWorkflowFailed, N8nSnapshotStale |
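When editing any of these rule files, validating them before the hot reload avoids pushing a broken rule set to Prometheus. The sketch below combines `promtool check rules` (which ships in the official Prometheus image) with the `/-/reload` call used elsewhere in this workbook. The in-container rules path is an assumption, and the reload endpoint only works if Prometheus runs with `--web.enable-lifecycle`.

```shell
#!/bin/sh
# Sketch: validate rule files inside the container, then hot-reload Prometheus.
RULES_GLOB="${RULES_GLOB:-/etc/prometheus/rules/*.yml}"   # assumed in-container path
PROM_URL="${PROM_URL:-http://localhost:9090}"

# /-/reload answers with HTTP 200 when the new config was loaded.
reload_succeeded() {
  [ "$1" = "200" ]
}

validate_and_reload() {
  # The glob is expanded by the shell inside the container, not on the host.
  if ! docker compose exec -T prometheus sh -c "promtool check rules $RULES_GLOB"; then
    echo "rule validation failed; not reloading" >&2
    return 1
  fi
  code=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$PROM_URL/-/reload")
  if reload_succeeded "$code"; then
    echo "reload ok"
  else
    echo "reload returned HTTP $code" >&2
    return 1
  fi
}
```

Run `validate_and_reload` from /opt/p24-infra/monitoring after `git pull`, so that `docker compose` can find the compose file.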
Password Rotation
basic_auth (Prometheus + Alertmanager public URLs)
Caddy basic_auth uses GRAFANA_ADMIN_PASSWORD (same as Grafana admin). Rotate via:
# Generate bcrypt hash for Caddyfile
docker run --rm caddy:2.8-alpine caddy hash-password --plaintext "${NEW_PASS}"
# Update Caddyfile with new hash, then:
docker compose restart caddy
Also update GRAFANA_ADMIN_PASSWORD in .env (see docs/grafana-operations.md).
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Prometheus targets show DOWN | Exporter container restarted | docker compose restart <exporter> |
| Thanos upload stalled | Wasabi connectivity issue | Check docker compose logs thanos-sidecar; verify S3 creds in .env |
| Alertmanager not sending email | SMTP config wrong | curl -X POST http://localhost:9093/-/reload; check Mailgun dashboard |
| Caddy TLS renewal failed | Rate limit or DNS not resolving | Check Caddy logs; verify DNS wildcard record |
| Loki not receiving logs | Promtail cannot reach Loki | docker compose restart promtail; check loki_data volume space |
| blackbox-exporter probe fails | Target unreachable | Verify URL + Caddy config for the target service |
| queue-exporter scrapes 0 queues | All rows active=false in registry | SELECT * FROM dev_r_exporters_queues; — set active=true for desired rows |
| queue-exporter query error | RLS policy on target table blocks grafana_readonly | Grant SELECT on the table; check for RLS policies referencing other tables |
| pg-stats-exporter connection error | IPv6 DNS / wrong DB host | Ensure SUPABASE_DB_HOST is the session pooler (aws-1-eu-central-1.pooler.supabase.com), not db.*.supabase.co |
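The last row is easy to check mechanically. The sketch below reads SUPABASE_DB_HOST from a dotenv-style file and tests whether it looks like a session pooler hostname. The `*.pooler.supabase.com` pattern is inferred from the example host in the table, and the parser is deliberately minimal (no quote or comment handling).

```shell
#!/bin/sh
# Sketch: confirm SUPABASE_DB_HOST in .env points at the session pooler.

# Print the value of KEY from a dotenv-style file (first match only).
env_value() {
  sed -n "s/^$2=//p" "$1" | head -n 1
}

# Succeeds for hostnames ending in .pooler.supabase.com.
is_pooler_host() {
  case "$1" in
    *.pooler.supabase.com) return 0 ;;
    *) return 1 ;;
  esac
}

check_db_host() {
  host=$(env_value "${1:-.env}" SUPABASE_DB_HOST)
  if is_pooler_host "$host"; then
    echo "ok: $host"
  else
    echo "not a pooler host: $host" >&2
    return 1
  fi
}
```

`check_db_host /opt/p24-infra/monitoring/.env` exits non-zero when the value is a direct db.*.supabase.co host.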
Custom Exporters
All custom exporters run as Docker containers on vps-i1 (IONOS), built from monitoring/exporters/. Each exposes a /metrics endpoint scraped by Prometheus every 60s.
| Exporter | Port | Source | What it publishes |
|---|---|---|---|
| queue-exporter | :9200 | Supabase DB (psycopg2, grafana_readonly) | Queue depths by status for tables in dev_r_exporters_queues |
| pg-stats-exporter | :9201 | Supabase DB (psycopg2, grafana_readonly) | Top-200 slowest queries from extensions.pg_stat_statements |
| cost-exporter | :9210 | Vercel API + Supabase mgmt API + Wasabi S3 | Monthly spend / usage per service (daily refresh) |
| vercel-exporter | :9202 | Vercel API | Deployment state per project (every 5m) |
| backup-exporter | :9220 | /opt/backups/backup-status.prom | Backup age and size freshness |
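To sanity-check backup freshness by hand, the status file read by backup-exporter can be inspected directly. The sketch below applies the same 26-hour threshold as the BackupStale alert. The metric name `backup_last_success_timestamp_seconds` is hypothetical (the real name in backup-status.prom may differ), and the value is assumed to be an integer epoch in seconds.

```shell
#!/bin/sh
# Sketch: compare the last-success timestamp in a .prom file with a 26h threshold.
STALE_AFTER_SECONDS=$((26 * 3600))

# Print the value of the first sample of METRIC in FILE (labels are ignored).
metric_value() {
  awk -v m="$1" '$1 == m || index($1, m "{") == 1 { print $2; exit }' "$2"
}

# Args: now_epoch last_success_epoch. Succeeds if the age exceeds the threshold.
is_stale() {
  [ $(( $1 - $2 )) -gt "$STALE_AFTER_SECONDS" ]
}

check_backup() {
  file=${1:-/opt/backups/backup-status.prom}
  # Hypothetical metric name; adjust to whatever the file actually contains.
  last=$(metric_value backup_last_success_timestamp_seconds "$file")
  if [ -z "$last" ]; then
    echo "metric not found in $file" >&2
    return 2
  fi
  if is_stale "$(date +%s)" "$last"; then
    echo "stale"
    return 1
  fi
  echo "fresh"
}
```

A timestamp written in scientific notation would break the integer arithmetic here, so treat this as a debugging aid rather than a replacement for the alert rule.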
Rebuild after code change
cd /opt/p24-infra/monitoring
git pull
docker compose up -d --no-deps --build queue-exporter # or whichever exporter
Queue Exporter — Managing Monitored Tables
The queue-exporter does not have a hardcoded list of tables. It reads dev_r_exporters_queues from Supabase on every poll cycle (60s). Changing which tables are monitored requires only a SQL row change — no code change, no redeploy.
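The exporter's actual SQL is not reproduced in this workbook. As a rough illustration of what "group by the status column" amounts to for one registry row, the sketch below builds the equivalent query string, which can be handy when reproducing a permission error by hand as grafana_readonly. The function name and argument order are illustrative only.

```shell
#!/bin/sh
# Sketch: build a per-status count query for one registry row.
# Args: table_name [schema_name=public] [status_column=status]
queue_query() {
  table=$1
  schema=${2:-public}
  column=${3:-status}
  printf 'SELECT "%s", count(*) FROM "%s"."%s" GROUP BY "%s";\n' \
    "$column" "$schema" "$table" "$column"
}

# Example (DATABASE_URL is a hypothetical connection string for grafana_readonly):
#   psql "$DATABASE_URL" -c "$(queue_query my_jobs)"
```

The defaults mirror the schema_name and status_column defaults of the registry table.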
Table schema
SELECT id, table_name, schema_name, label, status_column, active, notes
FROM dev_r_exporters_queues
ORDER BY label;
| Column | Purpose |
|---|---|
| table_name | Postgres table to GROUP BY status_column |
| schema_name | Schema, default public |
| label | Prometheus label value for the queue dimension |
| status_column | Column to group by, default status |
| active | true = scrape each cycle; false = skip |
| notes | Free text: why it's there or why it's paused |
Add a new queue
INSERT INTO dev_r_exporters_queues (table_name, schema_name, label, status_column, active, notes)
VALUES ('my_jobs', 'public', 'my_jobs', 'status', true, 'Job processing queue added YYYY-MM-DD');
Also grant grafana_readonly SELECT on the table:
GRANT SELECT ON public.my_jobs TO grafana_readonly;
The exporter picks it up within 60s; no restart is needed.
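To confirm that a newly added queue is actually being exported, the exporter's /metrics output can be searched for its label. The sketch below assumes the label is exposed under the label name `queue`, which this workbook does not state explicitly; adjust the name if the exporter uses a different one.

```shell
#!/bin/sh
# Sketch: succeed if Prometheus text-format metrics on stdin contain queue="<label>".
has_queue_label() {
  grep -q -F "queue=\"$1\""
}

# Example, run on vps-i1 about a minute after the INSERT:
#   curl -s http://localhost:9200/metrics | has_queue_label my_jobs && echo "my_jobs is exported"
```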
Pause a queue (keep row, stop scraping)
UPDATE dev_r_exporters_queues SET active = false WHERE table_name = 'my_jobs';
Remove a queue permanently
DELETE FROM dev_r_exporters_queues WHERE table_name = 'my_jobs';
Current registered queues
SELECT table_name, label, active, notes FROM dev_r_exporters_queues ORDER BY active DESC, label;
Permissions note
The exporter connects as grafana_readonly via the Supabase session pooler. If a table has RLS policies that reference other tables (e.g. profiles), the query will fail with permission denied. Fix: either grant SELECT on the referenced table too, or create a SECURITY DEFINER view and query that instead.