04 — Monitoring

Observability stack: Prometheus + Thanos for metrics, Grafana for dashboards, Alertmanager for email/Discord alerts, and a suite of custom Python exporters covering Supabase queues, slow queries, backup status, costs, and credential rotation age.

The stack runs on vps-i1 (217.154.82.162) under Docker Compose, from /opt/p24-infra/monitoring/.
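For orientation, the sketch below shows roughly what a custom Python exporter looks like: it renders a gauge in the Prometheus text exposition format and serves it on /metrics. It uses only the standard library and is not the actual exporter code; the real exporters may use a client library instead. The metric name `example_queue_depth`, the `queue` label, and the helper `collect_queue_depths` are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def _escape(value: str) -> str:
    """Escape a label value as required by the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: dict, label: str = "queue") -> str:
    """Render one gauge metric with a single label in text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in samples.items():
        lines.append(f'{name}{{{label}="{_escape(label_value)}"}} {value}')
    return "\n".join(lines) + "\n"


def collect_queue_depths() -> dict:
    """Hypothetical collector: a real exporter would query Supabase here."""
    return {"emails": 3, "webhooks": 0}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_gauge(
            "example_queue_depth",  # hypothetical metric name
            "Messages waiting per queue",
            collect_queue_depths(),
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9200) -> None:
    """Serve /metrics until interrupted; 9200 mirrors the queue-exporter port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus would then scrape it like any other target.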

Key Documents

| Document | Description |
| --- | --- |
| monitoring-stack-operations.md | Docker Compose lifecycle: start, reload, upgrade |
| monitoring-prometheus-grafana.md | Prometheus scrape config, retention, Thanos sidecar/query |
| grafana-operations.md | Dashboard management, datasource config, image renderer |
| monitoring-exporters-operations.md | All custom exporters: ports, env vars, metrics exposed |
| alert-response-runbook.md | How to respond to each alert rule |
| supabase-slow-query-monitoring.md | pg-stats-exporter slow-query monitoring setup |

Component Map

| Component | Image | Port | Purpose |
| --- | --- | --- | --- |
| Prometheus | prom/prometheus | 127.0.0.1:9090 | Scrapes all targets, 15d local TSDB |
| Thanos sidecar | quay.io/thanos/thanos | internal | Uploads 2h blocks to Wasabi S3 |
| Thanos query | quay.io/thanos/thanos | internal | Unified PromQL over local + S3 |
| Alertmanager | prom/alertmanager | 127.0.0.1:9093 | Email via Mailgun EU |
| Grafana | grafana/grafana | 127.0.0.1:3000 | Dashboards (Thanos + Supabase PostgreSQL) |
| queue-exporter | custom Python | :9200 | Supabase queue depths |
| pg-stats-exporter | custom Python | :9201 | pg_stat_statements slow queries |
| backup-exporter | custom Python | :9220 | Wasabi backup status JSON |
| cost-exporter | custom Python | :9210 | Vercel/Supabase/Wasabi billing |
| vercel-exporter | custom Python | internal | Vercel deployment metrics |
| credential-exporter | custom Python | internal | Credential rotation age |
| grafana-image-renderer | grafana/grafana-image-renderer | internal | PNG screenshots for daily reports |
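A quick way to sanity-check the map above from the host is to probe each published port. The following is a minimal sketch, assuming the exporter ports listed without a bind address (for example :9200) are reachable on 127.0.0.1; components marked internal are skipped because they have no host port. The function names are illustrative only.

```python
import socket

# Host ports copied from the Component Map; "internal" components are omitted.
COMPONENT_PORTS = {
    "prometheus": 9090,
    "alertmanager": 9093,
    "grafana": 3000,
    "queue-exporter": 9200,
    "pg-stats-exporter": 9201,
    "cost-exporter": 9210,
    "backup-exporter": 9220,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_components(host: str = "127.0.0.1", ports: dict = COMPONENT_PORTS) -> dict:
    """Probe every listed component and map its name to reachability."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Running `check_components()` on vps-i1 returns a name-to-boolean mapping. An open port only shows that something is listening, not that the service is healthy.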

Public URLs

| URL | Service |
| --- | --- |
| grafana.vps-i1.infra.zintegrowana.online | Grafana (Grafana login) |
| prometheus.vps-i1.infra.zintegrowana.online | Prometheus (basic_auth) |
| alertmanager.vps-i1.infra.zintegrowana.online | Alertmanager (basic_auth) |
| infra.zintegrowana.online | Grafana public alias |
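Because the public Prometheus endpoint sits behind basic auth, scripted queries need an Authorization header. The sketch below builds a request against the standard Prometheus HTTP API path /api/v1/query. The https scheme is an assumption, and the function names are illustrative.

```python
import base64
import json
import urllib.parse
import urllib.request

# Hostname from the Public URLs table; the https scheme is an assumption.
PROMETHEUS_URL = "https://prometheus.vps-i1.infra.zintegrowana.online"


def build_query_request(expr: str, user: str, password: str,
                        base_url: str = PROMETHEUS_URL) -> urllib.request.Request:
    """Build a GET request for /api/v1/query carrying a Basic auth header."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def instant_query(expr: str, user: str, password: str) -> list:
    """Run an instant query and return the result list from the JSON response."""
    request = build_query_request(expr, user, password)
    with urllib.request.urlopen(request, timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]
```

For example, `instant_query("up", user, password)` would list the scrape targets and whether each is up. Read the credentials from the environment or a secrets store rather than hard-coding them.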

Improvement Proposals

| Proposal | Description |
| --- | --- |
| 02-loki-logs.md | Add Loki for log aggregation |
| 05-blackbox-synthetic.md | Blackbox exporter for synthetic probes |
| 06-consolidate-health-checks.md | Consolidate health-check endpoints |
| 07-status-page.md | Public status page (Uptime Kuma) |
| 10-deployment-version-dashboard.md | Deployment version tracking dashboard |
| 11-cost-dashboard.md | Cost dashboard in Grafana |
| 12-cert-expiry-alerts.md | TLS cert expiry alert rules |

Cross-references

  • README — how to act on alerts fired by Alertmanager
  • README — Mailgun SMTP used for alert emails