Supabase slow-query monitoring

GitHub issue: radieu/p24-infra#38

Surfaces slow / hot / regressed queries on the production Supabase project (mwkqmgadqnkkihjdeqsi) by reading pg_stat_statements from two angles:

  • Live exploration — Grafana panels query the view directly via the existing Supabase PostgreSQL datasource (grafana_readonly). Real-time, no Prometheus dependency.
  • Long-term history + alertingpg-stats-exporter (custom Python, monitoring/exporters/pg-stats-exporter/) polls the view every 60s, publishes top-200 rows as Prometheus gauges on :9201, scraped by the monitoring Prometheus, stored long-term in Thanos/Wasabi.

One-time Supabase setup

pg_stat_statements is preloaded on every Supabase project. Two SQL statements are required, run once in the Supabase SQL editor with an admin role:

-- Records the extension in the schema if not already present.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
 
-- Grants the grafana_readonly role permission to read the view + pg_stat_database.
GRANT pg_read_all_stats TO grafana_readonly;

Verification:

SET ROLE grafana_readonly;
SELECT count(*) FROM pg_stat_statements;
RESET ROLE;

Files in this repo

PathRole
monitoring/exporters/pg-stats-exporter/app.pyPolls pg_stat_statements, exposes Prometheus metrics on :9201
monitoring/exporters/pg-stats-exporter/DockerfileContainer image (python:3.11-slim + psycopg2-binary)
monitoring/exporters/pg-stats-exporter/requirements.txtPinned Python deps
monitoring/docker-compose.ymlAdds pg-stats-exporter service
monitoring/prometheus/prometheus.ymlAdds pg_stats scrape job (60s)
monitoring/prometheus/rules/supabase-queries.ymlAlert rules (slow, regressed, high-frequency, exporter down)
monitoring/grafana/provisioning/dashboards/supabase-slow-queries.jsonDashboard supabase-slow-queries (uid)

Exposed metrics

Labels: queryid (stable per normalised statement) + fingerprint (collapsed-whitespace query text, truncated to 80 chars).

MetricTypeSource column
supabase_query_calls_totalGaugepg_stat_statements.calls
supabase_query_total_time_secondsGaugetotal_exec_time / 1000
supabase_query_mean_time_secondsGaugemean_exec_time / 1000
supabase_query_rows_totalGaugerows
supabase_query_shared_blks_hit_totalGaugeshared_blks_hit
supabase_query_shared_blks_read_totalGaugeshared_blks_read
supabase_pg_stats_exporter_last_scrape_timestamp_secondsGaugeWall-clock of last good scrape
supabase_pg_stats_exporter_last_scrape_rowsGaugeRows pulled last cycle
supabase_pg_stats_exporter_errors_totalCounterScrape errors by error_type

The counters from pg_stat_statements are exposed as Prometheus gauges, not counters, because the underlying view can be reset (pg_stat_statements_reset()) and individual rows can disappear when crowded out — neither is compatible with the Prometheus counter contract. rate() over these gauges is still a reasonable approximation of calls/sec over short windows.

Alert rules — supabase_queries group

AlertExpressionSeverity
SupabasePgStatsExporterDownabsent(up{job="pg_stats"} == 1) for 5mwarning
SupabasePgStatsScrapeStalelast_scrape > 5m oldwarning
SupabaseSlowQueryMeanHighany statement mean >1s for 10mwarning
SupabaseSlowQueryMeanCriticalany statement mean >5s for 5mcritical
SupabaseQueryRegressedmean ≥2x vs 24h ago AND >250ms, for 15mwarning
SupabaseHighFrequencyQueryrate(calls[5m]) > 50 for 15mwarning

Tune thresholds in monitoring/prometheus/rules/supabase-queries.yml after baselining for a week.

Dashboard

URL once deployed: https://grafana.vps-i1.infra.zintegrowana.online/d/supabase-slow-queries

Top section uses the Supabase PG datasource (live pg_stat_statements); bottom section uses Prometheus (history + rates). Refreshes every 1 minute.

Deploy

On vps-i1 (IONOS) the repo lives at /opt/p24-infra and is auto-pulled by the Claude PreToolUse hook; explicit deploy:

ssh root@217.154.82.162
cd /opt/p24-infra && git pull
cd monitoring
docker compose build pg-stats-exporter
docker compose up -d pg-stats-exporter
# Reload Prometheus + Grafana picks up new dashboard/scrape job
curl -X POST http://localhost:9090/-/reload
docker compose restart grafana

Smoke tests

# Exporter metrics
docker exec monitoring-prometheus-1 wget -qO- http://pg-stats-exporter:9201/metrics | head -50
 
# Prometheus sees the target
curl -s -u admin:$GRAFANA_ADMIN_PASSWORD \
    'https://prometheus.vps-i1.infra.zintegrowana.online/api/v1/targets?state=active' \
    | jq '.data.activeTargets[] | select(.labels.job=="pg_stats")'
 
# Rules loaded
curl -s -u admin:$GRAFANA_ADMIN_PASSWORD \
    'https://prometheus.vps-i1.infra.zintegrowana.online/api/v1/rules' \
    | jq '.data.groups[] | select(.name=="supabase_queries")'
 
# Live view from Supabase
psql "host=aws-1-eu-central-1.pooler.supabase.com user=grafana_readonly.mwkqmgadqnkkihjdeqsi sslmode=require dbname=postgres" \
    -c "SELECT count(*) FROM pg_stat_statements;"

Troubleshooting

SymptomLikely cause
error_type="missing_view" counter risespg_stat_statements extension not created — run the CREATE EXTENSION block above
error_type="connection" rises, target downSupabase pooler unreachable — check IP allowlist / DNS / SUPABASE_GRAFANA_PASSWORD env on vps-i1
No data in dashboard top panelsgrafana_readonly lacks pg_read_all_stats — re-run the GRANT
Same queries appear forever even after they slowed downExpected: gauges reflect the top-N snapshot; values reset to 0 only when the row drops off the list. Use rate() for short-window trends