Spec 05 — Synthetic monitoring (Blackbox exporter)
Purpose
We currently probe internal services on a 6-hour cadence. User-visible URLs (Vercel projects, public Grafana at infra.zintegrowana.online, n8n, WAHA) are only “noticed” when Vercel or Cloudflare emails us, or when a user complains. Blackbox exporter adds 30-second HTTP probes from inside the monitoring stack, feeding Prometheus and Grafana — same alerting pipeline as everything else.
It also produces the data needed for spec 12 (cert expiry alerts).
Rulebook
- Probe targets list lives in git. `monitoring/prometheus/prometheus.yml` under `blackbox_targets`.
- One module per protocol. `http_2xx`, `http_2xx_post`, `tcp_connect`, `dns_query`: defined once, reused.
- No deep checks from blackbox. Login flows and multi-step probes belong in Playwright. Blackbox is for “is it up + valid TLS + responds <2 s”.
- Probe from monitoring host only. Don’t run blackbox on each VPS; single observability vantage is cheaper and consistent.
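As a rough illustration of the "one module per protocol" rule, a `blackbox.yml` could look like the sketch below. The timeouts, the IPv4 preference, and the DNS query name are assumptions, not values taken from this spec.

```yaml
# monitoring/prometheus/blackbox.yml (sketch; all values below are assumptions)
modules:
  http_2xx:
    prober: http
    timeout: 5s                    # longer than the 2 s target so slow responses are still measured
    http:
      method: GET
      preferred_ip_protocol: ip4   # assumption: probe over IPv4
      fail_if_not_ssl: true        # every listed target is https
      # valid_status_codes is left unset, which means "any 2xx";
      # redirects are followed by default, so a 302 to a login page ends in a 200
  http_2xx_post:
    prober: http
    timeout: 5s
    http:
      method: POST
      preferred_ip_protocol: ip4
  tcp_connect:
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: ip4
  dns_query:
    prober: dns
    timeout: 5s
    dns:
      query_name: eco-trans.eu     # hypothetical: name to resolve; the probe target is the DNS server
      query_type: A
      preferred_ip_protocol: ip4
```

Keeping the module timeout above the 2 s expectation means a slow endpoint shows up as a long `probe_duration_seconds` rather than as a failed probe, so "slow" and "down" stay distinguishable.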
Architecture
                        ┌─ Prometheus (vps-i1) ─┐
                        │  scrapes every 30s    │
                        └──────────┬────────────┘
                                   │
                                   ▼
                      ┌──── blackbox-exporter ────┐
                      │  sends HTTP/TCP probes to │
                      │  N public/internal URLs   │
                      └────────────┬──────────────┘
                                   │
            ┌────────────────┬─────┴──────┬─────────────┐
            ▼                ▼            ▼             ▼
    et-oper.vercel.app   monitoring   n8n.vps-h1   waha2.vps-h1
    .eco-trans           .infra...
    .eu
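A minimal sketch of the exporter as a compose service, assuming Prometheus is defined in the same `monitoring/docker-compose.yml` and therefore shares the default compose network. The image tag and the in-container config path are assumptions.

```yaml
# monitoring/docker-compose.yml (fragment, sketch)
services:
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0   # version pin is an assumption
    restart: unless-stopped
    command:
      - --config.file=/etc/blackbox/blackbox.yml
    volumes:
      # the modules file from this repo, mounted read-only
      - ./prometheus/blackbox.yml:/etc/blackbox/blackbox.yml:ro
    expose:
      - "9115"   # default exporter port, reachable by Prometheus only, not published on the host
```

Using `expose` rather than `ports` keeps the probe endpoint off the public interface, which matters because anyone who can reach `/probe` can make the exporter issue requests on their behalf.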
Targets (initial):
| URL | Module | Expected |
|---|---|---|
| https://et-operational-platform.vercel.app/api/health | http_2xx | 200, <2s |
| https://et-operational-platform-7ktl.vercel.app/api/health | http_2xx | 200, <2s |
| https://grafana.vps-i1.infra.zintegrowana.online | http_2xx | 200/302 |
| https://n8n.vps-h1.infra.zintegrowana.online/healthz | http_2xx | 200 |
| https://waha2.vps-h1.infra.zintegrowana.online/api/health | http_2xx | 200 |
| https://eco-trans.eu | http_2xx | 200 |
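The targets above could be wired into Prometheus roughly as follows. This is a sketch of the usual blackbox relabelling pattern: each listed URL is moved into the `target` query parameter, copied into the `instance` label, and the scrape address is rewritten to point at the exporter. The address `blackbox-exporter:9115` is an assumption (compose service name plus default port), and how this maps onto the `blackbox_targets` entry mentioned in the rulebook is not specified here.

```yaml
# monitoring/prometheus/prometheus.yml (fragment, sketch)
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    scrape_interval: 30s
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://et-operational-platform.vercel.app/api/health
          - https://et-operational-platform-7ktl.vercel.app/api/health
          - https://grafana.vps-i1.infra.zintegrowana.online
          - https://n8n.vps-h1.infra.zintegrowana.online/healthz
          - https://waha2.vps-h1.infra.zintegrowana.online/api/health
          - https://eco-trans.eu
    relabel_configs:
      # 1. the listed URL becomes ?target=<url> on the probe request
      - source_labels: [__address__]
        target_label: __param_target
      # 2. keep the URL as the instance label so alerts and dashboards name the endpoint
      - source_labels: [__param_target]
        target_label: instance
      # 3. actually scrape the exporter, not the URL itself
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumption: compose service name and default port
```

With this layout every probed URL appears as one instance under a single `blackbox` job, so adding a target is a one-line change in git.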
Implementation plan
- Add `blackbox-exporter` service to `monitoring/docker-compose.yml`.
- Create `monitoring/prometheus/blackbox.yml` with the 4 modules (`http_2xx`, `http_2xx_post`, `tcp_connect`, `dns_query`).
- Add scrape config block to `monitoring/prometheus/prometheus.yml` using the `relabel_configs` trick.
- Add alerts to `monitoring/prometheus/rules/synthetic.yml`:
  - `EndpointDown`: `probe_success == 0 for 2m`, severity critical
  - `EndpointSlow`: `probe_duration_seconds > 2 for 5m`, severity warning
  - `CertExpiringSoon`: `probe_ssl_earliest_cert_expiry - time() < 7*24*3600`, severity warning (closes spec 12)
- Provision Grafana dashboard “Synthetic checks” with one row per target.
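The three alerts in the plan could be expressed as a rules file along these lines. Note that "for 2m" and "for 5m" map to the rule's `for:` field and are not part of the PromQL expression. The group name and the annotation texts are assumptions added for illustration.

```yaml
# monitoring/prometheus/rules/synthetic.yml (sketch)
groups:
  - name: synthetic            # group name is an assumption
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is failing blackbox probes"

      - alert: EndpointSlow
        expr: probe_duration_seconds > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has taken more than 2 s to respond for 5 minutes"

      - alert: CertExpiringSoon
        # seconds until the earliest certificate in the chain expires, compared with 7 days
        expr: probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in under 7 days"
```

With a 30 s scrape interval and `for: 2m`, `EndpointDown` should fire roughly two to three minutes after an endpoint starts failing.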
Acceptance criteria
- Prometheus targets page shows 7 targets under the `blackbox` job, all UP
- Grafana “Synthetic checks” dashboard renders with green status for all targets
- Manually stopping `et-operational-platform-7ktl` Vercel deployment triggers `EndpointDown` within 3 min
- `probe_ssl_earliest_cert_expiry` series exists for all https targets
Cost impact
0 €. One small container.
Back-out plan
Remove blackbox-exporter service, remove scrape config + rules. No data loss elsewhere; only the synthetic series disappears.
Risks / open questions
- Risk: Probing Vercel from a single egress IP could trigger rate-limits. Mitigation: a 30 s interval is only 2 requests per minute per target, well under any sane limit; abort if rate-limiting is seen.
- Q: Should we probe et-operational-platform from EU and US? A: No, single vantage is fine for our scale.
Bootstrap (post-merge deployment)
After this PR is merged to main, a human runs the following on vps-i1 to bring the new blackbox-exporter container online and have Prometheus pick up the new scrape job + rules + dashboard.
# 1. SSH to vps-i1 as root
ssh -i C:\Users\konar\.ssh\id_ed25519 root@217.154.82.162
# 2. Pull the merged changes
cd /opt/p24-infra && git pull
# 3. Bring up the new blackbox-exporter service
cd monitoring && docker compose up -d blackbox-exporter
# 4. Reload Prometheus (picks up new scrape job + rules/synthetic.yml)
#    Note: the reload endpoint only works if Prometheus runs with --web.enable-lifecycle
curl -s -X POST http://localhost:9090/-/reload
# 5. Verify in Prometheus UI that the `blackbox` job is UP and probing 7 targets
# Open: https://prometheus.vps-i1.infra.zintegrowana.online/targets
# Expect: 7 instances under job=blackbox, all in state=UP
# 6. Verify Grafana "Synthetic checks" dashboard renders with green status
# Open: https://grafana.vps-i1.infra.zintegrowana.online/d/synthetic-blackbox-v1
# 7. After ~5 min check Alertmanager for any unexpected EndpointDown firing
# Open: https://alertmanager.vps-i1.infra.zintegrowana.online
# If anything fires unexpectedly, open an issue; do NOT silence blindly.
Rollback (if needed):
cd /opt/p24-infra/monitoring
docker compose stop blackbox-exporter && docker compose rm -f blackbox-exporter
# Then revert the PR in git, git pull, and reload Prometheus again.