Spec 05 — Synthetic monitoring (Blackbox exporter)

Purpose

We currently probe internal services on a 6-hour cadence. Outages of user-visible URLs (Vercel projects, public Grafana at infra.zintegrowana.online, n8n, WAHA) only reach us when Vercel or Cloudflare emails us, or when a user complains. Blackbox exporter adds 30-second HTTP probes from inside the monitoring stack and feeds Prometheus and Grafana, so these checks go through the same alerting pipeline as everything else.

It also produces the probe_ssl_earliest_cert_expiry series needed for spec 12 (cert expiry alerts).

Rulebook

The probe target list lives in git: monitoring/prometheus/prometheus.yml, under blackbox_targets.
One module per protocol. http_2xx, http_2xx_post, tcp_connect, dns_query are each defined once and reused.
No deep checks from blackbox. Login flows and multi-step probes belong in Playwright. Blackbox only answers “is it up, is TLS valid, does it respond in <2 s”.
Probe from the monitoring host only. Don’t run blackbox on each VPS; a single observability vantage point is cheaper and gives consistent measurements.
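
As an illustration of the "defined once, reused" rule, the four modules could look roughly like the sketch below. The module names come from this spec; the timeouts, the fail_if_not_ssl setting, and the dns_query name are assumptions to adjust, not the agreed configuration.

```yaml
# monitoring/prometheus/blackbox.yml (sketch; option values are assumptions)
modules:
  http_2xx:
    prober: http
    timeout: 5s                 # assumption: hard cap per probe
    http:
      method: GET
      follow_redirects: true    # a 302 to a login page is followed to its final response
      fail_if_not_ssl: true     # assumption: every HTTP target is expected to serve TLS
      # valid_status_codes left unset: the exporter then accepts any 2xx
  http_2xx_post:
    prober: http
    timeout: 5s
    http:
      method: POST
  tcp_connect:
    prober: tcp
    timeout: 5s
  dns_query:
    prober: dns
    timeout: 5s
    dns:
      query_name: eco-trans.eu  # placeholder: query_name is required by the dns prober
      query_type: A
```

Each target then selects a module through the `module` query parameter of the probe request, so a per-target copy of these settings should not be needed.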

Architecture

                  ┌─ Prometheus (vps-i1) ─┐
                  │  scrapes every 30s    │
                  └──────────┬────────────┘
                             │
                             ▼
                  ┌──── blackbox-exporter ────┐
                  │  sends HTTP/TCP probes to │
                  │  N public/internal URLs   │
                  └────────────┬──────────────┘
                               │
        ┌─────────────────┬────┴─────┬──────────────────┐
        ▼                 ▼          ▼                  ▼
 et-oper.vercel.app  monitoring   n8n.vps-h1   waha2.vps-h1
                     .eco-trans                .infra...
                     .eu
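
A minimal sketch of the exporter service in this picture, assuming Prometheus is defined in the same compose file and can therefore reach the exporter by service name on the default compose network. The image tag, the in-container config path, and the mount path are assumptions.

```yaml
# monitoring/docker-compose.yml (fragment; sketch under the assumptions above)
services:
  blackbox-exporter:
    image: prom/blackbox-exporter:latest   # assumption: pin a specific version in practice
    restart: unless-stopped
    command:
      - --config.file=/etc/blackbox/blackbox.yml
    volumes:
      - ./prometheus/blackbox.yml:/etc/blackbox/blackbox.yml:ro
    # Not published on the host: only Prometheus needs to reach port 9115,
    # and it does so over the compose network as blackbox-exporter:9115.
    expose:
      - "9115"
```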

Targets (initial):

URL	Module	Expected
`https://et-operational-platform.vercel.app/api/health`	http_2xx	200, <2s
`https://et-operational-platform-7ktl.vercel.app/api/health`	http_2xx	200, <2s
`https://grafana.vps-i1.infra.zintegrowana.online`	http_2xx	200/302
`https://n8n.vps-h1.infra.zintegrowana.online/healthz`	http_2xx	200
`https://waha2.vps-h1.infra.zintegrowana.online/api/health`	http_2xx	200
`https://eco-trans.eu`	http_2xx	200
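
The table above could translate into a scrape job along these lines. This is a sketch: the job name `blackbox` matches the bootstrap notes, but the exporter address `blackbox-exporter:9115`, the scrape_timeout, and the use of plain static_configs (rather than whatever structure blackbox_targets uses) are assumptions.

```yaml
# monitoring/prometheus/prometheus.yml (fragment; sketch)
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    scrape_interval: 30s
    scrape_timeout: 10s          # assumption
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://et-operational-platform.vercel.app/api/health
          - https://et-operational-platform-7ktl.vercel.app/api/health
          - https://grafana.vps-i1.infra.zintegrowana.online
          - https://n8n.vps-h1.infra.zintegrowana.online/healthz
          - https://waha2.vps-h1.infra.zintegrowana.online/api/health
          - https://eco-trans.eu
    relabel_configs:
      # The listed URL becomes the ?target= parameter of the probe request ...
      - source_labels: [__address__]
        target_label: __param_target
      # ... and is kept as the instance label so alerts name the probed URL ...
      - source_labels: [__param_target]
        target_label: instance
      # ... while the actual scrape goes to the exporter itself.
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumption: compose service name and default port
```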

Implementation plan

Add blackbox-exporter service to monitoring/docker-compose.yml.
Create monitoring/prometheus/blackbox.yml with the 4 modules (http_2xx, http_2xx_post, tcp_connect, dns_query).
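Add scrape config block to monitoring/prometheus/prometheus.yml using the standard blackbox relabel_configs pattern: each target URL is copied into the target query parameter and the instance label, and __address__ is rewritten to point at the exporter.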
Add alerts to monitoring/prometheus/rules/synthetic.yml:
- EndpointDown — probe_success == 0 for 2m — severity critical
- EndpointSlow — probe_duration_seconds > 2 for 5m — severity warning
- CertExpiringSoon — probe_ssl_earliest_cert_expiry - time() < 7*24*3600 — severity warning (closes spec 12)
Provision Grafana dashboard “Synthetic checks” with one row per target.
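
The three alerts listed in the plan could be written roughly as follows. Expressions, durations, and severities are taken from the list above; the group name and the annotation texts are assumptions.

```yaml
# monitoring/prometheus/rules/synthetic.yml (sketch)
groups:
  - name: synthetic              # assumption: group name
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is failing its blackbox probe"
      - alert: EndpointSlow
        expr: probe_duration_seconds > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} takes longer than 2 s to probe"
      - alert: CertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in under 7 days"
```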

Acceptance criteria

Prometheus targets page shows 7 targets under the blackbox job, all UP
Grafana “Synthetic checks” dashboard renders with green status for all targets
Manually stopping the et-operational-platform-7ktl Vercel deployment triggers EndpointDown within 3 min
probe_ssl_earliest_cert_expiry series exists for all https targets
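
The EndpointDown timing can also be checked offline with a promtool rule unit test, before touching the real deployment. The sketch below assumes the test file sits next to rules/synthetic.yml (the file name synthetic_test.yml is hypothetical) and that EndpointDown is defined as in the implementation plan: probe_success == 0 for 2m, severity critical.

```yaml
# monitoring/prometheus/rules/synthetic_test.yml (hypothetical file name)
rule_files:
  - synthetic.yml
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      # Two successful probes, then the target stays down from t=1m to t=5m.
      - series: 'probe_success{job="blackbox", instance="https://et-operational-platform-7ktl.vercel.app/api/health"}'
        values: '1 1 0+0x8'
    alert_rule_test:
      # 1 min into the outage the alert is still pending, so nothing is firing.
      - eval_time: 2m
        alertname: EndpointDown
        exp_alerts: []
      # 3 min into the outage the 2m "for" clause has elapsed and the alert fires.
      - eval_time: 4m
        alertname: EndpointDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: blackbox
              instance: https://et-operational-platform-7ktl.vercel.app/api/health
            # If the rule defines annotations, add matching exp_annotations here;
            # promtool compares them as well.
```

It would be run with `promtool test rules synthetic_test.yml` from the rules directory. It complements the manual Vercel test rather than replacing it.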

Cost impact

0 €. One small container.

Back-out plan

Remove blackbox-exporter service, remove scrape config + rules. No data loss elsewhere; only the synthetic series disappears.

Risks / open questions

Risk: Probing Vercel from a single egress IP could trigger rate-limits. Mitigation: a 30 s interval is well under any sane limit; back the probes out if rate-limiting is observed.
Q: Should we probe et-operational-platform from EU and US? A: No, single vantage is fine for our scale.

Bootstrap (post-merge deployment)

After this PR is merged to main, a human runs the following on vps-i1 to bring the new blackbox-exporter container online, have Prometheus pick up the new scrape job and rules, and confirm that Grafana shows the new dashboard.

# 1. SSH to vps-i1 as root
ssh -i C:\Users\konar\.ssh\id_ed25519 root@217.154.82.162
 
# 2. Pull the merged changes
cd /opt/p24-infra && git pull
 
# 3. Bring up the new blackbox-exporter service
cd monitoring && docker compose up -d blackbox-exporter
 
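# 4. Reload Prometheus (picks up new scrape job + rules/synthetic.yml)
#    The reload endpoint only works if Prometheus runs with --web.enable-lifecycle;
#    otherwise restart Prometheus instead. -f makes curl exit non-zero on an HTTP error.
curl -fsS -X POST http://localhost:9090/-/reload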
 
# 5. Verify in Prometheus UI that the `blackbox` job is UP and probing 7 targets
#    Open: https://prometheus.vps-i1.infra.zintegrowana.online/targets
#    Expect: 7 instances under job=blackbox, all in state=UP
 
# 6. Verify Grafana "Synthetic checks" dashboard renders with green status
#    Open: https://grafana.vps-i1.infra.zintegrowana.online/d/synthetic-blackbox-v1
 
# 7. After ~5 min check Alertmanager for any unexpected EndpointDown firing
#    Open: https://alertmanager.vps-i1.infra.zintegrowana.online
#    If anything fires unexpectedly — open an issue, do NOT silence blindly.

Rollback (if needed):

cd /opt/p24-infra/monitoring
docker compose stop blackbox-exporter && docker compose rm -f blackbox-exporter
# Then revert the PR in git, git pull, and reload Prometheus again.
