Spec 05 — Synthetic monitoring (Blackbox exporter)

Purpose

We currently probe internal services on a 6-hour cadence. User-visible URLs (Vercel projects, public Grafana at infra.zintegrowana.online, n8n, WAHA) are only “noticed” when Vercel or Cloudflare emails us, or when a user complains. Blackbox exporter adds 30-second HTTP probes from inside the monitoring stack, feeding Prometheus and Grafana — same alerting pipeline as everything else.

It also produces the data needed for spec 12 (cert expiry alerts).


Rulebook

  1. The probe target list lives in git: monitoring/prometheus/prometheus.yml, under blackbox_targets.
  2. One module per protocol. http_2xx, http_2xx_post, tcp_connect, dns_query — defined once, reused.
  3. No deep checks from blackbox. Login flows, multi-step probes — those belong in Playwright. Blackbox is for “is it up + valid TLS + responds <2 s”.
  4. Probe from the monitoring host only. Don’t run blackbox on each VPS; a single vantage point is cheaper and gives consistent results.
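
As a rough illustration of rule 2, the four modules might be defined along these lines in monitoring/prometheus/blackbox.yml. The timeouts, the fail_if_not_ssl setting, and the dns_query name are assumptions made for this sketch, not values taken from this spec.

```yaml
# monitoring/prometheus/blackbox.yml (sketch; values marked "assumption" are illustrative)
modules:
  http_2xx:
    prober: http
    timeout: 5s                # assumption: hard cap per probe; the <2 s target is enforced by an alert
    http:
      method: GET
      fail_if_not_ssl: true    # assumption: every initial target is https
      # valid_status_codes is omitted, so the exporter's default (any 2xx) applies

  http_2xx_post:
    prober: http
    timeout: 5s                # assumption
    http:
      method: POST

  tcp_connect:
    prober: tcp
    timeout: 5s                # assumption

  dns_query:
    prober: dns
    timeout: 5s                # assumption
    dns:
      query_name: "eco-trans.eu"   # assumption: the dns prober needs some name to resolve
      query_type: "A"
```

Since the http prober follows redirects by default, a 302 from the Grafana root that ends on a 200 page should still count as a successful http_2xx probe, which would cover the "200/302" expectation without a dedicated module.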

Architecture

                  ┌─ Prometheus (vps-i1) ─┐
                  │  scrapes every 30s    │
                  └──────────┬────────────┘
                             │
                             ▼
                  ┌──── blackbox-exporter ────┐
                  │  sends HTTP/TCP probes to │
                  │  N public/internal URLs   │
                  └────────────┬──────────────┘
                               │
        ┌─────────────────┬────┴─────┬──────────────────┐
        ▼                 ▼          ▼                  ▼
 et-oper.vercel.app  monitoring   n8n.vps-h1   waha2.vps-h1
                     .eco-trans                .infra...
                     .eu

Targets (initial):

URL                                                         Module    Expected
https://et-operational-platform.vercel.app/api/health       http_2xx  200, <2s
https://et-operational-platform-7ktl.vercel.app/api/health  http_2xx  200, <2s
https://grafana.vps-i1.infra.zintegrowana.online            http_2xx  200/302
https://n8n.vps-h1.infra.zintegrowana.online/healthz        http_2xx  200
https://waha2.vps-h1.infra.zintegrowana.online/api/health   http_2xx  200
https://eco-trans.eu                                        http_2xx  200
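
To make the Prometheus-to-blackbox flow from the architecture diagram concrete, here is a minimal sketch of a scrape job for the targets listed above. It uses the relabeling pattern commonly shown for blackbox exporter. The spec does not show how blackbox_targets is laid out inside prometheus.yml, so the plain static_configs list is an assumption, as is the exporter being reachable as blackbox-exporter:9115 (9115 is the exporter's default port) on the Compose network.

```yaml
# Fragment of monitoring/prometheus/prometheus.yml (sketch)
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    scrape_interval: 30s
    params:
      module: [http_2xx]          # all initial targets use the http_2xx module
    static_configs:
      - targets:                  # assumption: plain static list; URLs copied from the targets table
          - https://et-operational-platform.vercel.app/api/health
          - https://et-operational-platform-7ktl.vercel.app/api/health
          - https://grafana.vps-i1.infra.zintegrowana.online
          - https://n8n.vps-h1.infra.zintegrowana.online/healthz
          - https://waha2.vps-h1.infra.zintegrowana.online/api/health
          - https://eco-trans.eu
    relabel_configs:
      # 1. Pass the listed URL to the exporter as ?target=<url>
      - source_labels: [__address__]
        target_label: __param_target
      # 2. Keep the probed URL as the instance label so dashboards and alerts show it
      - source_labels: [__param_target]
        target_label: instance
      # 3. Point the actual scrape at the exporter instead of the URL itself
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumption: Compose service name + default port
```

With this layout, adding a new probe is a one-line change to the targets list, which keeps the list reviewable in git as rule 1 requires.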

Implementation plan

  1. Add blackbox-exporter service to monitoring/docker-compose.yml.
  2. Create monitoring/prometheus/blackbox.yml with the 4 modules (http_2xx, http_2xx_post, tcp_connect, dns_query).
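  3. Add a scrape config block to monitoring/prometheus/prometheus.yml using the standard blackbox relabel_configs pattern (the probed URL is passed as the target parameter, and the scrape address is rewritten to point at the exporter).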
  4. Add alerts to monitoring/prometheus/rules/synthetic.yml:
    • EndpointDown: probe_success == 0 for 2m, severity critical
    • EndpointSlow: probe_duration_seconds > 2 for 5m, severity warning
    • CertExpiringSoon: probe_ssl_earliest_cert_expiry - time() < 7*24*3600, severity warning (closes spec 12)
  5. Provision Grafana dashboard “Synthetic checks” with one row per target.
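
The three alerts in step 4 could be written roughly as follows in monitoring/prometheus/rules/synthetic.yml. The expressions, durations, and severities come from the list above; the group name and the annotation texts are assumptions added for readability.

```yaml
# monitoring/prometheus/rules/synthetic.yml (sketch)
groups:
  - name: synthetic               # assumption: group name is illustrative
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:              # assumption: annotation wording is illustrative
          summary: "{{ $labels.instance }} is failing blackbox probes"

      - alert: EndpointSlow
        expr: probe_duration_seconds > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} takes longer than 2 s to respond"

      - alert: CertExpiringSoon
        # Fires when the earliest certificate in the chain expires in under 7 days
        expr: probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in under 7 days"
```

A 2-minute "for" clause on top of a 30-second scrape interval leaves some headroom for the 3-minute detection target in the acceptance criteria, assuming the rule evaluation interval is not much longer than the scrape interval.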

Acceptance criteria

  • Prometheus targets page shows 7 targets under the blackbox job, all UP
  • Grafana “Synthetic checks” dashboard renders with green status for all targets
  • Manually stopping the et-operational-platform-7ktl Vercel deployment triggers EndpointDown within 3 min
  • probe_ssl_earliest_cert_expiry series exists for all https targets

Cost impact

0 €. One small container.
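
For a sense of how small the addition is, the container might be declared roughly like this in monitoring/docker-compose.yml. This is a sketch under assumptions: the unpinned image reference, the restart policy, and the mount path are not specified anywhere in this spec, and Prometheus is assumed to run in the same Compose project so that it can reach the service by name.

```yaml
# Fragment of monitoring/docker-compose.yml (sketch)
services:
  blackbox-exporter:
    image: prom/blackbox-exporter          # assumption: pin an explicit version tag in practice
    restart: unless-stopped                # assumption
    command:
      - --config.file=/etc/blackbox_exporter/config.yml
    volumes:
      # Mount the module definitions from the repo read-only
      - ./prometheus/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    expose:
      - "9115"                             # exporter's default port, reachable only inside the Compose network
```

Using expose rather than ports keeps the exporter off the host's public interfaces; only Prometheus needs to reach it.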

Back-out plan

Remove blackbox-exporter service, remove scrape config + rules. No data loss elsewhere; only the synthetic series disappears.

Risks / open questions

  • Risk: Probing Vercel from a single egress IP could trigger rate limits. Mitigation: a 30 s interval is well under any sane limit; abort if rate limiting is observed.
  • Q: Should we probe et-operational-platform from EU and US? A: No, single vantage is fine for our scale.

Bootstrap (post-merge deployment)

After this PR is merged to main, a human runs the following on vps-i1 to bring the new blackbox-exporter container online and have Prometheus pick up the new scrape job + rules + dashboard.

# 1. SSH to vps-i1 as root
ssh -i C:\Users\konar\.ssh\id_ed25519 root@217.154.82.162
 
# 2. Pull the merged changes
cd /opt/p24-infra && git pull
 
# 3. Bring up the new blackbox-exporter service
cd monitoring && docker compose up -d blackbox-exporter
 
# 4. Reload Prometheus (picks up new scrape job + rules/synthetic.yml)
#    Note: the /-/reload endpoint only works if Prometheus runs with --web.enable-lifecycle
curl -s -X POST http://localhost:9090/-/reload
 
# 5. Verify in Prometheus UI that the `blackbox` job is UP and probing 7 targets
#    Open: https://prometheus.vps-i1.infra.zintegrowana.online/targets
#    Expect: 7 instances under job=blackbox, all in state=UP
 
# 6. Verify Grafana "Synthetic checks" dashboard renders with green status
#    Open: https://grafana.vps-i1.infra.zintegrowana.online/d/synthetic-blackbox-v1
 
# 7. After ~5 min check Alertmanager for any unexpected EndpointDown firing
#    Open: https://alertmanager.vps-i1.infra.zintegrowana.online
#    If anything fires unexpectedly — open an issue, do NOT silence blindly.

Rollback (if needed):

cd /opt/p24-infra/monitoring
docker compose stop blackbox-exporter && docker compose rm -f blackbox-exporter
# Then revert the PR in git, git pull, and reload Prometheus again.