Spec 12 — Cert expiry alerts
Purpose
Caddy/Traefik auto-renew certs. But auto-renewal can fail silently: Cloudflare token expires, Let’s Encrypt rate limit hit, ACME challenge port blocked, DNS-01 misconfigured. Today we’d only find out when the cert expires and clients break.
Spec 05 (Blackbox exporter) already gives us probe_ssl_earliest_cert_expiry. We just add an alert.
This spec is small but called out separately because it’s a free safety net.
Rulebook
- Two thresholds: warn at 14 days, page at 7 days. ACME renewal happens at 30 days; if we’re under 14, something is wrong.
- All public hostnames covered. Anything in spec 05’s blackbox target list automatically gets this alert via probe_ssl_earliest_cert_expiry.
Implementation plan
Add to monitoring/prometheus/rules/synthetic.yml:

  - alert: CertExpiringSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Cert for {{ $labels.instance }} expires in <14d"
      description: "Check Caddy/Traefik ACME logs."
  - alert: CertExpiringCritical
    expr: probe_ssl_earliest_cert_expiry - time() < 7 * 86400
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "URGENT: cert for {{ $labels.instance }} expires in <7d"

Acceptance criteria
- Both alerts visible in Alertmanager UI
- Manually setting time() forward 25 days in a test rule eval shows the alert would fire (use promtool test rules)
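The promtool unit test the acceptance criteria ask for, written out as an added example (it is not part of the original spec). It assumes the test file sits next to synthetic.yml and that the two rules above are in that file; the instance and job labels and the 30-day expiry value are constructed sample data. promtool’s clock starts at Unix time 0, so "setting time() forward 25 days" becomes an eval_time of 25d against a cert whose expiry timestamp is day 30:

```yaml
# cert_expiry_test.yml (added example, not from the spec)
# Run from the rules directory:
#   promtool test rules cert_expiry_test.yml
# Paths in rule_files are resolved relative to this file.
rule_files:
  - synthetic.yml

evaluation_interval: 1h

tests:
  - interval: 1h
    input_series:
      # Constructed series: cert valid until t = 30d (2592000 s after promtool's
      # zero clock), held constant for 600 hourly samples = 25 days.
      - series: 'probe_ssl_earliest_cert_expiry{instance="https://example.invalid", job="blackbox"}'
        values: '2592000x600'
    alert_rule_test:
      # Day 10: 20 days left, above both thresholds, nothing fires.
      - eval_time: 10d
        alertname: CertExpiringSoon
        exp_alerts: []
      - eval_time: 10d
        alertname: CertExpiringCritical
        exp_alerts: []
      # Day 20: 10 days left, warning fires (pending since day 16, "for: 1h" long passed),
      # critical does not.
      - eval_time: 20d
        alertname: CertExpiringSoon
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: https://example.invalid
              job: blackbox
            exp_annotations:
              summary: "Cert for https://example.invalid expires in <14d"
              description: "Check Caddy/Traefik ACME logs."
      - eval_time: 20d
        alertname: CertExpiringCritical
        exp_alerts: []
      # Day 25 (the "forward 25 days" case): 5 days left, both rules fire.
      - eval_time: 25d
        alertname: CertExpiringSoon
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: https://example.invalid
              job: blackbox
            exp_annotations:
              summary: "Cert for https://example.invalid expires in <14d"
              description: "Check Caddy/Traefik ACME logs."
      - eval_time: 25d
        alertname: CertExpiringCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: https://example.invalid
              job: blackbox
            exp_annotations:
              summary: "URGENT: cert for https://example.invalid expires in <7d"
```

exp_labels lists the series labels plus the rule’s own labels; promtool drops alertname and __name__ before comparing, and it expands the {{ $labels.instance }} templates, so the annotations are compared fully rendered. One caveat worth keeping in mind when reading a green test: the blackbox exporter only exports probe_ssl_earliest_cert_expiry when the TLS handshake actually succeeded, so a target whose probe fails outright has no series at all and this rule stays silent for it. That case is the probe-failure alert’s job, not this one’s.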
Cost impact
0 €.
Back-out plan
Delete the two alert rules.
Risks / open questions
None.