Spec 12 — Cert expiry alerts

Purpose

Caddy/Traefik auto-renew certs. But auto-renewal can fail silently: Cloudflare token expires, Let’s Encrypt rate-limit, ACME challenge port blocked, DNS-01 misconfigured. Today we’d only find out when the cert expires and clients break.

Spec 05 (Blackbox exporter) already gives us probe_ssl_earliest_cert_expiry. We just add an alert.

This spec is small but called out separately because it’s free safety net.


Rulebook

  1. Two thresholds: warn at 14 days, page at 7 days. ACME renewal happens at 30 days; if we’re under 14, something is wrong.
  2. All public hostnames covered. Anything in spec 05’s blackbox target list automatically gets this alert via probe_ssl_earliest_cert_expiry.

Implementation plan

Add to monitoring/prometheus/rules/synthetic.yml:

- alert: CertExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Cert for {{ $labels.instance }} expires in <14d"
    description: "Check Caddy/Traefik ACME logs."
 
- alert: CertExpiringCritical
  expr: probe_ssl_earliest_cert_expiry - time() < 7 * 86400
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "URGENT: cert for {{ $labels.instance }} expires in <7d"

Acceptance criteria

  • Both alerts visible in Alertmanager UI
  • Manually setting time() forward 25 days in a test rule eval shows the alert would fire (use promtool test rules)

Cost impact

0 €.

Back-out plan

Delete the two alert rules.

Risks / open questions

None.