Spec 02 — Centralized logs (Loki + Promtail)

Purpose

Today, debugging means SSHing to a VPS and running docker logs <container> --tail 200. There is no cross-container search, no retention beyond the per-container json-file rotation (max-size: 50m, max-file: 3, so at most about 150 MB of recent logs per container), and no way to correlate a Prometheus alert with the log lines that caused it.

Loki is “Prometheus for logs”: it uses labels and LogQL and integrates natively with the existing Grafana. Adding it means one Loki container on vps-i1, one Promtail container per VPS, and one new Grafana datasource.


Rulebook (operating rules)

  1. Logs are not for secrets. Configure each service to never log secrets. Promtail does not filter secret values on its own; pipeline_stages can drop matching lines as defense-in-depth, but the right fix is at the source.
  2. Retention: 14 days hot, 30 days cold. Hot = local volume; cold = Wasabi via Loki’s S3 backend.
  3. Labels are immutable schema. Don’t use high-cardinality values (user IDs, request IDs) as labels; cardinality explosions kill Loki. Reserve labels for low-cardinality dimensions such as container_name, service, severity.
  4. One Loki, multiple Promtails. Single Loki on vps-i1 (monitoring host); Promtail agent on every VPS that has containers.
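
Rule 3 in practice: a request ID stays out of the label set and is matched as a line filter instead. The sketch below builds such a query. The function name, the service value n8n and the request ID are illustrative and not taken from the repository.

```shell
# Build a LogQL query that keeps the label set small: the service is a
# label matcher, the request ID is a line filter (not a label).
logql_for_request() {
  service="$1"
  request_id="$2"
  printf '{service="%s"} |= "%s"\n' "$service" "$request_id"
}

# Example call against Loki's query_range endpoint (hypothetical values):
#   curl -G -s "https://loki.vps-i1.infra.zintegrowana.online/loki/api/v1/query_range" \
#     -u "promtail:$LOKI_PROMTAIL_PASSWORD" \
#     --data-urlencode "query=$(logql_for_request n8n req-123)"
```

Because the request ID never becomes a label, the number of streams Loki has to index stays bounded by the number of services and containers.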

Architecture

vps-i1 containers ─► Promtail (vps-i1) ──┐
vps-h1 containers ─► Promtail (vps-h1) ──┤
                                          ▼
                                       ┌──────┐
                                       │ Loki │  ← vps-i1
                                       └──┬───┘
                                          │
                              backend storage
                                          │
                            ┌─────────────┴──────────────┐
                       local FS (14d)              Wasabi (30d)
                                          │
                                          ▼
                                   Grafana datasource
                                   "Explore" + correlation
                                   with Prometheus metrics

Implementation plan

  1. Add loki service to monitoring/docker-compose.yml (single binary mode, filesystem + S3 backend).
  2. Add promtail to monitoring/docker-compose.yml (scrapes local Docker socket).
  3. On vps-h1, add promtail to hostinger/docker-compose.yml (ships to loki.vps-i1.infra.zintegrowana.online).
  4. Caddy: add loki.vps-i1.infra.zintegrowana.online route, protected by basic_auth (user promtail with a shared password, not a bearer token).
  5. Grafana provisioning: add Loki datasource (monitoring/grafana/provisioning/datasources/loki.yml).
  6. Provision dashboard: “Container logs by service” with severity filter + freetext search.
  7. Add alert: LokiIngestionStopped — no logs received from a known service in 10 min.
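
One way to express step 7 in LogQL is absent_over_time, which yields a value only when the selector matched no log lines in the window. A minimal sketch, assuming the host and service labels named in this spec; the helper name is hypothetical, and whether the rule is evaluated by Loki's ruler or by Grafana alerting is not decided here.

```shell
# Print a LogQL expression that is non-empty only when no log line with
# the given label value arrived during the window (default 10m).
ingestion_stopped_expr() {
  label="$1"
  value="$2"
  window="${3:-10m}"
  printf 'absent_over_time({%s="%s"}[%s])\n' "$label" "$value" "$window"
}

# Example: expression for "vps-h1 has been silent for 10 minutes"
#   ingestion_stopped_expr host vps-h1
```

A plain count_over_time(...) == 0 comparison tends not to work for this case, because a selector with no log lines produces no series to compare.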

Acceptance criteria

  • Grafana → Explore → Loki datasource returns logs for both vps-i1 and vps-h1 containers
  • {container_name="root-n8n-1"} |= "error" returns matches when n8n logs an error
  • Stopping Promtail on vps-h1 triggers LokiIngestionStopped alert within 15 min
  • Disk usage of Loki volume <2 GB after 14 days (verify retention rolls correctly)
  • docs/runbook.md includes “How to grep logs” recipe pointing to Grafana
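
The disk-usage criterion can be checked with a short helper like the one below. This is a sketch only: the function name is hypothetical, and the volume path in the example assumes Docker's default volume location and a Compose project prefix that this spec does not state.

```shell
# Succeed if the directory uses fewer megabytes than the given limit.
volume_under_limit() {
  path="$1"
  limit_mb="$2"
  used_mb=$(du -sm "$path" | cut -f1)
  echo "used=${used_mb}MB limit=${limit_mb}MB"
  [ "$used_mb" -lt "$limit_mb" ]
}

# Example (path is an assumption, adjust to the real volume name):
#   volume_under_limit /var/lib/docker/volumes/monitoring_loki_data/_data 2048
```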

Cost impact

Wasabi cold storage for 30-day overflow: ~1–2 GB/month → 0.01 €/month. Functionally free.

Back-out plan

Remove the loki and promtail services from the compose files; remove the loki.vps-i1.infra.zintegrowana.online route from the Caddyfile; remove the Loki datasource from Grafana; delete the loki_data volume. No service downtime, no data loss elsewhere.

Risks / open questions

  • Risk: Promtail mis-config could ship secrets to Loki. Mitigation: review service log output in PR; add pipeline_stages to drop lines matching secret patterns as defense-in-depth.
  • Q: Vector vs Promtail? A: Promtail — same vendor as Loki, simpler, sufficient for our log volume (<1 GB/day).
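
For the PR review mentioned in the first risk, a rough scan of sample log output can flag lines that look like they carry credentials. The patterns below are illustrative assumptions, not a complete list, and the function name is hypothetical.

```shell
# Read log lines on stdin and print (with line numbers) those that look
# like they contain a credential. Exit status is 0 when something matched.
scan_for_secrets() {
  grep -E -i -n \
    -e '(password|passwd|secret|token|api[_-]?key)[[:space:]]*[=:]' \
    -e 'authorization:[[:space:]]*bearer'
}

# Example:
#   docker logs root-n8n-1 --tail 200 2>&1 | scan_for_secrets
```

A match is a prompt for a human to look, not proof of a leak. Similar expressions could back a Promtail drop stage, though Promtail uses RE2 syntax, so the patterns would need checking there.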

Bootstrap

Deployment is manual after the PR with artifacts is merged. Steps must run in order; each VPS only when the previous one is healthy.

Step 1 — Generate the shared Promtail password

On any machine (paste output into both .env files in steps 3–4):

LOKI_PROMTAIL_PASSWORD=$(openssl rand -hex 24)
echo "$LOKI_PROMTAIL_PASSWORD"
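
openssl rand -hex 24 prints 24 random bytes as 48 lowercase hex characters. A small guard such as the following (the function name is hypothetical) can catch an empty or truncated value before it is pasted into an .env file.

```shell
# Succeed only if the argument is exactly 48 lowercase hex characters.
is_valid_promtail_password() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{48}$'
}

# Example:
#   is_valid_promtail_password "$LOKI_PROMTAIL_PASSWORD" || echo "password looks wrong" >&2
```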

Step 2 — Bcrypt-hash the password for Caddy basic_auth

docker run --rm caddy:2.8-alpine caddy hash-password --plaintext "$LOKI_PROMTAIL_PASSWORD"

Open monitoring/Caddyfile and replace the placeholder {bcrypt-hash-of-LOKI_PROMTAIL_PASSWORD} (one line, inside the loki.vps-i1.infra.zintegrowana.online block) with the bcrypt output. Commit that one-line change to main (or as a follow-up PR).
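
The placeholder replacement can also be scripted instead of edited by hand. The following is a sketch under the assumption that the placeholder appears exactly as written above; the function name is hypothetical. It refuses to run when the placeholder is absent, so it cannot silently overwrite a real hash.

```shell
# Replace the bcrypt placeholder in a Caddyfile with the given hash.
# Fails if the placeholder is not present in the file.
fill_caddy_hash() {
  file="$1"
  hash="$2"
  placeholder='{bcrypt-hash-of-LOKI_PROMTAIL_PASSWORD}'
  if ! grep -qF "$placeholder" "$file"; then
    echo "placeholder not found in $file" >&2
    return 1
  fi
  tmp=$(mktemp)
  # '|' is the sed delimiter because bcrypt output contains '/' but never '|'.
  if sed "s|$placeholder|$hash|" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # keep the original file's permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Example:
#   fill_caddy_hash monitoring/Caddyfile "$(docker run --rm caddy:2.8-alpine \
#     caddy hash-password --plaintext "$LOKI_PROMTAIL_PASSWORD")"
```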

Step 3 — Deploy on vps-i1 (IONOS) — Loki + local Promtail

ssh root@217.154.82.162
cd /opt/p24-infra
git pull
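# Add password to .env (first time only). The variable from step 1 is not set
# in this SSH session, so assign it here before appending.
LOKI_PROMTAIL_PASSWORD='<paste value from step 1>'
grep -q LOKI_PROMTAIL_PASSWORD monitoring/.env || echo "LOKI_PROMTAIL_PASSWORD=$LOKI_PROMTAIL_PASSWORD" >> monitoring/.env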
cd monitoring
docker compose up -d loki promtail-local
# Reload Caddy to pick up the new vhost
docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile
docker compose ps loki promtail-local
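
Since each VPS should be deployed only once the previous one is healthy, a small retry wrapper can replace manual polling. This is a sketch; the function name is hypothetical, and the example assumes Loki's HTTP port 3100 is reachable from the host, which this spec does not state.

```shell
# Run a command up to N times, sleeping between attempts.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed" >&2
    # Sleep only if another attempt follows.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: wait up to ~30 s for Loki's readiness endpoint
#   retry_until 10 3 curl -fsS http://localhost:3100/ready
```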

Step 4 — Deploy on vps-h1 (Hostinger) — remote Promtail

ssh root@72.60.32.61
cd /opt/p24-infra
git pull
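# Add password to /root/.env (used by the root-level compose). The variable from
# step 1 is not set in this SSH session, so assign it here before appending.
LOKI_PROMTAIL_PASSWORD='<paste value from step 1>'
grep -q LOKI_PROMTAIL_PASSWORD /root/.env || echo "LOKI_PROMTAIL_PASSWORD=$LOKI_PROMTAIL_PASSWORD" >> /root/.env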
# The hostinger compose mounts ./promtail relative to /root — copy the config in
mkdir -p /root/promtail
cp /opt/p24-infra/hostinger/promtail/promtail-remote.yml /root/promtail/promtail-remote.yml
cd /root
docker compose up -d promtail
docker compose logs --tail 30 promtail

Step 5 — Verify ingestion via Caddy ingress

curl -G -s "https://loki.vps-i1.infra.zintegrowana.online/loki/api/v1/labels" \
  -u "promtail:$LOKI_PROMTAIL_PASSWORD"

Expect a JSON response whose data array includes container_name, host, service, stream.
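
The expected labels can be checked mechanically rather than by eye. The helper below is a rough sketch with a hypothetical name; it does a plain string match on the response instead of parsing JSON, so a tool such as jq would be the more robust choice where it is installed.

```shell
# Read a Loki /labels response on stdin and print every expected label
# (given as arguments) that does not appear in it.
missing_labels() {
  response=$(cat)
  for label in "$@"; do
    printf '%s' "$response" | grep -q "\"$label\"" || echo "$label"
  done
}

# Example: empty output means all four labels are present
#   curl -G -s "https://loki.vps-i1.infra.zintegrowana.online/loki/api/v1/labels" \
#     -u "promtail:$LOKI_PROMTAIL_PASSWORD" \
#     | missing_labels container_name host service stream
```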

Step 6 — Verify in Grafana

  1. Open https://grafana.vps-i1.infra.zintegrowana.online.
  2. Explore → datasource Loki → run {host="vps-i1"} → logs appear.
  3. Re-run with {host="vps-h1"} → logs from Hostinger containers appear.
  4. Open the “Container logs” dashboard and filter by host/container/search.

If either host returns nothing, check Promtail logs on the silent VPS first; see docs/runbook.md, section “Alert: LokiIngestionStopped”.