IONOS VPS — Operations Runbook

For Hostinger VPS (vps-h1) operations, see hostinger-runbook.md.

Server: 217.154.82.162 (AlmaLinux 9.7, 7.4 GB RAM, 239 GB disk)
SSH: ssh root@217.154.82.162 (password in .env.local)


Observation layers

Infra health is observed at three layers, each catching a different failure class. When triaging an alert, identify which layer fired — it tells you what is and isn’t already covered:

  • .github/workflows/health-check.yml — every 6 h. Checks GitHub Actions runner status (API) and Supabase reachability (API). Auto-opens/closes the server-down issue and posts to Discord only on UP↔DOWN transitions.
  • monitoring/prometheus/rules/ — continuous. infrastructure.yml covers VPS/container state (node down, container crashlooping, disk/memory/CPU thresholds); queues.yml covers Supabase queue depths.
  • Prometheus blackbox synthetic probes — continuous HTTP probes for OpenClaw, Traccar, n8n, WAHA (when spec 05 lands). Until then health-check.yml carries belt-and-braces HTTP checks for those endpoints.
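
The transition-only notification described for health-check.yml can be sketched as below. This is a minimal illustration, not the workflow's actual code: the function name should_notify and the UP/DOWN state strings passed to it are assumptions made for the example.

```shell
# Sketch (assumption): compare the previous and current probe state and
# report whether a notification is due. Names are illustrative only.
should_notify() {
  prev="$1"
  curr="$2"
  if [ "$prev" != "$curr" ]; then
    # State flipped (UP -> DOWN or DOWN -> UP): worth a Discord post.
    echo "notify: $prev -> $curr"
  else
    # Same state as the last run: stay quiet to avoid alert spam.
    echo "quiet: still $curr"
  fi
}
```

Called as should_notify UP DOWN it reports a notification; with two identical states it stays quiet, so a server that remains down across several 6 h runs produces a single post.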

Alert: ServerDown

Symptom: Prometheus ServerDown alert fires. Node exporter unreachable.

# 1. Ping the server
ping 217.154.82.162
 
# 2. Try SSH
ssh root@217.154.82.162
 
# 3. If no SSH — log into IONOS Cloud console and check VPS state
#    https://my.ionos.com → VPS → 217.154.82.162
 
# 4. If the VPS is running but SSH is blocked, check the firewall
#    (run this from the IONOS KVM console, since SSH is unavailable)
firewall-cmd --list-all
 
# 5. Restart node_exporter if running but not scraping
systemctl restart node_exporter
systemctl status node_exporter

Alert: ContainerCrashLooping / ContainerHighRestarts

Symptom: A Docker container has restarted >3 times in 1h.

ssh root@217.154.82.162
 
# Identify the crashing container
docker ps -a
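# --until makes the command exit after printing past events instead of following live ones
docker events --filter type=container --filter event=die --since 1h --until "$(date +%s)"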
 
# Check logs
docker logs <container-name> --tail 100
 
# Check healthcheck status (prints "null" if the container defines no healthcheck)
docker inspect --format '{{json .State.Health}}' <container-name>
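
To see which container crosses the ">3 restarts in 1h" line, the die events can be tallied per container. The helper below is a sketch under assumptions: the name crashloopers is hypothetical, and it expects one container name per line on stdin, one line per die event.

```shell
# Sketch (assumption): read container names from stdin, one per "die" event,
# and print "<name> <count>" for every name seen more than THRESHOLD times.
crashloopers() {
  threshold="${1:-3}"
  # sort + uniq -c yields "<count> <name>"; awk keeps rows above the threshold.
  sort | uniq -c | awk -v t="$threshold" '$1 > t { print $2, $1 }'
}
```

One possible way to feed it: docker events --filter type=container --filter event=die --since 1h --until "$(date +%s)" --format '{{.Actor.Attributes.name}}' | crashloopers 3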

OpenClaw crashing

cd /root/openclaw
docker compose logs openclaw-gateway --tail 50
docker compose restart openclaw-gateway
 
# If token expired — regenerate and update .env
openssl rand -hex 32   # new OPENCLAW_GATEWAY_TOKEN
nano .env
docker compose up -d

Traccar crashing

cd /root/traccar
docker compose logs traccar --tail 50
 
# DB connection issues?
docker compose logs db --tail 30
docker compose restart db
docker compose restart traccar

Monitoring stack (Prometheus/Grafana/etc.)

cd /opt/p24-infra/monitoring
docker compose logs prometheus --tail 50
docker compose logs grafana --tail 50
docker compose ps
 
# Full restart
docker compose down && docker compose up -d

Alert: LowDisk

Symptom: Disk free < 15% on /.

ssh root@217.154.82.162
 
# Check usage
df -h /
du -sh /var/lib/docker/*
 
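# Clean Docker (removes stopped containers, unused networks, dangling images, and build cache;
# volumes are kept unless you add --volumes, tagged-but-unused images unless you add -a)
docker system prune -f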
 
# Clean old Prometheus TSDB blocks (already uploaded to Thanos/Wasabi)
# Thanos sidecar handles this automatically — check if sidecar is running
docker logs monitoring-thanos-sidecar-1 --tail 30
 
# Clean old logs
journalctl --vacuum-time=7d
find /root -type f -name "*.log" -mtime +30 -delete 2>/dev/null
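
The "free < 15%" condition can be approximated from df output when Prometheus itself is unreachable. This is a sketch, not the alert's actual expression (the rule in infrastructure.yml is not shown here); disk_free_pct and low_disk are hypothetical helper names.

```shell
# Sketch (assumption): derive the free percentage from `df -P` output on stdin.
# In POSIX df format the filesystem row has $2 = total blocks, $4 = available.
disk_free_pct() {
  awk 'NR == 2 { printf "%d\n", ($4 * 100) / $2 }'
}

# Succeeds (exit 0) when the given percentage is below the limit (default 15).
low_disk() {
  pct="$1"
  limit="${2:-15}"
  [ "$pct" -lt "$limit" ]
}
```

For example: pct="$(df -P / | disk_free_pct)"; low_disk "$pct" && echo "LowDisk condition met: ${pct}% free"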

Alert: HighMemory (>85%)

ssh root@217.154.82.162
 
# Check memory usage
free -h
ps aux --sort=-%mem | head -15
 
# Java (Traccar) is the biggest consumer — check if heap is too large
docker stats traccar --no-stream
# If needed, reduce the heap in JAVA_OPTS in /root/traccar/docker-compose.yml
# (e.g. -Xmx512m -> -Xmx384m), then from /root/traccar run: docker compose up -d
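
For a quick number to compare against the 85% threshold, used memory can be derived from /proc/meminfo. This is a sketch under the assumption that "used" means MemTotal minus MemAvailable; the alert's real expression may differ, and mem_used_pct is a hypothetical name.

```shell
# Sketch (assumption): compute used-memory percent from /proc/meminfo-style
# input on stdin as (MemTotal - MemAvailable) / MemTotal.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}
```

On the server: mem_used_pct < /proc/meminfo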

Alert: HighCPU (>80% for 5m)

ssh root@217.154.82.162
top -b -n1 | head -20
docker stats --no-stream

Alert: TranscriptionQueueCritical / StuckProcessingJobs

Symptom: Supabase queue depth >200 or jobs stuck >35 min.

# Check queue-exporter logs
docker logs monitoring-queue-exporter-1 --tail 50
 
# Check Supabase directly (connection string in monitoring/.env)
# Look at pgmq queues in Supabase dashboard

Alert: GH Runner offline

Symptom: runner-et or runner-kdp shows as offline in health-check CI.

ssh root@217.154.82.162
 
# Check service status
systemctl status actions.runner.radieu-et-operational-platform.ionos.service
systemctl status actions.runner.radieu-amazon-kdp-tango.kdp-ionos-runner.service
 
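# If failed, restart the affected runner
systemctl restart actions.runner.radieu-et-operational-platform.ionos.service
systemctl restart actions.runner.radieu-amazon-kdp-tango.kdp-ionos-runner.service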
 
# If "Repository not found" error in logs — token expired
# See: services/github-runners/README.md for re-registration steps
journalctl -u actions.runner.radieu-et-operational-platform.ionos.service --since "1h ago" | tail -30

GH Runner hstgr (Hostinger — hstgr-srv1072950)

Symptom: runner-hstgr shows as offline in health-check CI.

The Hostinger runner (hstgr-srv1072950) runs as a Docker container on vps-h1 (72.60.32.61).

ssh root@72.60.32.61
 
# Check all running containers (run from the compose project directory, /root on vps-h1)
cd /root
docker compose ps
 
# Find the runner container
docker ps | grep runner
 
# Check runner logs
docker logs <runner-container-name> --tail 50
 
# Restart the runner container
docker restart <runner-container-name>
 
# If token expired — re-register via GitHub API
# gh api repos/radieu/et-operational-platform/actions/runners/registration-token -X POST
# Then re-run ./config.sh with the new token inside the container

Alert: n8n Hostinger down / Traefik TLS issue

Symptom: n8n-hstgr shows as FAIL in health-check CI (HTTP != 200).

ssh root@72.60.32.61
 
# Check compose stack status (run from the compose project directory, /root on vps-h1)
cd /root
docker compose ps
 
# Check n8n logs
docker compose logs n8n --tail 50
 
# Restart n8n
docker compose restart n8n
 
# If Traefik can't obtain/renew TLS cert (acme challenge failing):
docker compose logs traefik --tail 50 | grep -i "acme\|cert\|error"
 
# Traefik cert storage
docker volume inspect traefik_data
# Remove the stale acme.json and let Traefik re-request (brief outage).
# Delete the file while the container is still running; docker exec fails on a stopped container:
# docker compose exec traefik rm /letsencrypt/acme.json
# docker compose restart traefik
 
# Full Hostinger stack restart
docker compose down && docker compose up -d

Thanos / Wasabi S3 upload stalled

# docker logs replays the container's stderr on stderr, so merge it before grep
docker logs monitoring-thanos-sidecar-1 --tail 50 2>&1 | grep -E "error|upload|block"
 
# Verify Wasabi credentials are valid
docker exec monitoring-thanos-sidecar-1 \
  thanos tools bucket ls --objstore.config-file /etc/thanos/s3.yml
 
# Restart sidecar
cd /opt/p24-infra/monitoring && docker compose restart thanos-sidecar

Alert: LokiIngestionStopped

Symptom: Prometheus LokiIngestionStopped alert fires — no log lines received by Loki in 15 min (rate(loki_distributor_lines_received_total[15m]) == 0 for 10 min).

Loki sits on vps-i1; both Promtails ship to it via the loki.vps-i1.infra.zintegrowana.online Caddy ingress (basic_auth promtail:$LOKI_PROMTAIL_PASSWORD). A “no logs at all” condition means both Promtails are silent — either both crashed, or the Loki ingestion path is broken.

# 1. Confirm Loki itself is healthy
ssh root@217.154.82.162
cd /opt/p24-infra/monitoring
docker compose ps loki
docker compose logs --tail 100 loki
curl -s http://localhost:3100/ready          # expect: "ready"
curl -s http://localhost:3100/metrics | grep loki_distributor_lines_received_total
 
# 2. Check local Promtail (vps-i1)
docker compose ps promtail-local
docker compose logs --tail 100 promtail-local
# Should see "Adding target" lines for each running container.
# If 'connection refused' to loki:3100 — Loki container is down, see step 1.
 
# 3. Check remote Promtail (vps-h1)
ssh root@72.60.32.61
cd /root
docker compose ps promtail
docker compose logs --tail 100 promtail
# Look for HTTP errors against loki.vps-i1.infra.zintegrowana.online.
# 401 → basic_auth password mismatch; verify LOKI_PROMTAIL_PASSWORD in /root/.env
#       matches the bcrypt hash in monitoring/Caddyfile on vps-i1.
# 5xx / timeout → check Caddy on vps-i1 and Loki health.
 
# 4. Verify network path vps-h1 → vps-i1 ingress
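ssh root@72.60.32.61
# Load the password from /root/.env first; it is not set in a fresh shell.
# (Assumes an unquoted KEY=value line.)
LOKI_PROMTAIL_PASSWORD="$(grep '^LOKI_PROMTAIL_PASSWORD=' /root/.env | cut -d= -f2-)"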
curl -G -s -o /dev/null -w "%{http_code}\n" \
  "https://loki.vps-i1.infra.zintegrowana.online/loki/api/v1/labels" \
  -u "promtail:$LOKI_PROMTAIL_PASSWORD"
# Expect 200. 401 = wrong password. 502 = Loki down. Timeout = firewall/DNS.
 
# 5. Restart whichever Promtail is silent
docker compose restart promtail        # on vps-h1
# or
docker compose restart promtail-local  # on vps-i1

If both Promtails look healthy but Loki receives nothing: suspect the Caddy basic_auth (re-run caddy hash-password and re-deploy) or the Loki HTTP listener (restart loki).
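
The status codes from step 4 can be turned into a quick verdict. This is a small sketch; the function name loki_ingress_diagnosis is hypothetical, and the mapping simply restates the expectations listed in that step (curl reports 000 when it gets no HTTP response at all).

```shell
# Sketch (assumption): map the HTTP code printed by the step 4 curl to the
# diagnosis listed in this runbook.
loki_ingress_diagnosis() {
  case "$1" in
    200) echo "ok" ;;
    401) echo "wrong password" ;;
    502) echo "Loki down" ;;
    000) echo "timeout: firewall/DNS" ;;
    *)   echo "unexpected: $1" ;;
  esac
}
```

For example, loki_ingress_diagnosis 401 prints "wrong password", which points at LOKI_PROMTAIL_PASSWORD rather than at Loki itself.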


Disaster Recovery — restore Prometheus data from Wasabi

# 1. List available blocks in Wasabi
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  quay.io/thanos/thanos:latest \
  tools bucket ls --objstore.config-file /s3.yml
 
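# 2. Download blocks to local Prometheus dir
#    NOTE: `thanos tools bucket rewrite` is a block-rewriting command; confirm with
#    `--help` on the Thanos version in use that it accepts --output-dir before
#    relying on this step in a real restore.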
docker run --rm \
  -v /opt/p24-infra/monitoring/thanos/s3.yml:/s3.yml:ro \
  -v prometheus-data:/prometheus \
  quay.io/thanos/thanos:latest \
  tools bucket rewrite --objstore.config-file /s3.yml \
  --id <BLOCK_ULID> --output-dir /prometheus
 
# 3. Restart Prometheus
cd /opt/p24-infra/monitoring && docker compose restart prometheus

Alert: BackupStale

Symptom: Prometheus alert BackupStale fires — (time() - backup_last_success_timestamp) > 93600 (26h) for a given host. Backups have not completed for more than a day.

# Identify which host stopped backing up — alert labels carry host="vps-h1" or "vps-i1".
# Look at last lines of backup log on the affected host
ssh root@<vps> 'tail -50 /var/log/p24-backup.log'
 
# Common causes:
#  - Wasabi creds expired -> rotate $WASABI_BACKUP_ACCESS_KEY / $WASABI_BACKUP_SECRET_KEY
#    (edit /root/.backup-env on the VPS; then re-run the script manually)
#  - age key missing -> /root/.age/backup.key gone (re-provision from 1Password)
#  - n8n API down -> check n8n container health (vps-h1 only)
#  - Grafana API token revoked -> rotate GRAFANA_API_TOKEN (vps-i1 only)
#  - Docker volume path changed -> verify mount paths in backup-{hstgr,ionos}.sh
#  - Disk full on VPS -> /tmp out of space; df -h
 
# Manual run to confirm fix (Hostinger):
ssh root@72.60.32.61 '/opt/p24-infra/scripts/backup-hstgr.sh'
 
# Manual run to confirm fix (IONOS):
ssh root@217.154.82.162 '/opt/p24-infra/scripts/backup-ionos.sh'
 
# Verify the success metric was written:
ssh root@<vps> 'cat /var/lib/node_exporter/textfile_collector/backup_last_success_timestamp.prom'
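
The staleness condition (age > 93600 s) can be checked by hand against the textfile metric. This is a sketch under assumptions: backup_age_seconds and backup_is_stale are hypothetical names, and the .prom file is assumed to hold the timestamp as the last field of a non-comment line.

```shell
# Sketch (assumption): print the age in seconds of the timestamp stored in a
# textfile-collector file. $1 = path, $2 = "now" in Unix seconds (optional).
backup_age_seconds() {
  now="${2:-$(date +%s)}"
  # Skip comment lines; the sample value is the last field of the metric line.
  awk -v now="$now" '!/^#/ && NF >= 2 { ts = $NF }
       END { if (ts != "") printf "%d\n", now - ts }' "$1"
}

# Succeeds (exit 0) when the age exceeds the 26 h threshold used by the alert.
backup_is_stale() {
  [ "$1" -gt 93600 ]
}
```

For example: age="$(backup_age_seconds /var/lib/node_exporter/textfile_collector/backup_last_success_timestamp.prom)"; backup_is_stale "$age" && echo "stale: ${age}s"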

Alert: BackupSizeRegression

Symptom: Prometheus alert BackupSizeRegression fires — backup_last_size_bytes dropped to less than 50% of its 7-day average. Possible silent corruption (e.g., a service stopped, an export endpoint changed, a tar source dir disappeared).

# Compare last few sizes — a single tiny backup is the smoking gun.
ssh root@<vps> 'tail -100 /var/log/p24-backup.log | grep -E "size|SUCCESS|FAILED"'
 
# Common causes:
#  - n8n container stopped/crashed -> SQLite dump is empty, workflows.json is 0 bytes
#  - WAHA volume unmounted or renamed -> waha-session.tar.gz is ~empty
#  - Traccar DB hosed -> mysqldump produced an empty file
#  - Grafana API auth broken -> JSON exports are all error bodies
 
# Drill the latest backup locally to confirm contents (uses the stub today —
# will be replaced post-deployment per spec 01 follow-up):
ssh root@<vps> '/opt/p24-infra/scripts/backup-restore-drill.sh <vps-label>'
 
# Or pull the latest object and inspect it manually:
aws --endpoint-url https://s3.eu-central-1.wasabisys.com s3 ls \
    s3://ecotrans-backups/<vps-label>/ --recursive | tail
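
The "less than 50% of the recent average" test can be reproduced on a list of sizes pulled from the log or the bucket listing. This is a sketch, not the alert rule itself: size_regressed is a hypothetical name, and it assumes the baseline is the plain mean of all earlier values, excluding the latest one.

```shell
# Sketch (assumption): read backup sizes in bytes from stdin, one per line,
# oldest first. The last line is the latest backup; the earlier lines form
# the baseline. Exit 0 = regression, 1 = fine, 2 = not enough data.
size_regressed() {
  awk '{ v[NR] = $1 + 0 }
       END {
         if (NR < 2) exit 2
         for (i = 1; i < NR; i++) sum += v[i]
         avg = sum / (NR - 1)
         if (v[NR] < 0.5 * avg) exit 0
         exit 1
       }'
}
```

For example: printf '%s\n' 104857600 104857600 1024 | size_regressed && echo "latest backup is suspiciously small"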

Alert: EndpointDown

Symptom: A Blackbox synthetic probe has been failing (probe_success == 0) for >2 min. Spec 05 covers public endpoints: et-operational-platform Vercel deployments, infra.zintegrowana.online (Grafana), grafana.vps-i1, n8n.vps-h1, waha2.vps-h1, eco-trans.eu.

# 1. Identify the failing target from the alert label `instance`
#    (e.g. https://n8n.vps-h1.infra.zintegrowana.online/healthz)
 
# 2. Probe manually from your workstation
curl -v --max-time 10 <instance-url>
 
# 3. Probe from inside the monitoring stack (eliminates client-side issues)
ssh root@217.154.82.162
docker exec monitoring-blackbox-exporter-1 wget -qO- \
  "http://localhost:9115/probe?module=http_2xx&target=<instance-url>&debug=true" | tail -50
 
# 4. If target is a Vercel deployment — check Vercel dashboard for deployment status
#    https://vercel.com/radieus-projects/et-operational-platform
 
# 5. If target is on a VPS — SSH and check the upstream service:
#    - n8n.vps-h1 → ssh root@72.60.32.61 'docker logs root-n8n-1 --tail=50'
#    - waha2.vps-h1 → ssh root@72.60.32.61 'docker logs waha --tail=50'
#    - grafana.vps-i1 → ssh root@217.154.82.162 'docker logs monitoring-grafana-1 --tail=50'
 
# 6. Reverse proxy layer — check Caddy (IONOS) or Traefik (Hostinger)
ssh root@217.154.82.162 'docker logs monitoring-caddy-1 --tail=100'
ssh root@72.60.32.61   'docker logs root-traefik-1 --tail=100'
 
# 7. DNS — confirm the hostname still resolves
dig +short <hostname>
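
Step 1 yields a full URL while step 7 needs a bare hostname. A small helper can bridge the two; this is a sketch using only shell parameter expansion, host_from_url is a hypothetical name, and bracketed IPv6 literals are not handled.

```shell
# Sketch (assumption): strip scheme, path, userinfo and port from a URL and
# print the remaining hostname.
host_from_url() {
  h="${1#*://}"   # drop the scheme, if any
  h="${h%%/*}"    # drop the path
  h="${h##*@}"    # drop userinfo
  h="${h%%:*}"    # drop the port
  printf '%s\n' "$h"
}
```

For example: dig +short "$(host_from_url https://n8n.vps-h1.infra.zintegrowana.online/healthz)"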

Alert: EndpointSlow

Symptom: Blackbox probe latency probe_duration_seconds > 2 for 5 min on a target. Not an outage, but a degradation signal — could be cold Vercel starts, an overloaded VPS, or upstream API throttling.

# 1. Identify target + duration trend from the Grafana "Synthetic checks" dashboard
#    https://grafana.vps-i1.infra.zintegrowana.online/d/synthetic-blackbox-v1
 
# 2. Time the request locally
curl -o /dev/null -s -w "total=%{time_total}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s\n" <instance-url>
 
# 3. If it's a Vercel app — check function runtime in Vercel dashboard
#    (cold start vs. warm; consider Vercel Functions logs)
 
# 4. If it's a VPS target — check load on the host
ssh <vps> 'uptime && top -bn1 | head -20'
 
# 5. Inspect blackbox `debug=true` output for which phase is slow
ssh root@217.154.82.162
docker exec monitoring-blackbox-exporter-1 wget -qO- \
  "http://localhost:9115/probe?module=http_2xx&target=<instance-url>&debug=true"
 
# 6. If steady-state >2s for >1h, escalate to an issue; if transient, snooze the alert
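
The curl timings from step 2 are cumulative, so the slow phase is the largest difference between consecutive values. The helper below is a sketch of that arithmetic; slowest_phase and its phase labels are hypothetical, and "connect" here includes DNS resolution because time_connect is measured from the start of the request.

```shell
# Sketch (assumption): given curl's cumulative timings in seconds, print the
# phase with the largest share.
# $1 = time_connect, $2 = time_appconnect, $3 = time_starttransfer, $4 = time_total
slowest_phase() {
  awk -v c="$1" -v a="$2" -v s="$3" -v t="$4" 'BEGIN {
    if (a + 0 == 0) a = c          # plain HTTP: no TLS handshake was recorded
    p["connect"]  = c              # DNS + TCP connect
    p["tls"]      = a - c          # TLS handshake
    p["server"]   = s - a          # wait for the first response byte
    p["transfer"] = t - s          # body download
    best = "connect"
    for (k in p) if (p[k] > p[best]) best = k
    print best
  }'
}
```

For example, slowest_phase 0.05 0.15 1.90 2.10 prints "server", which points at the upstream service rather than the network or TLS layer.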

Alert: Cost alerts (BudgetWarning family)

Symptoms: One of VercelInvocationsApproachingFreeTier, SupabaseDbSizeApproachingProTier, WasabiBucketGrowthSpike, or CostCollectorStale fires. The cost-exporter (spec 11) pulls provider-side usage daily and these rules surface budget pressure before the monthly bill lands.

# 1. Open the Costs dashboard
#    https://grafana.vps-i1.infra.zintegrowana.online/d/costs-v1
#    Identify which provider tripped the threshold + the trend.
 
# 2. Check exporter health
ssh root@217.154.82.162
docker logs monitoring-cost-exporter-1 --tail 80
curl -s http://localhost:9210/metrics | grep -E '^(cost_collector_errors_total|cost_collector_last_success_timestamp_seconds)'
 
# 3. Decide: scale up the plan, drop usage, or just adjust the threshold.
#    Examples:
#    - Vercel near 100k/month — confirm Hobby vs Pro is right; bump alert
#      threshold to 80% of new plan's quota.
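#    - Supabase DB near 8 GB: archive old rows and reclaim space (plain `VACUUM`
#      only frees space for reuse; `VACUUM FULL` returns it to the OS but takes
#      an exclusive lock), or accept the $0.125/GB overage and raise the threshold.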
#    - Wasabi spike — find which service is uploading the new data
#      (Thanos? n8n backups? log shipper?). Almost always a misconfig.
 
# 4. Token refresh — if `cost_collector_errors_total{collector="..."}` is
#    incrementing every cycle, the upstream token has likely been revoked.
#    Rotate per the BOOTSTRAP section of docs/improvements/11-cost-dashboard.md.
 
# 5. Document the resolution in the issue or as a comment on the alert.

Routine maintenance

After every deploy of one of our custom images (pdf-service, queue-exporter, report-scheduler), glance at Grafana → Container versions dashboard to confirm the new git SHA is live. See spec 10 for the rebuild commands.

Monthly OS updates (AlmaLinux)

ssh root@217.154.82.162
dnf check-update
dnf update -y
reboot   # if kernel updated

Docker image updates

ssh root@217.154.82.162
 
# Pull latest images for each service
cd /opt/p24-infra/monitoring && docker compose pull && docker compose up -d
cd /root/traccar && docker compose pull && docker compose up -d
 
# OpenClaw: built locally — update from source
cd /root/openclaw && git pull && docker build -t openclaw:local . && docker compose up -d

Check for stale images

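# Sort by creation time (second, tab-separated field) so the oldest images come first
docker images --format "{{.Repository}}:{{.Tag}}\t{{.CreatedAt}}" | sort -t "$(printf '\t')" -k2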
docker image prune -f

Alert: Trivy CRITICAL findings

Source: nightly Trivy image scan workflow (04:00 UTC). On CRITICAL findings it opens (or comments on) a security,bug GitHub issue and posts a Discord summary.

CVE policy (spec 08): CRITICAL → fix within 7 days. HIGH → within 30. MEDIUM/LOW → batched quarterly.
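
The policy can be expressed as a lookup when triaging a report by hand or in a script. This is a sketch that only restates the deadlines above; the function name cve_sla_days is hypothetical.

```shell
# Sketch (assumption): map a Trivy severity string to the fix window from the
# CVE policy. Prints a number of days, "quarterly", or "unknown".
cve_sla_days() {
  case "$1" in
    CRITICAL)   echo 7 ;;
    HIGH)       echo 30 ;;
    MEDIUM|LOW) echo quarterly ;;
    *)          echo unknown ;;
  esac
}
```

For example, cve_sla_days HIGH prints 30.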

Procedure

  1. Pull the report. Open the linked workflow run, download the trivy-reports-<run-id> artefact.
  2. Identify CVEs. Open <image>__<tag>.json, search for "Severity": "CRITICAL" entries; note CVE IDs, affected packages, and FixedVersion.
  3. Check for upstream fix. Visit the image’s Docker Hub / quay page or upstream repo. If a patched tag exists:
    • Manually bump the tag in the relevant docker-compose.yml, or
    • Wait for the next Renovate PR (weekend schedule) and review.
  4. No upstream patch yet? Assess exploitability in our context:
    • Is the vulnerable code path reachable from our deployment? (e.g. CVE in a CLI subcommand we never invoke = not exploitable)
    • Reachable → mitigate: drop the feature, add Caddy/Traefik WAF rule, restrict network, or replace the image.
    • Not reachable → document the deferral as a comment on the GH issue with rationale; revisit weekly.
  5. Verify. After fix is deployed, the next nightly run shows fewer findings; the issue auto-comments with the new count. Close the issue manually once CRITICAL count is 0.

Manually trigger a scan

gh workflow run trivy-scan.yml --repo radieu/p24-infra
gh run watch --repo radieu/p24-infra

Alert: SecretsSyncFailed

Symptom: The .github/workflows/secrets-sync.yml workflow run finished with status failure after a push to main touching secrets/** or .sops.yaml.

# 1. Open the failed run
gh run list --workflow secrets-sync.yml --limit 5
gh run view <RUN_ID> --log-failed | head -200
 
# 2. Most common causes
#    a) AGE_KEY_GHA missing or malformed -> "no age keys found"
#    b) VPS_SSH_PRIVATE_KEY / VPS_ROOT_SSH_KEY missing or wrong -> "Permission denied (publickey)"
#    c) sops file shape changed (wrong recipient in .sops.yaml) -> "no key could decrypt the data"
 
# 3. Verify GH Secrets exist
gh secret list --repo radieu/p24-infra | grep -E 'AGE_KEY_GHA|VPS_SSH|P24_INFRA_GH_TOKEN'
 
# 4. Re-run the workflow once the underlying cause is fixed
gh workflow run secrets-sync.yml --repo radieu/p24-infra
 
# 5. If the issue is .sops.yaml recipient drift:
#    Locally:  sops -d secrets/<file>.sops.yaml      (must succeed with your personal key)
#    Then:     sops updatekeys secrets/<file>.sops.yaml
#    Commit + push - sync workflow re-runs automatically.

Important: Never paste decrypted values into the failed-run UI to debug. Use sha256sum | head -c 12 fingerprints to verify continuity instead.
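
The fingerprint idiom can be wrapped in a helper so the same 12 characters are compared on both sides. This is a minimal sketch; the function name fingerprint is hypothetical, and it assumes GNU coreutils (sha256sum, head -c) as found on both VPSes.

```shell
# Sketch (assumption): print the first 12 hex characters of the SHA-256 of
# stdin, so two copies of a secret can be compared without revealing it.
fingerprint() {
  sha256sum | head -c 12
  echo   # terminate the line; head -c cuts before the newline
}
```

For example: printf '%s' "$SOME_SECRET" | fingerprint, run once locally and once on the VPS, with SOME_SECRET standing in for the variable under test. Use printf rather than echo so a trailing newline does not change the hash.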


Alert: AgeKeyMissing

Symptom: A VPS service fails to start because /root/.age/secrets.key is missing, or the boot-time sops -d step exits with failed to load age private key.

ssh root@<vps-ip>
 
# 1. Confirm the file is genuinely gone
ls -l /root/.age/
# Expected: secrets.key (mode 0600, owned by root)
 
# 2. Restore from 1Password backup
#    Open 1Password -> "p24-infra age - <vps-label>" -> copy the AGE-SECRET-KEY-1... line
mkdir -p /root/.age
cat > /root/.age/secrets.key <<'KEY'
# created: ...
# public key: age1...
AGE-SECRET-KEY-1...
KEY
chmod 600 /root/.age/secrets.key
chown root:root /root/.age/secrets.key
 
# 3. Verify it can decrypt (discard the output so no secret reaches the terminal)
SOPS_AGE_KEY_FILE=/root/.age/secrets.key sops -d /opt/p24-infra/secrets/shared.sops.yaml > /dev/null && echo "decrypt OK"
 
# 4. Re-run secrets-sync to regenerate /opt/p24-infra/monitoring/.env locally
gh workflow run secrets-sync.yml --repo radieu/p24-infra

If 1Password is also unavailable, this VPS has lost the ability to decrypt. Mitigation: generate a new keypair, add the new public key to .sops.yaml from a still-working recipient (developer machine), run sops updatekeys on each secrets/*.sops.yaml file to re-encrypt for the new recipient, then commit and push. The sync workflow then ships the updated files.


Procedure: emergency secret rotation

Use when a secret is known-compromised (leaked in a public commit, screen-share, LLM session, chat).

# 1. Revoke at the source FIRST (before anything else)
#    - Anthropic Console -> API keys -> Revoke
#    - GitHub -> Settings -> Developer settings -> PATs -> Revoke
#    - Supabase -> Project settings -> API -> Roll service_role
#    - Vercel -> Settings -> Tokens -> Delete
#    - Sentry -> Settings -> Auth Tokens -> Revoke
 
# 2. Generate new value at the same provider, copy to clipboard
 
# 3. Update sops file (sops auto-encrypts on save)
sops edit secrets/shared.sops.yaml
# ...paste new value...
 
# 4. Commit + push
git commit -am "fix(secrets): rotate <SECRET_NAME> - compromised"
git push origin main
 
# 5. Watch the sync workflow to green
gh run watch
 
# 6. Verify the value live on the VPS (fingerprint only - never echo the value)
ssh root@<vps-ip> 'grep <KEY> /opt/p24-infra/monitoring/.env | sha256sum | head -c 12'
 
# 7. Append to docs/secrets-rotation-log.md
#    | 2026-MM-DD | <SECRET_NAME> | compromise | <handle> | yes |
 
# 8. If the secret was committed in plaintext at any point in history, assume the
#    old value is permanently public - rotation is the only safe response.

Timeline target: revoke -> new value live -> log entry - within 60 minutes of detection.


Alert: N8nSnapshotStale (issue: n8n-snapshot label)

Symptom: A GitHub issue with label n8n-snapshot is open, meaning the nightly n8n workflow-snapshot job (n8n-workflow-snapshot.yml, 04:00 UTC, hstgr runner) failed. The workflow comments on the existing issue rather than opening duplicates.

# 1. Check workflow runs:
gh run list --workflow=n8n-workflow-snapshot.yml --repo radieu/p24-infra --limit 5
 
# 2. Inspect the latest failed run:
gh run view <RUN_ID> --log-failed --repo radieu/p24-infra
 
# 3. Common causes:
#    a. N8N_API_KEY_HSTGR expired -> rotate via sops (spec 03), then re-sync
#    b. n8n container down on vps-h1 -> ssh root@72.60.32.61 'docker ps | grep n8n'
#    c. hstgr self-hosted runner offline -> see "GH Runner hstgr" section above

After a successful re-run posts a green snapshot commit, close the n8n-snapshot issue manually.


Alert: AnsibleDriftDetected

Symptom: Weekly ansible-drift.yml workflow (Mondays 06:00 UTC) opens a drift-detected issue. The live state of one or more VPSes diverges from the declared Ansible state.

# 1. Open the drift-detected GitHub issue and look at the failed run's --diff output.
gh issue list --label drift-detected --state open
# Follow the "Run:" URL in the issue body to the workflow logs.
 
# 2. Decide: is the drift INTENTIONAL (someone made a manual change that needs codification)
#    or UNINTENDED (manual change should be reverted)?
 
# Intentional case — codify the new state:
cd ansible
# edit the relevant role to reflect the new state
ansible-playbook playbooks/<host>.yml --check --diff   # should now show zero diff
git commit -am "ansible: codify <change> on <host>"   # -a stages the edited role files
 
# Unintended case — re-converge to declared state:
cd ansible
ansible-playbook playbooks/<host>.yml --diff           # APPLIES — re-converges to declared state
# Investigate WHO made the manual change. Document in the closed issue.
 
# 3. Close the drift issue with a comment linking the converging commit/PR.
gh issue close <N> --comment "Resolved by <SHA>: <intentional codification | re-applied playbook>"

Common drift causes:

  • Someone ran apt install / dnf install directly on a VPS — back-port to role common or docker.
  • A package auto-updated (e.g. docker-ce minor version bump) — usually safe to re-converge.
  • Cron entry edited by hand on the VPS — back-port to roles/claude-runner/templates/claude-nightly.sh.j2 or the cron task that owns it.

Never suppress drift detection by tagging out the divergent role — fix the source of truth instead.


Alert: SSHBruteForceSurge

Symptom: fail2ban has banned more than 5 distinct IPs from the sshd jail within the last hour (suggests a targeted brute-force campaign rather than the usual background internet noise).

# 1. Confirm scale on the affected VPS
ssh root@<vps-ip>
sudo fail2ban-client status sshd
# Look at the "Total banned" and "Banned IP list" lines.
 
# 2. Check raw auth log to characterise the attack
#    - Ubuntu (vps-h1):
sudo tail -200 /var/log/auth.log | grep -E "Failed|Invalid"
#    - AlmaLinux (vps-i1):
sudo tail -200 /var/log/secure | grep -E "Failed|Invalid"
 
# 3. Look for patterns: same username (e.g. "root"), same IP block, geographic clustering.
#    If the attack is hitting non-existent users (admin, ubuntu, test) it's a generic scanner —
#    leave fail2ban to handle it. If it's hitting actual usernames (claude-admin, claude-runner)
#    that's targeted — tighten the jail.
 
# 4. Tighten the jail temporarily (lower maxretry, longer bantime).
#    Edit ansible/roles/common/defaults/main.yml — bump fail2ban_bantime to e.g. 86400 (24h)
#    and fail2ban_maxretry to 3. Run --check --diff then --diff.
 
# 5. If a specific subnet is hammering, drop it at the firewall layer:
#    - vps-h1: sudo ufw insert 1 deny from <CIDR>
#    - vps-i1: sudo firewall-cmd --add-rich-rule='rule family=ipv4 source address=<CIDR> drop' --permanent && sudo firewall-cmd --reload
 
# 6. After the storm passes, revert the temporary tighter thresholds via the same Ansible flow.

If brute-force activity persists at high volume, accelerate spec 09 Phase 2 (CF Access SSH tunnel) — closing port 22 to the public internet ends the problem entirely.
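
Step 3 (looking for patterns) is easier with a per-source tally of the auth log. This is a sketch under assumptions: top_ssh_sources is a hypothetical name, and it relies on the usual sshd message shape in which the source address follows the word "from".

```shell
# Sketch (assumption): read sshd log lines from stdin and print
# "<count> <ip>" for failed or invalid attempts, busiest source first.
top_ssh_sources() {
  awk '/Failed|Invalid/ {
         # The source address is the field right after the word "from".
         for (i = 1; i < NF; i++)
           if ($i == "from") { n[$(i + 1)]++; break }
       }
       END { for (ip in n) print n[ip], ip }' | sort -rn
}
```

For example: sudo tail -2000 /var/log/secure | top_ssh_sources | head (use /var/log/auth.log on vps-h1). A handful of addresses from one block with high counts suggests a subnet worth dropping at the firewall.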


Procedure: emergency SSH lockout recovery

Use when the daemon was restarted with a config that locks out all key-based access (e.g. accidentally set AllowUsers to a non-existent user, or PermitRootLogin no before claude-admin was working). The Ansible role uses validate: 'sshd -t -f %s' to prevent syntactically invalid configs from ever being written, but a valid config can still lock you out logically.

# 1. Open the provider Cloud Console (NOT SSH — that's broken).
#    - IONOS (vps-i1): https://my.ionos.com → VPS → 217.154.82.162 → KVM console / Remote Access
#    - Hostinger (vps-h1): hPanel → VPS → Server → Browser terminal
 
# 2. Log in as root at the local TTY. Local console isn't SSH-gated; even if SSH password
#    auth is disabled, the root password (or VPS-provider-set console password) still works
#    here. If you don't have the root password, reset it via the provider's panel.
 
# 3. Restore the previous sshd_config from the Ansible-managed backup:
ls -lt /etc/ssh/sshd_config*.bak    # find the most recent
cp /etc/ssh/sshd_config.<timestamp>.bak /etc/ssh/sshd_config
sshd -t                              # validate before restart
systemctl restart sshd
 
# 4. From your workstation, verify SSH works again:
ssh -i ~/.ssh/id_ed25519 root@<vps-ip>
 
# 5. Open a GitHub issue documenting WHY the lockout happened. Likely root causes:
#    - host_vars override referenced a user that doesn't exist on the host yet
#    - a custom Match block was added that excludes the operator's IP
#    - claude-admin role didn't run (or claude_admin_user_enabled is still false) before sshd hardening
 
# 6. Once the offending Ansible state is fixed, re-run with --check --diff to confirm zero diff,
#    then with --diff to re-apply intended hardening — keeping a safety SSH session open.

Procedure: unban an IP

Used after a known-good IP gets caught by fail2ban (e.g. you forgot which key to use and burned through MaxAuthTries) or to clean up after a deliberate brute-force test.

ssh root@<vps-ip>
 
# 1. Confirm the IP is banned
sudo fail2ban-client status sshd
# Look for the IP under "Banned IP list".
 
# 2. Unban it
sudo fail2ban-client unban <ip>
 
# 3. Confirm removal
sudo fail2ban-client status sshd
# The IP should no longer appear in the Banned IP list.
 
# 4. Or unban *all* IPs at once (use with care: this clears every jail, not only sshd):
sudo fail2ban-client unban --all

The ban is also dropped automatically when bantime expires (1h by default — see fail2ban_bantime in ansible/roles/common/defaults/main.yml).