Hostinger VPS — Operations Runbook
Server: 72.60.32.61 (vps-h1, hostname srv1072950.hstgr.cloud, Ubuntu 24.04.3 LTS, 8 GB RAM, 2 vCPU AMD EPYC 9354P)
SSH: ssh -i C:\Users\konar\.ssh\id_ed25519 root@72.60.32.61 (Windows) — or ssh -i ~/.ssh/id_ed25519 root@72.60.32.61 (Linux)
No claude-admin scoped user exists on this host yet (only on IONOS). All ops below assume root via the local id_ed25519 key. UFW is currently inactive, so exposure is gated only at the Docker port-binding layer (127.0.0.1:*) and Traefik labels.
Compose root: /root/docker-compose.yml (tracked at hostinger/docker-compose.yml).
Services on this host: root-traefik-1, root-n8n-1, waha, claude-proxy, root-node-exporter-1, root-cadvisor-1.
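Because UFW is inactive, the port-binding claim above is worth checking rather than assuming. A minimal sketch, assuming `docker ps` prints ports in its usual `host:port->container/proto` form; `list_public_bindings` is a hypothetical helper name, not part of Docker or this repo. It prints every container line that publishes a port on a non-loopback address.

```shell
# Reads lines of "<name> <ports>" on stdin, the shape produced by
#   docker ps --format '{{.Names}} {{.Ports}}'
# and prints only the lines that publish a port on 0.0.0.0 or the IPv6 wildcard.
# list_public_bindings is a hypothetical helper name.
list_public_bindings() {
  # "|| true" keeps the exit status 0 when nothing matches (grep returns 1 then)
  grep -E '(0\.0\.0\.0:|\[::\]:|:::)[0-9]+(-[0-9]+)?->' || true
}

# Usage on the host:
#   docker ps --format '{{.Names}} {{.Ports}}' | list_public_bindings
```

Traefik is presumably expected in the output, since it serves public traffic. Any other container listed is reachable from outside while UFW stays inactive.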
Alert: ServerDown
Symptom: Prometheus ServerDown alert fires for instance="72.60.32.61:9100". Node exporter unreachable.
# 1. Ping the server
ping 72.60.32.61
# 2. Try SSH
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
# 3. If no SSH — log into Hostinger panel and check VPS state
# https://hpanel.hostinger.com → VPS → srv1072950.hstgr.cloud
# Use the panel's serial/recovery console to inspect kernel state.
# 4. If the VPS is running but SSH is blocked: check the firewall from the panel console (UFW should be inactive)
ufw status verbose
iptables -L -n | head -40
# 5. Restart node-exporter container if VPS is up but Prometheus can't scrape
cd /root && docker compose ps node-exporter
cd /root && docker compose restart node-exporter
curl -s http://127.0.0.1:9100/metrics | head -5 # sanity check
Alert: ContainerCrashLooping / ContainerHighRestarts
Symptom: Prometheus fires ContainerHighRestarts (>3 restarts in 1h) or ContainerCrashLooping (>5 restarts in 30m) for a container on vps-h1.
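To see which container is over a threshold without waiting for Prometheus, the die events can be tallied per container. A rough sketch: `deaths_over` is a hypothetical helper name, and a "die" event is only an approximation of the restart counter the alert uses.

```shell
# Reads one container name per line on stdin (one line per "die" event) and
# prints "<name> <count>" for every container that died more than $1 times.
# deaths_over is a hypothetical helper name.
deaths_over() {
  threshold=$1
  sort | uniq -c | awk -v t="$threshold" '$1 > t { print $2, $1 }'
}

# Usage on the host (--until makes docker events exit instead of streaming):
#   docker events --filter type=container --filter event=die \
#     --since 1h --until "$(date +%s)" \
#     --format '{{.Actor.Attributes.name}}' | deaths_over 3
```

Use `--since 30m` with `deaths_over 5` to mirror the ContainerCrashLooping threshold.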
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
# Identify the crashing container
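docker ps -a --format "table {{.Names}}\t{{.Status}}"
# RestartCount is not a `docker ps` field; read it from `docker inspect` instead
docker inspect --format '{{.Name}} {{.RestartCount}}' $(docker ps -aq)
# --until makes `docker events` print the past hour and exit instead of streaming
docker events --filter type=container --filter event=die --since 1h --until "$(date +%s)"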
# Generic recovery: tail logs, inspect health, then restart
docker logs <container-name> --tail 100
docker inspect <container-name> | python3 -c "import json,sys; d=json.load(sys.stdin); print(d[0]['State'])"
cd /root && docker compose restart <service-name>
root-traefik-1 crashing
cd /root
docker compose logs traefik --tail 100
docker compose logs traefik --tail 200 | grep -iE "acme|cert|error|panic"
# Common cause: ACME / Let's Encrypt rate-limit or DNS failure on cert renew.
docker volume inspect traefik_data
docker run --rm -v traefik_data:/data alpine ls -la /data # inspect acme.json
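# If acme.json is corrupted: back it up, reset it, and restart.
# Skip the reset if the logs show a Let's Encrypt rate limit; an empty acme.json only triggers more issuance attempts against the same limit.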
docker compose stop traefik
docker run --rm -v traefik_data:/data alpine sh -c 'cp /data/acme.json /data/acme.json.bak && : > /data/acme.json && chmod 600 /data/acme.json'
docker compose up -d traefik
# Watch first 60s of logs to confirm cert issuance:
docker compose logs -f traefik
root-n8n-1 crashing
cd /root
docker compose logs n8n --tail 100
docker compose logs n8n --tail 200 | grep -iE "error|fatal|database"
# n8n stores its SQLite DB + creds in the n8n_data volume. Healthcheck hits /healthz.
docker inspect root-n8n-1 --format '{{json .State.Health}}'
# Restart
docker compose restart n8n
sleep 10 && curl -fsS http://127.0.0.1:5678/healthz
# If n8n won't start due to a corrupt workflow JSON / migration:
# 1. Take a volume snapshot before touching anything.
docker run --rm -v n8n_data:/src -v /root/backups:/dst alpine \
tar czf /dst/n8n_data-$(date +%F-%H%M).tar.gz -C /src .
# 2. Then re-run with N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true
#    or roll back the offending workflow from the n8n UI.
waha crashing
See Alert: WAHAContainerDown below for the full procedure.
claude-proxy crashing
claude-proxy is an HTTP wrapper around the claude CLI listening on 127.0.0.1:9999. Used by n8n workflows. If it crashes, OAuth tokens or the CLI binary are the usual suspects.
# It's a separate compose project, not in /root/docker-compose.yml.
docker ps --filter name=claude-proxy
docker logs claude-proxy --tail 100
# Check the underlying claude CLI auth (uses /home/claude-runner/.claude/.credentials.json)
docker exec claude-proxy claude --version
docker exec claude-proxy ls -la /home/claude-runner/.claude/
# Restart container
docker restart claude-proxy
sleep 3 && curl -fsS http://127.0.0.1:9999/health || curl -fsS http://127.0.0.1:9999/
# If logs show "Invalid bearer token" / "OAuth expired" — refresh tokens locally:
# python d:\tmp\reauth-hstgr.py
# then re-copy /home/claude-runner/.claude/.credentials.json onto the VPS and:
docker restart claude-proxy
Alert: LowDisk
Symptom: Disk free < 15% on /. Hostinger plan disks are small — n8n binary executions and WAHA media downloads are the usual offenders.
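The 15% threshold can be checked by hand from `df`. A small sketch, assuming POSIX `df -P` output (six columns, Capacity in the fifth); `disk_free_pct` is a hypothetical helper name, and the figure only approximates whatever filesystem metric the alert rule uses.

```shell
# Reads `df -P <mountpoint>` output on stdin and prints the free percentage
# (100 minus the Capacity column). disk_free_pct is a hypothetical helper name.
disk_free_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Usage on the host:
#   df -P / | disk_free_pct
```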
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
# Where is the space going?
df -h /
du -sh /var/lib/docker/* 2>/dev/null | sort -h | tail -15
du -sh /root/* 2>/dev/null | sort -h | tail -15
du -sh /local-files/* 2>/dev/null | sort -h | tail -15 # mounted into n8n
# Clean Docker (stopped containers, unused networks, dangling images, build cache; volumes are not touched)
docker system prune -f
docker image prune -af --filter "until=168h"
# Truncate n8n's binary-data and old execution rows (n8n stores SQLite at /home/node/.n8n/database.sqlite)
docker exec root-n8n-1 du -sh /home/node/.n8n
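# Trim executions older than 14 days (336 h) via n8n CLI.
# Confirm the subcommand exists in the installed version first (docker exec root-n8n-1 n8n --help);
# if it does not, set EXECUTIONS_DATA_PRUNE=true and EXECUTIONS_DATA_MAX_AGE=336 on the n8n service instead.
docker exec root-n8n-1 n8n executionsTrim --hours 336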
# Clean system journal
journalctl --vacuum-time=7d
# Check WAHA session storage size (rarely large: it holds sessions only, not media)
docker run --rm -v root_waha_sessions:/s alpine du -sh /s
Alert: HighMemory (>85%)
Symptom: (1 - MemAvailable/MemTotal) * 100 > 85 for 5m. With only 8 GB RAM, n8n + WAHA chromium-free engine + cadvisor get tight under load.
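The alert expression can be reproduced on the host from /proc/meminfo, which helps when Prometheus itself is unreachable. A minimal sketch; `mem_used_pct` is a hypothetical helper name.

```shell
# Prints (1 - MemAvailable/MemTotal) * 100 with one decimal, read from a
# meminfo-format file (default: /proc/meminfo).
# mem_used_pct is a hypothetical helper name.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", (1 - avail / total) * 100 }' "${1:-/proc/meminfo}"
}

# Usage on the host:
#   mem_used_pct
```

A value above 85 that holds for five minutes corresponds to the alert condition.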
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
# Top consumers
free -h
ps aux --sort=-%mem | head -15
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}"
# n8n is usually the biggest single consumer. If it's leaking:
docker compose -f /root/docker-compose.yml restart n8n
# Add a memory limit (edit /root/docker-compose.yml under the n8n service):
#   deploy:
#     resources:
#       limits:
#         memory: 2g
# Then: docker compose up -d n8n
# If WAHA NOWEB engine balloons (rare — NOWEB is light):
docker compose restart waha
Alert: HighCPU (>80% for 5m)
Symptom: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80. Only 2 vCPUs on this host so spikes are easy.
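The same busy percentage can be estimated on the host from two /proc/stat samples. A sketch under the assumption that only the idle column counts as idle, matching mode="idle" in the expression; `cpu_busy_pct` is a hypothetical helper name, and a few seconds of sampling is much noisier than the 5m rate.

```shell
# Takes two aggregate "cpu" lines from /proc/stat, sampled some seconds apart,
# and prints the busy percentage over that interval:
#   100 - idle_delta / total_delta * 100
# cpu_busy_pct is a hypothetical helper name.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total = 0
      for (i = 2; i <= 9; i++) total += $i   # user, nice, system, idle, iowait, irq, softirq, steal
      idle = $5 }
    NR == 1 { t0 = total; i0 = idle }
    NR == 2 { dt = total - t0
              if (dt > 0) printf "%.1f\n", 100 - (idle - i0) / dt * 100 }'
}

# Usage on the host:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$a" "$b"
```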
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
top -b -n1 | head -25
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Usual suspects on vps-h1:
# - n8n workflow stuck in a tight loop / runaway recursion
# - waha doing an initial sync after restart
# - cadvisor scraping itself at high cardinality
# If a single n8n execution is the cause, kill it from the UI or via CLI:
docker exec root-n8n-1 n8n executionsList --limit 5
# (find the running execution ID, then stop it via the UI)
# As a last resort, restart the offender:
docker compose -f /root/docker-compose.yml restart n8n
Alert: GH Runner offline (hstgr-srv1072950)
Symptom: runner-hstgr shows as offline in the health-check CI workflow.
The Hostinger runner registers against radieu/et-operational-platform with label hstgr. It runs as a systemd service (not a container) under user claude-runner.
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
# Check service status
systemctl status actions.runner.radieu-et-operational-platform.hstgr-srv1072950.service
# View recent logs
journalctl -u actions.runner.radieu-et-operational-platform.hstgr-srv1072950.service --since "1h ago" | tail -50
# Restart the service
systemctl restart actions.runner.radieu-et-operational-platform.hstgr-srv1072950.service
# If "Repository not found" / "401 Unauthorized": the runner's registration is no longer valid.
# Generate fresh tokens and re-register. Removal needs a remove-token, not the old registration token:
gh api -X POST repos/radieu/et-operational-platform/actions/runners/remove-token --jq .token
gh api -X POST repos/radieu/et-operational-platform/actions/runners/registration-token --jq .token
# Then as claude-runner:
sudo -u claude-runner bash -c 'cd ~/actions-runner && ./config.sh remove --token <REMOVE_TOKEN>; ./config.sh --url https://github.com/radieu/et-operational-platform --token <NEW_TOKEN> --labels hstgr --unattended'
systemctl restart actions.runner.radieu-et-operational-platform.hstgr-srv1072950.service
Alert: WAHAContainerDown
Symptom: Alert WAHAContainerDown fires (container waha not seen by cAdvisor for >2 min). WhatsApp message collection stops; n8n wa-router webhook stops receiving events; Supabase whatsapp_messages table no longer fills.
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
# 1. Confirm container state
docker ps -a | grep waha
docker logs waha --tail 100
# 2. Try a plain restart first
docker start waha
sleep 5 && docker logs waha --tail 30
# 3. If still failing — full compose-managed restart
cd /root
docker compose up -d waha
docker compose logs waha --tail 50
# 4. Verify the API is up and session is WORKING (not SCAN_QR_CODE)
WAHA_API_KEY=$(grep ^WAHA_API_KEY /root/.env | cut -d= -f2-) # -f2- keeps keys that contain "="
curl -s -H "X-Api-Key: $WAHA_API_KEY" http://127.0.0.1:13000/api/sessions | python3 -m json.tool
# 5. If session.status is SCAN_QR_CODE — the auth was lost.
# Session storage volume:
docker volume inspect root_waha_sessions
# Losing this volume means RE-SCANNING THE QR CODE on the physical phone
# (DE number +49 1578 5573196). Do NOT delete the volume without confirming
# physical access to the device first.
# 6. Get the QR code (only if session is in SCAN_QR_CODE state)
curl -s -H "X-Api-Key: $WAHA_API_KEY" http://127.0.0.1:13000/api/screenshot -o /tmp/qr.png
# Then scp /tmp/qr.png to local and scan from WhatsApp → Linked Devices.
# 7. Verify webhook is reaching n8n after recovery
docker compose logs n8n --tail 50 | grep wa-router
Alert: WAHAHighRestarts
Symptom: Alert WAHAHighRestarts fires (>2 restarts in 1h). WAHA is functional but flapping.
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
# 1. Look at the restart pattern and exit codes
docker inspect waha --format '{{.RestartCount}} restarts, last exit code {{.State.ExitCode}}'
docker logs waha --tail 200 | grep -iE "fatal|panic|oom|killed|error"
# 2. Check for OOM kills (8 GB RAM is tight when n8n is also busy)
dmesg -T | grep -i "out of memory" | tail -5
docker stats waha --no-stream
# 3. Check webhook backpressure — if n8n is slow, WAHA may time out and crash
curl -fsS http://127.0.0.1:5678/healthz
# 4. If OOM is the cause, add a memory limit (edit /root/docker-compose.yml waha service):
#   deploy:
#     resources:
#       limits:
#         memory: 1500m
docker compose up -d waha
# 5. If symptoms persist, capture logs and inspect output for a bug report before any further restart:
TS=$(date +%F-%H%M)
docker logs waha > /root/waha-debug-$TS.log 2>&1
docker inspect waha > /root/waha-inspect-$TS.json
Alert: n8n workflow execution failure spike (alert TBD)
Symptom: No Prometheus alert exists for this yet (see follow-up issue — alert spec belongs in its own PR). Trigger today is manual: user reports workflows silently failing, or the n8n UI Executions panel shows a wall of red.
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
# 1. List recent failed executions (CLI is faster than UI under load)
docker exec root-n8n-1 n8n executionsList --status error --limit 20
# 2. Inspect a specific failure
docker exec root-n8n-1 n8n executionsGet --id <EXECUTION_ID> | python3 -m json.tool | head -100
# 3. Check container-level errors (DB locks, webhook timeouts, OOM)
docker logs root-n8n-1 --tail 200 | grep -iE "error|failed|timeout|lock"
# 4. Common causes & quick fixes:
# a) SQLite lock — restart n8n
docker compose -f /root/docker-compose.yml restart n8n
# b) Webhook target down (wa-router, claude-proxy) — verify
curl -fsS http://127.0.0.1:9999/ ; curl -fsS http://127.0.0.1:13000/health
# c) Disk full preventing DB writes — see LowDisk section above
df -h /
# 5. Re-run failed executions in batch from the UI:
# n8n.vps-h1.infra.zintegrowana.online → Executions → filter Failed → Retry
# 6. If a specific workflow is the source of the spike, disable it from the UI
#    until the upstream service is fixed. Re-enable once verified.
Follow-up: Spec a Prometheus alert based on the n8n metrics endpoint (/metrics if enabled, or a scraper that polls executionsList --status error); track in a separate issue.
Disaster recovery — restore from backup
Status — 2026-05-12: spec 01-backups.md is not yet landed. There are no automated, off-host backups of vps-h1 state. If this VPS dies, recovery means rebuilding from scratch.
This section is a placeholder. Update it (and remove this warning) as part of the spec-01 PR.
Current state (no spec-01 backups)
Data at risk if vps-h1 disk dies:
| Volume | Contents | Recoverable? |
|---|---|---|
| traefik_data | Let’s Encrypt acme.json | Yes, re-issued on first boot |
| n8n_data | All workflows, credentials, executions, SQLite DB | No, total loss |
| root_waha_sessions | WhatsApp NOWEB session for +49 1578 5573196 | No, re-scan QR on physical phone |
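Until spec-01 lands, the two unrecoverable volumes can at least be snapshotted by hand with the same tar pattern used in the n8n section. A stopgap sketch only: `snapshot_cmds` is a hypothetical helper name, and it prints the commands rather than running them. The archives land on the same disk, so they do not cover disk loss unless copied off-host afterwards.

```shell
# Prints (does not run) one tar-snapshot command per volume name given,
# following the n8n_data snapshot pattern used earlier in this runbook.
# The source volume is mounted read-only. snapshot_cmds is a hypothetical helper name.
snapshot_cmds() {
  stamp=$(date +%F-%H%M)
  for vol in "$@"; do
    printf 'docker run --rm -v %s:/src:ro -v /root/backups:/dst alpine tar czf /dst/%s-%s.tar.gz -C /src .\n' \
      "$vol" "$vol" "$stamp"
  done
}

# Usage on the host: review the output, then pipe it to sh.
#   snapshot_cmds n8n_data root_waha_sessions
#   snapshot_cmds n8n_data root_waha_sessions | sh
```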
Manual rebuild procedure (current best-effort)
# 1. Provision a replacement VPS (Hostinger panel or use the provision-vps skill)
# Note the new IP.
# 2. From your workstation, run the provisioning workflow:
gh workflow run provision-new-vps.yml --repo radieu/p24-infra \
-f vps_ip=<NEW_IP> -f vps_label=vps-h1-new
# 3. Update DNS to point the wildcard at the new IP:
python3 /opt/p24-infra/scripts/dns-manager.py upsert "*.vps-h1.infra.zintegrowana.online" <NEW_IP>
# 4. Bring up the Hostinger stack from the repo (no data, fresh start):
ssh -i ~/.ssh/id_ed25519 root@<NEW_IP> \
"git clone https://github.com/radieu/p24-infra /opt/p24-infra && \
cp /opt/p24-infra/hostinger/docker-compose.yml /root/docker-compose.yml && \
cd /root && docker compose up -d"
# 5. n8n: import workflows manually from git history or n8n.cloud backup.
# There is NO automated way today — this is the gap spec-01 closes.
# 6. WAHA: re-pair WhatsApp.
# Get the QR via the WAHAContainerDown procedure above.
# Physical phone (+49 1578 5573196) must be available.
# 7. claude-proxy: copy /home/claude-runner/.claude/.credentials.json from
#    your local workstation onto the new VPS (see CLAUDE.md "Provisioning new VPS").
After spec-01 lands
Replace this section with the real restore drill:
- restore n8n_data from off-host backup (Wasabi / Backblaze)
- restore root_waha_sessions to avoid QR re-scan
- verify by running the spec-01 acceptance test (n8n boots with all workflows present, test workflow executes successfully)
Routine maintenance
Monthly OS updates (Ubuntu 24.04)
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
apt update && apt list --upgradable
apt upgrade -y
apt autoremove -y
[ -f /var/run/reboot-required ] && reboot # only when an update (e.g. a new kernel) requires it
Docker image updates
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
cd /root && docker compose pull && docker compose up -d
docker image prune -af
Sanity check after any restart
# All compose services healthy?
cd /root && docker compose ps
# Public endpoints reachable?
curl -fsS -o /dev/null -w "n8n %{http_code}\n" https://n8n.vps-h1.infra.zintegrowana.online/healthz
curl -fsS -o /dev/null -w "waha %{http_code}\n" https://waha2.vps-h1.infra.zintegrowana.online/api/sessions \
  -H "X-Api-Key: $(grep ^WAHA_API_KEY /root/.env | cut -d= -f2-)"
# claude-proxy internal
curl -fsS http://127.0.0.1:9999/ | head -5