Hostinger VPS — Operations Runbook

Server: 72.60.32.61 (vps-h1, hostname srv1072950.hstgr.cloud, Ubuntu 24.04.3 LTS, 8 GB RAM, 2 vCPU AMD EPYC 9354P)

SSH: ssh -i C:\Users\konar\.ssh\id_ed25519 root@72.60.32.61 (Windows) or ssh -i ~/.ssh/id_ed25519 root@72.60.32.61 (Linux)

No claude-admin scoped user exists on this host yet (only on IONOS). All ops below assume root via the local id_ed25519 key. UFW is currently inactive — exposure is gated only at the Docker port-binding layer (127.0.0.1:*) and Traefik labels.

Compose root: /root/docker-compose.yml (tracked at hostinger/docker-compose.yml). Services on this host: root-traefik-1, root-n8n-1, waha, claude-proxy, root-node-exporter-1, root-cadvisor-1.

Alert: ServerDown

Symptom: Prometheus ServerDown alert fires for instance="72.60.32.61:9100". Node exporter unreachable.

# 1. Ping the server
ping 72.60.32.61
 
# 2. Try SSH
ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
# 3. If no SSH — log into Hostinger panel and check VPS state
#    https://hpanel.hostinger.com → VPS → srv1072950.hstgr.cloud
#    Use the panel's serial/recovery console to inspect kernel state.
 
# 4. If VPS running but SSH blocked: from the panel console, check the firewall (UFW should be inactive)
ufw status verbose
iptables -L -n | head -40
 
# 5. Restart node-exporter container if VPS is up but Prometheus can't scrape
cd /root && docker compose ps node-exporter
cd /root && docker compose restart node-exporter
curl -s http://127.0.0.1:9100/metrics | head -5    # sanity check

Alert: ContainerCrashLooping / ContainerHighRestarts

Symptom: Prometheus fires ContainerHighRestarts (>3 restarts in 1h) or ContainerCrashLooping (>5 restarts in 30m) for a container on vps-h1.

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
# Identify the crashing container
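# docker ps has no restart-count field; read it from docker inspect instead
docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.State.Status}} restarts={{.RestartCount}}'
# --until makes docker events print the last hour and exit instead of streaming
docker events --filter type=container --filter event=die --since 1h --until "$(date +%s)"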
 
# Generic recovery: tail logs, inspect health, then restart
docker logs <container-name> --tail 100
docker inspect <container-name> | python3 -c "import json,sys; d=json.load(sys.stdin); print(d[0]['State'])"
cd /root && docker compose restart <service-name>
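
The two thresholds in the symptom line are easy to mix up when counting restarts by hand. The sketch below maps a pair of counts onto the alert names; `classify_restarts` is a hypothetical helper (not part of the repo), and the thresholds are the ones quoted above.

```shell
# classify_restarts RESTARTS_30M RESTARTS_1H
# Hypothetical helper: maps restart counts onto the two alert names using the
# thresholds from the symptom line (>5 in 30m, >3 in 1h).
classify_restarts() {
  r30=$1
  r60=$2
  if [ "$r30" -gt 5 ]; then
    echo "ContainerCrashLooping"
  elif [ "$r60" -gt 3 ]; then
    echo "ContainerHighRestarts"
  else
    echo "ok"
  fi
}

# Usage on the host (counts "die" events for one container per window):
#   now=$(date +%s)
#   r30=$(docker events --filter container=waha --filter event=die --since 30m --until "$now" | wc -l)
#   r60=$(docker events --filter container=waha --filter event=die --since 1h  --until "$now" | wc -l)
#   classify_restarts "$r30" "$r60"
```

Counting `die` events only approximates what Prometheus sees through cAdvisor, so treat the result as a hint rather than a replay of the alert rule.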

root-traefik-1 crashing

cd /root
docker compose logs traefik --tail 100
docker compose logs traefik --tail 200 | grep -iE "acme|cert|error|panic"
 
# Common cause: ACME / Let's Encrypt rate-limit or DNS failure on cert renew.
docker volume inspect traefik_data
docker run --rm -v traefik_data:/data alpine ls -la /data           # inspect acme.json
 
# If acme.json is corrupted or rate-limited — back it up and restart:
docker compose stop traefik
docker run --rm -v traefik_data:/data alpine sh -c 'cp /data/acme.json /data/acme.json.bak && : > /data/acme.json && chmod 600 /data/acme.json'
docker compose up -d traefik
# Watch first 60s of logs to confirm cert issuance:
docker compose logs -f traefik

root-n8n-1 crashing

cd /root
docker compose logs n8n --tail 100
docker compose logs n8n --tail 200 | grep -iE "error|fatal|database"
 
# n8n stores its SQLite DB + creds in the n8n_data volume. Healthcheck hits /healthz.
docker inspect root-n8n-1 --format '{{json .State.Health}}'
 
# Restart
docker compose restart n8n
sleep 10 && curl -fsS http://127.0.0.1:5678/healthz
 
# If n8n won't start due to a corrupt workflow JSON / migration:
# 1. Take a volume snapshot before touching anything.
docker run --rm -v n8n_data:/src -v /root/backups:/dst alpine \
  tar czf /dst/n8n_data-$(date +%F-%H%M).tar.gz -C /src .
# 2. Then re-run with N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true
#    or roll back the offending workflow from the n8n UI.
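
The snapshot in step 1 only helps if the archive can actually be read back. A minimal check, assuming the SQLite file sits at the top of the volume and is therefore archived as `./database.sqlite`; `verify_snapshot` is a hypothetical helper name.

```shell
# verify_snapshot ARCHIVE EXPECTED_MEMBER
# Hypothetical helper: succeeds only if the gzip'd tarball can be listed end to
# end and contains the expected member (e.g. ./database.sqlite).
verify_snapshot() {
  archive=$1
  member=$2
  [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
  listing=$(tar tzf "$archive") || { echo "unreadable archive: $archive" >&2; return 1; }
  printf '%s\n' "$listing" | grep -qxF -- "$member" || {
    echo "member not found: $member" >&2
    return 1
  }
  echo "ok: $archive"
}

# Usage on the host:
#   verify_snapshot /root/backups/n8n_data-<timestamp>.tar.gz ./database.sqlite
```

This confirms the archive is readable and complete enough to list. It does not prove the SQLite file inside is consistent, since the snapshot is taken from a volume that n8n may still be writing to.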

waha crashing

See Alert: WAHAContainerDown below for the full procedure.

claude-proxy crashing

claude-proxy is an HTTP wrapper around the claude CLI listening on 127.0.0.1:9999. Used by n8n workflows. If it crashes, OAuth tokens or the CLI binary are the usual suspects.

# It's a separate compose project, not in /root/docker-compose.yml.
docker ps --filter name=claude-proxy
docker logs claude-proxy --tail 100
 
# Check the underlying claude CLI auth (uses /home/claude-runner/.claude/.credentials.json)
docker exec claude-proxy claude --version
docker exec claude-proxy ls -la /home/claude-runner/.claude/
 
# Restart container
docker restart claude-proxy
sleep 3 && curl -fsS http://127.0.0.1:9999/health || curl -fsS http://127.0.0.1:9999/
 
# If logs show "Invalid bearer token" / "OAuth expired" — refresh tokens locally:
#   python d:\tmp\reauth-hstgr.py
# then re-copy /home/claude-runner/.claude/.credentials.json onto the VPS and:
docker restart claude-proxy

Alert: LowDisk

Symptom: Disk free < 15% on /. Hostinger plan disks are small — n8n binary executions and WAHA media downloads are the usual offenders.

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
# Where is the space going?
df -h /
du -sh /var/lib/docker/* 2>/dev/null | sort -h | tail -15
du -sh /root/* 2>/dev/null | sort -h | tail -15
du -sh /local-files/* 2>/dev/null | sort -h | tail -15   # mounted into n8n
 
# Clean Docker (stopped containers, unused networks, dangling images, build cache; volumes are kept unless --volumes is passed)
docker system prune -f
docker image prune -af --filter "until=168h"
 
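# Check the size of n8n's data dir (SQLite DB at /home/node/.n8n/database.sqlite plus binary data)
docker exec root-n8n-1 du -sh /home/node/.n8n
# Old executions are pruned by n8n itself, driven by environment variables on the
# n8n service (the documented n8n CLI has no trim command). For a 14-day window:
#   EXECUTIONS_DATA_PRUNE=true
#   EXECUTIONS_DATA_MAX_AGE=336    # hours
# Then: cd /root && docker compose up -d n8n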
 
# Clean system journal
journalctl --vacuum-time=7d
 
# Check the size of WAHA session storage (rarely large: only sessions, not media)
docker run --rm -v root_waha_sessions:/s alpine du -sh /s
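
To confirm whether the host is still under the 15% boundary after cleanup, the free percentage can be derived from `df -P`. A small sketch; `root_free_pct` and `low_disk` are hypothetical helper names, and the figure may differ slightly from the Prometheus expression because `df` rounds its Capacity column.

```shell
# root_free_pct: reads `df -P <mount>` output on stdin and prints the free
# percentage (100 - Capacity) of the last filesystem line.
root_free_pct() {
  awk 'NR > 1 { gsub(/%/, "", $5); free = 100 - $5 } END { print free }'
}

# low_disk FREE_PCT [THRESHOLD]: succeeds when free space is below the
# threshold (default 15, the LowDisk boundary quoted above).
low_disk() {
  [ "$1" -lt "${2:-15}" ]
}

# Usage on the host:
#   free=$(df -P / | root_free_pct)
#   low_disk "$free" && echo "LowDisk still firing: ${free}% free"
```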

Alert: HighMemory (>85%)

Symptom: (1 - MemAvailable/MemTotal) * 100 > 85 for 5m. With only 8 GB RAM, n8n + WAHA chromium-free engine + cadvisor get tight under load.

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
# Top consumers
free -h
ps aux --sort=-%mem | head -15
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}"
 
# n8n is usually the biggest single consumer. If it's leaking:
docker compose -f /root/docker-compose.yml restart n8n
 
# Add a memory limit (edit /root/docker-compose.yml under the n8n service):
#   deploy:
#     resources:
#       limits:
#         memory: 2g
# Then: docker compose up -d n8n
 
# If WAHA NOWEB engine balloons (rare — NOWEB is light):
docker compose restart waha
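
The alert expression can be reproduced on the host from /proc/meminfo, which helps when deciding whether a restart actually brought usage back under 85%. A minimal sketch; `mem_used_pct` is a hypothetical helper name.

```shell
# mem_used_pct: reads /proc/meminfo-style text on stdin and prints
# (1 - MemAvailable/MemTotal) * 100 with one decimal place.
mem_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { print "MemTotal not found" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", (1 - avail / total) * 100
    }'
}

# Usage on the host:
#   mem_used_pct < /proc/meminfo
```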

Alert: HighCPU (>80% for 5m)

Symptom: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80. Only 2 vCPUs on this host so spikes are easy.

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
top -b -n1 | head -25
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
 
# Usual suspects on vps-h1:
#  - n8n workflow stuck in a tight loop / runaway recursion
#  - waha doing an initial sync after restart
#  - cadvisor scraping itself at high cardinality
 
# If a single n8n execution is the cause, stop it from the UI
# (Executions list, filtered to running executions). The documented n8n CLI
# has no command for listing or stopping executions.
 
# As a last resort, restart the offender:
docker compose -f /root/docker-compose.yml restart n8n
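
The alert expression is one minus the idle share of CPU time. The same figure can be approximated on the host from two /proc/stat samples, as in the sketch below; `cpu_busy_pct` is a hypothetical helper name, and the result is an aggregate over both vCPUs for a short window rather than the 5m rate Prometheus uses.

```shell
# cpu_busy_pct "<cpu line at t0>" "<cpu line at t1>"
# Each argument is the aggregate "cpu ..." line from /proc/stat. Prints the
# share of non-idle time between the two samples, as a percentage.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      # Sum user..steal (fields 2-9); guest time is already counted in user.
      for (i = 2; i <= NF && i <= 9; i++) total += $i
      if (NR == 1) { total0 = total; idle0 = $5 } else { total1 = total; idle1 = $5 }
    }
    END {
      dt = total1 - total0
      if (dt <= 0) { print "samples are identical or reversed" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", 100 * (1 - (idle1 - idle0) / dt)
    }'
}

# Usage on the host:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$a" "$b"
```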

Alert: GH Runner offline (hstgr-srv1072950)

Symptom: runner-hstgr shows as offline in the health-check CI workflow.

The Hostinger runner registers against radieu/et-operational-platform with label hstgr. It runs as a systemd service (not a container) under user claude-runner.

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
# Check service status
systemctl status actions.runner.radieu-et-operational-platform.hstgr-srv1072950.service
 
# View recent logs
journalctl -u actions.runner.radieu-et-operational-platform.hstgr-srv1072950.service --since "1h ago" | tail -50
 
# Restart the service
systemctl restart actions.runner.radieu-et-operational-platform.hstgr-srv1072950.service
 
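# If "Repository not found" / "401 Unauthorized": the runner's registration is no
# longer valid. Tokens expire after about an hour, so generate fresh ones:
gh api -X POST repos/radieu/et-operational-platform/actions/runners/remove-token --jq .token         # <REMOVE_TOKEN>
gh api -X POST repos/radieu/et-operational-platform/actions/runners/registration-token --jq .token   # <NEW_TOKEN>
# Then as claude-runner. If config.sh refuses to remove while the service is
# installed, run ./svc.sh stop && ./svc.sh uninstall as root in the runner directory
# first, and ./svc.sh install claude-runner && ./svc.sh start afterwards.
sudo -u claude-runner bash -c 'cd ~/actions-runner && ./config.sh remove --token <REMOVE_TOKEN>; ./config.sh --url https://github.com/radieu/et-operational-platform --token <NEW_TOKEN> --labels hstgr --unattended'
systemctl restart actions.runner.radieu-et-operational-platform.hstgr-srv1072950.service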

Alert: WAHAContainerDown

Symptom: Alert WAHAContainerDown fires (container waha not seen by cAdvisor for >2 min). WhatsApp message collection stops; n8n wa-router webhook stops receiving events; Supabase whatsapp_messages table no longer fills.

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
# 1. Confirm container state
docker ps -a | grep waha
docker logs waha --tail 100
 
# 2. Try a plain restart first (this also starts the container if it is stopped)
docker restart waha
sleep 5 && docker logs waha --tail 30
 
# 3. If still failing — full compose-managed restart
cd /root
docker compose up -d waha
docker compose logs waha --tail 50
 
# 4. Verify the API is up and session is WORKING (not SCAN_QR_CODE)
WAHA_API_KEY=$(grep ^WAHA_API_KEY= /root/.env | cut -d= -f2-)   # -f2- keeps any "=" inside the key
curl -s -H "X-Api-Key: $WAHA_API_KEY" http://127.0.0.1:13000/api/sessions | python3 -m json.tool
 
# 5. If session.status is SCAN_QR_CODE — the auth was lost.
#    Session storage volume:
docker volume inspect root_waha_sessions
#    Losing this volume means RE-SCANNING THE QR CODE on the physical phone
#    (DE number +49 1578 5573196). Do NOT delete the volume without confirming
#    physical access to the device first.
 
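# 6. Get the QR code (only if session is in SCAN_QR_CODE state).
#    The screenshot endpoint depends on a browser-based engine, so with NOWEB use
#    the auth QR endpoint. "default" is an assumed session name; replace it with
#    the name returned in step 4.
curl -s -H "X-Api-Key: $WAHA_API_KEY" "http://127.0.0.1:13000/api/default/auth/qr?format=image" -o /tmp/qr.png
# Then scp /tmp/qr.png to local and scan from WhatsApp → Linked Devices.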
 
# 7. Verify webhook is reaching n8n after recovery
docker compose logs n8n --tail 50 | grep wa-router
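
Step 4 pulls the API key out of /root/.env with grep and cut. A value that contains `=` (common in base64-style keys) or is wrapped in quotes needs a little more care, which the sketch below handles; `env_get` is a hypothetical helper name, and it assumes simple `KEY=value` lines without `export` prefixes or inline comments.

```shell
# env_get KEY FILE
# Hypothetical helper: prints the value of KEY from a KEY=value file. Keeps any
# "=" inside the value and strips one pair of surrounding quotes. If KEY occurs
# more than once, the last assignment wins. Prints nothing if KEY is absent.
env_get() {
  key=$1
  file=$2
  sed -n "s/^${key}=//p" "$file" | tail -n 1 |
    sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Usage on the host:
#   WAHA_API_KEY=$(env_get WAHA_API_KEY /root/.env)
#   [ -n "$WAHA_API_KEY" ] || echo "WAHA_API_KEY not set in /root/.env" >&2
```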

Alert: WAHAHighRestarts

Symptom: Alert WAHAHighRestarts fires (>2 restarts in 1h). WAHA is functional but flapping.

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
# 1. Look at the restart pattern and exit codes
docker inspect waha --format '{{.RestartCount}} restarts, last exit code {{.State.ExitCode}}'
docker logs waha --tail 200 | grep -iE "fatal|panic|oom|killed|error"
 
# 2. Check for OOM kills (8 GB RAM is tight when n8n is also busy)
dmesg -T | grep -i "out of memory" | tail -5
docker stats waha --no-stream
 
# 3. Check webhook backpressure — if n8n is slow, WAHA may time out and crash
curl -fsS http://127.0.0.1:5678/healthz
 
# 4. If OOM is the cause, add a memory limit (edit /root/docker-compose.yml waha service):
#    deploy:
#      resources:
#        limits:
#          memory: 1500m
docker compose up -d waha
 
# 5. If symptoms persist, capture logs and inspect output for a bug report before any further restart:
docker logs waha > /root/waha-debug-$(date +%F-%H%M).log 2>&1
docker inspect waha > /root/waha-inspect-$(date +%F-%H%M).json

Alert: n8n workflow execution failure spike (alert TBD)

Symptom: No Prometheus alert exists for this yet (see follow-up issue — alert spec belongs in its own PR). Trigger today is manual: user reports workflows silently failing, or the n8n UI Executions panel shows a wall of red.

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
 
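# 1. List recent failed executions via the n8n public API (the documented n8n CLI
#    has no executions commands). Assumes the public API is enabled and a key
#    created under Settings → n8n API is exported as N8N_API_KEY.
curl -fsS -H "X-N8N-API-KEY: $N8N_API_KEY" "http://127.0.0.1:5678/api/v1/executions?status=error&limit=20" | python3 -m json.tool
 
# 2. Inspect a specific failure
curl -fsS -H "X-N8N-API-KEY: $N8N_API_KEY" "http://127.0.0.1:5678/api/v1/executions/<EXECUTION_ID>?includeData=true" | python3 -m json.tool | head -100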
 
# 3. Check container-level errors (DB locks, webhook timeouts, OOM)
docker logs root-n8n-1 --tail 200 | grep -iE "error|failed|timeout|lock"
 
# 4. Common causes & quick fixes:
#    a) SQLite lock — restart n8n
docker compose -f /root/docker-compose.yml restart n8n
#    b) Webhook target down (wa-router, claude-proxy) — verify
curl -fsS http://127.0.0.1:9999/ ; curl -fsS http://127.0.0.1:13000/health
#    c) Disk full preventing DB writes — see LowDisk section above
df -h /
 
# 5. Re-run failed executions in batch from the UI:
#    n8n.vps-h1.infra.zintegrowana.online → Executions → filter Failed → Retry
 
# 6. If a specific workflow is the source of the spike, disable it from the UI
#    until the upstream service is fixed. Re-enable once verified.

Follow-up: Spec a Prometheus alert based on the n8n metrics endpoint (/metrics if enabled, or a scraper that polls failed executions, for example through the public API at /api/v1/executions?status=error). Track this in a separate issue.

Disaster recovery — restore from backup

Status — 2026-05-12: spec 01-backups.md is not yet landed. There are no automated, off-host backups of vps-h1 state. If this VPS dies, recovery means rebuilding from scratch.

This section is a placeholder. Update it (and remove this warning) as part of the spec-01 PR.

Current state (no spec-01 backups)

Data at risk if vps-h1 disk dies:

| Volume | Contents | Recoverable? |
|---|---|---|
| `traefik_data` | Let’s Encrypt `acme.json` | Yes, re-issued on first boot |
| `n8n_data` | All workflows, credentials, executions, SQLite DB | No, total loss |
| `root_waha_sessions` | WhatsApp NOWEB session for `+49 1578 5573196` | No, re-scan QR on physical phone |

Manual rebuild procedure (current best-effort)

# 1. Provision a replacement VPS (Hostinger panel or use the provision-vps skill)
#    Note the new IP.
 
# 2. From your workstation, run the provisioning workflow:
gh workflow run provision-new-vps.yml --repo radieu/p24-infra \
   -f vps_ip=<NEW_IP> -f vps_label=vps-h1-new
 
# 3. Update DNS to point the wildcard at the new IP:
python3 /opt/p24-infra/scripts/dns-manager.py upsert "*.vps-h1.infra.zintegrowana.online" <NEW_IP>
 
# 4. Bring up the Hostinger stack from the repo (no data, fresh start):
ssh -i ~/.ssh/id_ed25519 root@<NEW_IP> \
  "git clone https://github.com/radieu/p24-infra /opt/p24-infra && \
   cp /opt/p24-infra/hostinger/docker-compose.yml /root/docker-compose.yml && \
   cd /root && docker compose up -d"
 
# 5. n8n: import workflows manually from git history or n8n.cloud backup.
#    There is NO automated way today — this is the gap spec-01 closes.
 
# 6. WAHA: re-pair WhatsApp.
#    Get the QR via the WAHAContainerDown procedure above.
#    Physical phone (+49 1578 5573196) must be available.
 
# 7. claude-proxy: copy /home/claude-runner/.claude/.credentials.json from
#    your local workstation onto the new VPS (see CLAUDE.md "Provisioning new VPS").

After spec-01 lands

Replace this section with the real restore drill:

- Restore n8n_data from off-host backup (Wasabi / Backblaze).
- Restore root_waha_sessions to avoid a QR re-scan.
- Verify by running the spec-01 acceptance test (n8n boots with all workflows present, test workflow executes successfully).

Routine maintenance

Monthly OS updates (Ubuntu 24.04)

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
apt update && apt list --upgradable
apt upgrade -y
apt autoremove -y
[ -f /var/run/reboot-required ] && reboot   # only when an update (e.g. a new kernel) requires it

Docker image updates

ssh -i ~/.ssh/id_ed25519 root@72.60.32.61
cd /root && docker compose pull && docker compose up -d
docker image prune -af

Sanity check after any restart

# All compose services healthy?
cd /root && docker compose ps
 
# Public endpoints reachable?
curl -fsS -o /dev/null -w "n8n  %{http_code}\n"  https://n8n.vps-h1.infra.zintegrowana.online/healthz
curl -fsS -o /dev/null -w "waha %{http_code}\n"  https://waha2.vps-h1.infra.zintegrowana.online/api/sessions \
     -H "X-Api-Key: $(grep ^WAHA_API_KEY= /root/.env | cut -d= -f2-)"
 
# claude-proxy internal
curl -fsS http://127.0.0.1:9999/ | head -5
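
Right after a restart these checks can fail simply because a container is still starting. A small retry wrapper avoids false alarms; `retry` is a hypothetical helper name, and the attempt count and delay in the usage lines are illustrative values, not tuned for this host.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or ATTEMPTS is used up.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage on the host (up to 6 tries, 5 seconds apart):
#   retry 6 5 curl -fsS -o /dev/null http://127.0.0.1:5678/healthz
#   retry 6 5 curl -fsS -o /dev/null http://127.0.0.1:9999/
```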
