AI Agent Operations Guide
Scope: Claude Code agents running autonomously on p24-infra VPS and bare-metal servers Last updated: 2026-06-18
1. Agent Fleet Overview
| Label | Host | Provider | OS | IP | vCPU | RAM | Max parallel |
|---|---|---|---|---|---|---|---|
AI-Dev-IO1 | vps-i1 | IONOS | AlmaLinux 9.7 | 217.154.82.162 | 6 | 8 GB | 2–3 |
AI-Dev-HS1 | vps-h1 | Hostinger | Ubuntu 24.04 | 72.60.32.61 | 2 | 8 GB | 1–2 |
AI-Dev-BMS4-1 | bms-4 | OVH Kimsufi | Ubuntu 22.04 | 54.36.123.110 | 8 | 32 GB | 4 |
All agents run as the claude-runner user on their respective hosts. GitHub Actions runner runs
as claude-runner on vps-i1 (IONOS) — it dispatches workflow jobs that invoke Claude Code.
2. Authentication
Claude Code OAuth (all agents)
Claude Code authenticates via OAuth subscription (Claude Max), not via ANTHROPIC_API_KEY.
| Credential | Location on each VPS | Notes |
|---|---|---|
| Access token | /home/claude-runner/.claude/.credentials.json | ~8–12h TTL; auto-refreshed |
| Refresh token | /home/claude-runner/.claude/.credentials.json | Valid for months |
Tokens rotate automatically when Claude Code is actively used. Long-idle agents may expire.
When auth expires
Discord alert fires to #infra-alerts. Re-authenticate by running on the local Windows machine:
python d:\tmp\reauth-hstgr.pyThis refreshes the OAuth token and pushes updated credentials to the affected agent. Do not share or log the token values — reference only by path.
3. Per-Agent SSH and Status
AI-Dev-IO1 (vps-i1, IONOS)
# SSH
ssh root@217.154.82.162
# Check GitHub Actions runner
sudo /opt/actions-runner/svc.sh status
# View agent logs (GitHub Actions runner)
journalctl -u actions.runner.radieu-p24-infra.ionos -n 100
# Check active Claude Code processes
ps aux | grep claudeMax parallel agents: 2–3 (limited by 8 GB RAM; monitoring stack also runs here).
AI-Dev-HS1 (vps-h1, Hostinger)
# SSH
ssh root@72.60.32.61
# View claude-runner session
ps aux | grep claude
ls -la /home/claude-runner/.claude/
# View systemd service if configured
journalctl -u claude-agent -n 100Note: vps-h1 is a PROTECTED host. Role is fixed: WAHA gateway + claude-proxy + monitoring agents. Do not add new Docker services here. Max 1–2 parallel agents given 2 vCPUs.
AI-Dev-BMS4-1 (bms-4, OVH)
# SSH
ssh root@54.36.123.110
# Check claude-runner user sessions
ps aux | grep claude
# View agent logs
journalctl -u claude-agent -n 100 2>/dev/null || \
ls /home/claude-runner/.claude/logs/
# Check available capacity
free -h
nprocMax parallel agents: 4 (8 vCPU, 32 GB RAM — most capable agent host).
4. Claude Proxy
The Claude proxy allows n8n and audit-engine to use Claude without consuming API credits.
| Property | Value |
|---|---|
| Host | vps-h1 (72.60.32.61) |
| Port | 9999 |
| Protocol | HTTP; Anthropic-format JSON (POST /v1/messages) |
| Auth header | X-Proxy-Secret: <value> — stored in Infisical n8n-bms4 project |
| Implementation | Python server wrapping claude -p CLI |
| Run as | claude-runner user on vps-h1 |
Check proxy status
ssh root@72.60.32.61
systemctl status claude-proxy
# or check process:
ps aux | grep "claude-proxy"
curl -s http://localhost:9999/healthRestart proxy
ssh root@72.60.32.61
systemctl restart claude-proxyAudit-engine fallback
The audit-engine on vps-h1 calls claude-proxy:9999 first. If that fails, it falls back to
ANTHROPIC_API_KEY direct call. The fallback consumes API credits — monitor cost.
5. Provisioning a New Agent
When adding a new AI Dev runner (next slots: AI-Dev-IO2, AI-Dev-HS2, AI-Dev-BMS4-2, etc.):
-
Assign label — use the next available slot for the host (IO1→IO2, HS1→HS2, BMS4-1→BMS4-2)
-
Provision VPS — run the
/provision-vpsskill in a p24-infra Claude session; install Claude Code:npm install -g @anthropic-ai/claude-code -
Register email — add Cloudflare Email Routing rule via API:
cf("POST", f"/zones/{ZONE}/email/routing/rules", { "name": "AI-Dev-XXX", "enabled": True, "matchers": [{"type": "literal", "field": "to", "value": "ai-dev-xxx@zintegrowana.online"}], "actions": [{"type": "forward", "value": ["radieu@gmail.com"]}] })Zone ID:
57cb3d8f24c7cc319fb703394edc7b87 -
Create GitHub account — browser signup using
ai-dev-xxx@zintegrowana.online -
Add as collaborator — invite to both repos:
gh api repos/radieu/p24-infra/collaborators/AI-Dev-XXX -X PUT -f permission=write gh api repos/radieu/et-operational-platform/collaborators/AI-Dev-XXX -X PUT -f permission=write -
Create human-action issue in
radieu/p24-infrawith labelhuman-action: “Confirm GitHub invitations for AI-Dev-XXX” -
Accept invitations — human (radieu) clicks both invitation emails, then closes the issue
-
Copy credentials — copy
~/.claude/.credentials.jsonfrom local or another agent:scp C:\Users\konar\.claude\.credentials.json root@<new-vps>:/home/claude-runner/.claude/ chown claude-runner:claude-runner /home/claude-runner/.claude/.credentials.json
6. Monitoring Agent Health
GitHub Actions (primary signal)
- Check
https://github.com/radieu/p24-infra/actionsfor recent workflow runs - Failed runs indicate the agent encountered errors
- Long-pending runs may indicate capacity exhaustion or auth failure
Discord alerts
All auth expiry events and agent errors post to #infra-alerts via Discord webhook.
Variable name: P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL
Cost monitoring
| Agent / month | Claude Max plan | Approx. cost |
|---|---|---|
| Per seat | $10 / 20 accounts flat | ~€9.25/agent |
| 3 active agents | — | ~€27.75/month |
Claude Max subscription is a flat per-account fee — cost does not scale with usage volume.
7. Troubleshooting
Auth expired
Symptom: Discord alert, or GitHub Actions logs show 401 Unauthorized from Claude.
Fix: Run python d:\tmp\reauth-hstgr.py on local Windows machine. Wait ~2 min, then
retry the failing workflow.
Agent stuck / no response
Symptom: GitHub Actions run in progress for >30 min with no output.
Fix:
ssh root@<affected-vps>
ps aux | grep claude
kill -9 <pid> # only if confirmed hungRe-run the GitHub Actions workflow manually after killing the process.
Runner capacity exceeded
Symptom: GitHub Actions jobs queue but never start; runner shows busy.
Fix: Check how many parallel Claude Code processes are running:
ps aux | grep claude | wc -lIf at limit, wait for active jobs to finish. Do not increase parallelism beyond the limits in §1 without first verifying available RAM.
Agent out of date (stale repo)
Symptom: agent operates on old code; missing recent commits.
Fix:
ssh <agent-host>
cd /opt/p24-infra # or wherever repo is cloned
git pull origin main8. Agent Safety Rules
All agents in this fleet must follow these rules without exception:
- Never delete or overwrite without a recovery path — confirm before any irreversible action
- Never cause infrastructure downtime without explicit human permission
- Never push directly to
main,dev,staging, orrc— all changes via PR - Ask human when: action has data loss risk · regression risk · requires human steps you cannot complete autonomously · decision affects shared infrastructure
- Prefer reversible: restart > stop · migrate > drop · rename > delete · branch > overwrite
- Security: never log secrets · never commit
.env· rotate credentials immediately on suspected exposure
9. Related Documentation
- claude-agent-setup.md — original IONOS agent setup reference
- vps-i1-operations.md — IONOS VPS ops
- hostinger-runbook.md — Hostinger VPS ops
- p4-ovh-bms-4-ns3101999-operations.md — bms-4 ops
- claude-proxy-router-operations.md — Claude proxy detail
- disaster-recovery-runbook.md — including agent auth failure recovery