AI Agent Operations Guide

Scope: Claude Code agents running autonomously on p24-infra VPS and bare-metal servers Last updated: 2026-06-18

1. Agent Fleet Overview

Label	Host	Provider	OS	IP	vCPU	RAM	Max parallel
`AI-Dev-IO1`	vps-i1	IONOS	AlmaLinux 9.7	`217.154.82.162`	6	8 GB	2–3
`AI-Dev-HS1`	vps-h1	Hostinger	Ubuntu 24.04	`72.60.32.61`	2	8 GB	1–2
`AI-Dev-BMS4-1`	bms-4	OVH Kimsufi	Ubuntu 22.04	`54.36.123.110`	8	32 GB	4

All agents run as the claude-runner user on their respective hosts. GitHub Actions runner runs as claude-runner on vps-i1 (IONOS) — it dispatches workflow jobs that invoke Claude Code.

2. Authentication

Claude Code OAuth (all agents)

Claude Code authenticates via OAuth subscription (Claude Max), not via ANTHROPIC_API_KEY.

Credential	Location on each VPS	Notes
Access token	`/home/claude-runner/.claude/.credentials.json`	~8–12h TTL; auto-refreshed
Refresh token	`/home/claude-runner/.claude/.credentials.json`	Valid for months

Tokens rotate automatically when Claude Code is actively used. Long-idle agents may expire.

When auth expires

Discord alert fires to #infra-alerts. Re-authenticate by running on the local Windows machine:

python d:\tmp\reauth-hstgr.py

This refreshes the OAuth token and pushes updated credentials to the affected agent. Do not share or log the token values — reference only by path.

3. Per-Agent SSH and Status

AI-Dev-IO1 (vps-i1, IONOS)

# SSH
ssh root@217.154.82.162
 
# Check GitHub Actions runner
sudo /opt/actions-runner/svc.sh status
 
# View agent logs (GitHub Actions runner)
journalctl -u actions.runner.radieu-p24-infra.ionos -n 100
 
# Check active Claude Code processes
ps aux | grep claude

Max parallel agents: 2–3 (limited by 8 GB RAM; monitoring stack also runs here).

AI-Dev-HS1 (vps-h1, Hostinger)

# SSH
ssh root@72.60.32.61
 
# View claude-runner session
ps aux | grep claude
ls -la /home/claude-runner/.claude/
 
# View systemd service if configured
journalctl -u claude-agent -n 100

Note: vps-h1 is a PROTECTED host. Role is fixed: WAHA gateway + claude-proxy + monitoring agents. Do not add new Docker services here. Max 1–2 parallel agents given 2 vCPUs.

AI-Dev-BMS4-1 (bms-4, OVH)

# SSH
ssh root@54.36.123.110
 
# Check claude-runner user sessions
ps aux | grep claude
 
# View agent logs
journalctl -u claude-agent -n 100 2>/dev/null || \
  ls /home/claude-runner/.claude/logs/
 
# Check available capacity
free -h
nproc

Max parallel agents: 4 (8 vCPU, 32 GB RAM — most capable agent host).

4. Claude Proxy

The Claude proxy allows n8n and audit-engine to use Claude without consuming API credits.

Property	Value
Host	vps-h1 (`72.60.32.61`)
Port	`9999`
Protocol	HTTP; Anthropic-format JSON (`POST /v1/messages`)
Auth header	`X-Proxy-Secret: <value>` — stored in Infisical `n8n-bms4` project
Implementation	Python server wrapping `claude -p` CLI
Run as	`claude-runner` user on vps-h1

Check proxy status

ssh root@72.60.32.61
systemctl status claude-proxy
# or check process:
ps aux | grep "claude-proxy"
curl -s http://localhost:9999/health

Restart proxy

ssh root@72.60.32.61
systemctl restart claude-proxy

Audit-engine fallback

The audit-engine on vps-h1 calls claude-proxy:9999 first. If that fails, it falls back to ANTHROPIC_API_KEY direct call. The fallback consumes API credits — monitor cost.

5. Provisioning a New Agent

When adding a new AI Dev runner (next slots: AI-Dev-IO2, AI-Dev-HS2, AI-Dev-BMS4-2, etc.):

Assign label — use the next available slot for the host (IO1→IO2, HS1→HS2, BMS4-1→BMS4-2)
Provision VPS — run the /provision-vps skill in a p24-infra Claude session; install Claude Code:
```
npm install -g @anthropic-ai/claude-code
```

Register email — add Cloudflare Email Routing rule via API:

cf("POST", f"/zones/{ZONE}/email/routing/rules", {
    "name": "AI-Dev-XXX",
    "enabled": True,
    "matchers": [{"type": "literal", "field": "to", "value": "ai-dev-xxx@zintegrowana.online"}],
    "actions": [{"type": "forward", "value": ["radieu@gmail.com"]}]
})

Zone ID: 57cb3d8f24c7cc319fb703394edc7b87

Create GitHub account — browser signup using ai-dev-xxx@zintegrowana.online

Add as collaborator — invite to both repos:

gh api repos/radieu/p24-infra/collaborators/AI-Dev-XXX -X PUT -f permission=write
gh api repos/radieu/et-operational-platform/collaborators/AI-Dev-XXX -X PUT -f permission=write

Create human-action issue in radieu/p24-infra with label human-action: “Confirm GitHub invitations for AI-Dev-XXX”
Accept invitations — human (radieu) clicks both invitation emails, then closes the issue

Copy credentials — copy ~/.claude/.credentials.json from local or another agent:

scp C:\Users\konar\.claude\.credentials.json root@<new-vps>:/home/claude-runner/.claude/
chown claude-runner:claude-runner /home/claude-runner/.claude/.credentials.json

6. Monitoring Agent Health

GitHub Actions (primary signal)

Check https://github.com/radieu/p24-infra/actions for recent workflow runs
Failed runs indicate the agent encountered errors
Long-pending runs may indicate capacity exhaustion or auth failure

Discord alerts

All auth expiry events and agent errors post to #infra-alerts via Discord webhook. Variable name: P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL

Cost monitoring

Agent / month	Claude Max plan	Approx. cost
Per seat	$10 / 20 accounts flat	~€9.25/agent
3 active agents	—	~€27.75/month

Claude Max subscription is a flat per-account fee — cost does not scale with usage volume.

7. Troubleshooting

Auth expired

Symptom: Discord alert, or GitHub Actions logs show 401 Unauthorized from Claude.

Fix: Run python d:\tmp\reauth-hstgr.py on local Windows machine. Wait ~2 min, then retry the failing workflow.

Agent stuck / no response

Symptom: GitHub Actions run in progress for >30 min with no output.

Fix:

ssh root@<affected-vps>
ps aux | grep claude
kill -9 <pid>   # only if confirmed hung

Re-run the GitHub Actions workflow manually after killing the process.

Runner capacity exceeded

Symptom: GitHub Actions jobs queue but never start; runner shows busy.

Fix: Check how many parallel Claude Code processes are running:

ps aux | grep claude | wc -l

If at limit, wait for active jobs to finish. Do not increase parallelism beyond the limits in §1 without first verifying available RAM.

Agent out of date (stale repo)

Symptom: agent operates on old code; missing recent commits.

Fix:

ssh <agent-host>
cd /opt/p24-infra   # or wherever repo is cloned
git pull origin main

8. Agent Safety Rules

All agents in this fleet must follow these rules without exception:

Never delete or overwrite without a recovery path — confirm before any irreversible action
Never cause infrastructure downtime without explicit human permission
Never push directly to main, dev, staging, or rc — all changes via PR
Ask human when: action has data loss risk · regression risk · requires human steps you cannot complete autonomously · decision affects shared infrastructure
Prefer reversible: restart > stop · migrate > drop · rename > delete · branch > overwrite
Security: never log secrets · never commit .env · rotate credentials immediately on suspected exposure

claude-agent-setup.md — original IONOS agent setup reference
vps-i1-operations.md — IONOS VPS ops
hostinger-runbook.md — Hostinger VPS ops
p4-ovh-bms-4-ns3101999-operations.md — bms-4 ops
claude-proxy-router-operations.md — Claude proxy detail
disaster-recovery-runbook.md — including agent auth failure recovery

p24-infra Docs

Explorer

ai-agent-operations

AI Agent Operations Guide

1. Agent Fleet Overview

2. Authentication

Claude Code OAuth (all agents)

When auth expires

3. Per-Agent SSH and Status

AI-Dev-IO1 (vps-i1, IONOS)

AI-Dev-HS1 (vps-h1, Hostinger)

AI-Dev-BMS4-1 (bms-4, OVH)

4. Claude Proxy

Check proxy status

Restart proxy

Audit-engine fallback

5. Provisioning a New Agent

6. Monitoring Agent Health

GitHub Actions (primary signal)

Discord alerts

Cost monitoring

7. Troubleshooting

Auth expired

Agent stuck / no response

Runner capacity exceeded

Agent out of date (stale repo)

8. Agent Safety Rules

Graph View

Table of Contents

p24-infra Docs

Explorer

ai-agent-operations

AI Agent Operations Guide

1. Agent Fleet Overview

2. Authentication

Claude Code OAuth (all agents)

When auth expires

3. Per-Agent SSH and Status

AI-Dev-IO1 (vps-i1, IONOS)

AI-Dev-HS1 (vps-h1, Hostinger)

AI-Dev-BMS4-1 (bms-4, OVH)

4. Claude Proxy

Check proxy status

Restart proxy

Audit-engine fallback

5. Provisioning a New Agent

6. Monitoring Agent Health

GitHub Actions (primary signal)

Discord alerts

Cost monitoring

7. Troubleshooting

Auth expired

Agent stuck / no response

Runner capacity exceeded

Agent out of date (stale repo)

8. Agent Safety Rules

9. Related Documentation

Graph View

Table of Contents