AI Agent Operations Guide

Scope: Claude Code agents running autonomously on p24-infra VPS and bare-metal servers Last updated: 2026-06-18


1. Agent Fleet Overview

LabelHostProviderOSIPvCPURAMMax parallel
AI-Dev-IO1vps-i1IONOSAlmaLinux 9.7217.154.82.16268 GB2–3
AI-Dev-HS1vps-h1HostingerUbuntu 24.0472.60.32.6128 GB1–2
AI-Dev-BMS4-1bms-4OVH KimsufiUbuntu 22.0454.36.123.110832 GB4

All agents run as the claude-runner user on their respective hosts. GitHub Actions runner runs as claude-runner on vps-i1 (IONOS) — it dispatches workflow jobs that invoke Claude Code.


2. Authentication

Claude Code OAuth (all agents)

Claude Code authenticates via OAuth subscription (Claude Max), not via ANTHROPIC_API_KEY.

CredentialLocation on each VPSNotes
Access token/home/claude-runner/.claude/.credentials.json~8–12h TTL; auto-refreshed
Refresh token/home/claude-runner/.claude/.credentials.jsonValid for months

Tokens rotate automatically when Claude Code is actively used. Long-idle agents may expire.

When auth expires

Discord alert fires to #infra-alerts. Re-authenticate by running on the local Windows machine:

python d:\tmp\reauth-hstgr.py

This refreshes the OAuth token and pushes updated credentials to the affected agent. Do not share or log the token values — reference only by path.


3. Per-Agent SSH and Status

AI-Dev-IO1 (vps-i1, IONOS)

# SSH
ssh root@217.154.82.162
 
# Check GitHub Actions runner
sudo /opt/actions-runner/svc.sh status
 
# View agent logs (GitHub Actions runner)
journalctl -u actions.runner.radieu-p24-infra.ionos -n 100
 
# Check active Claude Code processes
ps aux | grep claude

Max parallel agents: 2–3 (limited by 8 GB RAM; monitoring stack also runs here).

AI-Dev-HS1 (vps-h1, Hostinger)

# SSH
ssh root@72.60.32.61
 
# View claude-runner session
ps aux | grep claude
ls -la /home/claude-runner/.claude/
 
# View systemd service if configured
journalctl -u claude-agent -n 100

Note: vps-h1 is a PROTECTED host. Role is fixed: WAHA gateway + claude-proxy + monitoring agents. Do not add new Docker services here. Max 1–2 parallel agents given 2 vCPUs.

AI-Dev-BMS4-1 (bms-4, OVH)

# SSH
ssh root@54.36.123.110
 
# Check claude-runner user sessions
ps aux | grep claude
 
# View agent logs
journalctl -u claude-agent -n 100 2>/dev/null || \
  ls /home/claude-runner/.claude/logs/
 
# Check available capacity
free -h
nproc

Max parallel agents: 4 (8 vCPU, 32 GB RAM — most capable agent host).


4. Claude Proxy

The Claude proxy allows n8n and audit-engine to use Claude without consuming API credits.

PropertyValue
Hostvps-h1 (72.60.32.61)
Port9999
ProtocolHTTP; Anthropic-format JSON (POST /v1/messages)
Auth headerX-Proxy-Secret: <value> — stored in Infisical n8n-bms4 project
ImplementationPython server wrapping claude -p CLI
Run asclaude-runner user on vps-h1

Check proxy status

ssh root@72.60.32.61
systemctl status claude-proxy
# or check process:
ps aux | grep "claude-proxy"
curl -s http://localhost:9999/health

Restart proxy

ssh root@72.60.32.61
systemctl restart claude-proxy

Audit-engine fallback

The audit-engine on vps-h1 calls claude-proxy:9999 first. If that fails, it falls back to ANTHROPIC_API_KEY direct call. The fallback consumes API credits — monitor cost.


5. Provisioning a New Agent

When adding a new AI Dev runner (next slots: AI-Dev-IO2, AI-Dev-HS2, AI-Dev-BMS4-2, etc.):

  1. Assign label — use the next available slot for the host (IO1→IO2, HS1→HS2, BMS4-1→BMS4-2)

  2. Provision VPS — run the /provision-vps skill in a p24-infra Claude session; install Claude Code:

    npm install -g @anthropic-ai/claude-code
  3. Register email — add Cloudflare Email Routing rule via API:

    cf("POST", f"/zones/{ZONE}/email/routing/rules", {
        "name": "AI-Dev-XXX",
        "enabled": True,
        "matchers": [{"type": "literal", "field": "to", "value": "ai-dev-xxx@zintegrowana.online"}],
        "actions": [{"type": "forward", "value": ["radieu@gmail.com"]}]
    })

    Zone ID: 57cb3d8f24c7cc319fb703394edc7b87

  4. Create GitHub account — browser signup using ai-dev-xxx@zintegrowana.online

  5. Add as collaborator — invite to both repos:

    gh api repos/radieu/p24-infra/collaborators/AI-Dev-XXX -X PUT -f permission=write
    gh api repos/radieu/et-operational-platform/collaborators/AI-Dev-XXX -X PUT -f permission=write
  6. Create human-action issue in radieu/p24-infra with label human-action: “Confirm GitHub invitations for AI-Dev-XXX”

  7. Accept invitations — human (radieu) clicks both invitation emails, then closes the issue

  8. Copy credentials — copy ~/.claude/.credentials.json from local or another agent:

    scp C:\Users\konar\.claude\.credentials.json root@<new-vps>:/home/claude-runner/.claude/
    chown claude-runner:claude-runner /home/claude-runner/.claude/.credentials.json

6. Monitoring Agent Health

GitHub Actions (primary signal)

  • Check https://github.com/radieu/p24-infra/actions for recent workflow runs
  • Failed runs indicate the agent encountered errors
  • Long-pending runs may indicate capacity exhaustion or auth failure

Discord alerts

All auth expiry events and agent errors post to #infra-alerts via Discord webhook. Variable name: P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL

Cost monitoring

Agent / monthClaude Max planApprox. cost
Per seat$10 / 20 accounts flat~€9.25/agent
3 active agents~€27.75/month

Claude Max subscription is a flat per-account fee — cost does not scale with usage volume.


7. Troubleshooting

Auth expired

Symptom: Discord alert, or GitHub Actions logs show 401 Unauthorized from Claude.

Fix: Run python d:\tmp\reauth-hstgr.py on local Windows machine. Wait ~2 min, then retry the failing workflow.

Agent stuck / no response

Symptom: GitHub Actions run in progress for >30 min with no output.

Fix:

ssh root@<affected-vps>
ps aux | grep claude
kill -9 <pid>   # only if confirmed hung

Re-run the GitHub Actions workflow manually after killing the process.

Runner capacity exceeded

Symptom: GitHub Actions jobs queue but never start; runner shows busy.

Fix: Check how many parallel Claude Code processes are running:

ps aux | grep claude | wc -l

If at limit, wait for active jobs to finish. Do not increase parallelism beyond the limits in §1 without first verifying available RAM.

Agent out of date (stale repo)

Symptom: agent operates on old code; missing recent commits.

Fix:

ssh <agent-host>
cd /opt/p24-infra   # or wherever repo is cloned
git pull origin main

8. Agent Safety Rules

All agents in this fleet must follow these rules without exception:

  1. Never delete or overwrite without a recovery path — confirm before any irreversible action
  2. Never cause infrastructure downtime without explicit human permission
  3. Never push directly to main, dev, staging, or rc — all changes via PR
  4. Ask human when: action has data loss risk · regression risk · requires human steps you cannot complete autonomously · decision affects shared infrastructure
  5. Prefer reversible: restart > stop · migrate > drop · rename > delete · branch > overwrite
  6. Security: never log secrets · never commit .env · rotate credentials immediately on suspected exposure