Nightly Infra Checks + Hourly DevOps Triage

GitHub issue: #600
Status: Active
Registered in: dev_r_services as nightly-infra-check-agent and hourly-devops-triage-agent


Architecture overview

Two independent scheduled cloud agents managed via /schedule skill (CronCreate):

| Agent | Cron | Prompt |
|-------|------|--------|
| Nightly infra check | 30 0 * * * (00:30 UTC) | .claude/commands/nightly-infra-check.md |
| Hourly DevOps triage | 0 * * * * (every hour) | .claude/commands/hourly-devops-triage.md |

Neither agent runs as a persistent service. They are fire-and-forget cloud agent runs that terminate when their task is done.


Nightly infra check

What it checks

| # | Check | Target | Alert threshold |
|---|-------|--------|-----------------|
| 1 | Docker container health | vps-i1, vps-h1, bms-3, bms-4 | Any container status != running, or restart count increased |
| 2 | MongoDB rs0 member states | bms-2, bms-3, bms-4 | Any member not PRIMARY/SECONDARY/ARBITER, replication lag > 30s |
| 3 | Disk usage | All 5 servers | >= 85% on any mount |
| 4 | Prometheus scrape targets | vps-i1 Prometheus API | Any target state=down or health=down |
| 5 | Prometheus active alerts | vps-i1 Prometheus API | Any alert state=firing |
| 6 | n8n workflow execution errors | bms-4 n8n API | Any error executions in last 24h |
| 7 | Wasabi backup freshness | backup-exporter endpoint | Backup older than 25h |
| 8 | Infisical CE health | Infisical health endpoint | Non-200 response |
| 9 | Documentation compliance gaps | Supabase dev_r_services | compliance_workbook = 'no' for active services |
| 10 | Audit action failures | Supabase audit.actions | last_run_status = 'failed' in last 24h |
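
As a rough illustration of how two of the thresholds above could be evaluated, the sketch below flags mounts at or above 85% from `df -h` output and flags a backup older than 25 hours. The function names and the sample `df` output are made up for illustration; the agent's actual parsing is whatever its prompt file instructs.

```python
from datetime import datetime, timedelta, timezone

DISK_ALERT_PERCENT = 85                 # threshold from check 3
BACKUP_MAX_AGE = timedelta(hours=25)    # threshold from check 7


def full_mounts(df_output: str, threshold: int = DISK_ALERT_PERCENT) -> list[tuple[str, int]]:
    """Return (mount, use%) pairs at or above the threshold from `df -h` output."""
    alerts = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # ignore wrapped or malformed rows
        used = int(fields[4].rstrip("%"))
        if used >= threshold:
            alerts.append((" ".join(fields[5:]), used))
    return alerts


def backup_is_stale(last_backup: datetime, now: datetime) -> bool:
    """True when the newest backup is older than 25 hours."""
    return now - last_backup > BACKUP_MAX_AGE


# Hypothetical sample output, not taken from a real server.
sample = """Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        80G   70G   10G  88% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
/dev/sdb1       200G  100G  100G  50% /data
"""
print(full_mounts(sample))  # [('/', 88)]

now = datetime(2025, 1, 2, 2, 0, tzinfo=timezone.utc)
print(backup_is_stale(datetime(2025, 1, 1, 0, 30, tzinfo=timezone.utc), now))  # True
```

Both comparisons follow the table literally: disk usage alerts at exactly 85%, while a backup alerts only once it is strictly older than 25 hours.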

Issue creation format

For any problem detected:

Title: [nightly-check] {component}: {short description}
Label: bug (infra errors) | human-action (needs manual review) | patch (doc gaps)
Milestone: Triage
Body:
  ## Problem detected by nightly infra check

  **Component:** {component}
  **Severity:** {P1-Critical / P2-Warning / P3-Info}
  **Detected at:** {UTC timestamp}

  ### What was found
  {description}

  ### Raw output

  {relevant snippet}

  ### Suggested action
  {what the agent recommends}
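
The issue format above can be sketched as a small rendering function. This is only an illustration: `render_issue`, its parameters, the returned dictionary shape, and the timestamp format are assumptions, not part of the agent's prompt.

```python
from datetime import datetime, timezone

ALLOWED_LABELS = {"bug", "human-action", "patch"}

BODY_TEMPLATE = """## Problem detected by nightly infra check

**Component:** {component}
**Severity:** {severity}
**Detected at:** {detected_at}

### What was found
{description}

### Raw output

{raw_output}

### Suggested action
{suggested_action}
"""


def render_issue(component, short_description, severity, description,
                 raw_output, suggested_action, label, detected_at):
    """Build the title, label, milestone, and body for one detected problem."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label}")
    return {
        "title": f"[nightly-check] {component}: {short_description}",
        "labels": [label],
        "milestone": "Triage",
        "body": BODY_TEMPLATE.format(
            component=component,
            severity=severity,
            # Timestamp format is an assumption; the doc only says "UTC timestamp".
            detected_at=detected_at.strftime("%Y-%m-%d %H:%M UTC"),
            description=description,
            raw_output=raw_output,
            suggested_action=suggested_action,
        ),
    }


# Hypothetical finding, for illustration only.
issue = render_issue(
    component="Disk",
    short_description="/ at 88% on bms-3",
    severity="P2-Warning",
    description="Root mount is above the 85% threshold.",
    raw_output="/dev/sda1  80G  70G  10G  88% /",
    suggested_action="Prune unused Docker images.",
    label="bug",
    detected_at=datetime(2025, 1, 2, 0, 30, tzinfo=timezone.utc),
)
print(issue["title"])
```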

Clean-run summary

When all checks pass, the agent comments on the pinned “nightly health” issue (search for [nightly-health-status] in open issues). If no such issue exists, it creates one.

Comment format:

## Nightly check — {YYYY-MM-DD} UTC

All checks passed.

| Check | Status |
|-------|--------|
| Docker containers (all hosts) | OK |
| MongoDB rs0 | OK |
| Disk usage | OK |
| Prometheus targets | OK |
| Active alerts | OK |
| n8n executions | OK |
| Wasabi backup freshness | OK |
| Infisical CE | OK |
| Documentation compliance | OK |
| Audit action status | OK |

Error notification standard

If the agent itself crashes (unhandled exception):

  1. Discord embed via P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL
  2. GitHub issue: [nightly-check] Agent crash: {error message}, label=bug, milestone=Triage
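
A minimal sketch of the Discord half of this standard, assuming the webhook URL is exposed as the environment variable `P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL`. The function names, embed colour, and User-Agent string are hypothetical. The truncation reflects Discord's embed limits of 256 characters for the title and 4096 for the description.

```python
import json
import os
import urllib.request


def build_crash_embed(agent: str, error: str) -> dict:
    """Build a webhook payload with one embed describing the crash."""
    return {
        "embeds": [{
            "title": f"{agent} crashed"[:256],
            "description": error[:4096],
            "color": 0xE74C3C,  # red; the colour choice is an assumption
        }]
    }


def notify_discord(payload: dict) -> None:
    """POST the payload to the webhook URL read from the environment."""
    url = os.environ["P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL"]
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Hypothetical value; an explicit User-Agent avoids relying on urllib's default.
            "User-Agent": "p24-infra-agent/0.1",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


payload = build_crash_embed("nightly-infra-check", "KeyError: 'members'")
print(json.dumps(payload, indent=2))
```

Building the payload separately from sending it keeps the crash handler easy to test without network access.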

Hourly DevOps triage

Flow

1. Fetch open GH issues with milestone=Triage (radieu/p24-infra)
2. For each issue:
   a. Read title + body + labels
   b. Classify: auto-fixable or human-required
   c. Auto-fixable → fix → commit → PR → move issue to In Progress
   d. Human-required → add label human-action → post triage comment → Telegram alert
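
The flow above amounts to a simple dispatch loop. The sketch below shows its shape only: `triage` and the three injected callables are hypothetical names, and the real fetching, fixing, and notification steps are left to the caller.

```python
from typing import Callable, Iterable


def triage(issues: Iterable[dict],
           classify: Callable[[dict], str],
           auto_fix: Callable[[dict], None],
           escalate: Callable[[dict], None]) -> dict[str, list[int]]:
    """Route each Triage issue to auto_fix or escalate and record its number."""
    handled: dict[str, list[int]] = {"auto-fixable": [], "human-required": []}
    for issue in issues:
        if classify(issue) == "auto-fixable":
            auto_fix(issue)      # fix, commit, PR, move issue to In Progress
            handled["auto-fixable"].append(issue["number"])
        else:
            escalate(issue)      # label human-action, triage comment, Telegram alert
            handled["human-required"].append(issue["number"])
    return handled


# Hypothetical issues and stub callables, for illustration only.
issues = [
    {"number": 101, "title": "[nightly-check] Docker: grafana restarting"},
    {"number": 102, "title": "[nightly-check] MongoDB: member in RECOVERING"},
]
result = triage(
    issues,
    classify=lambda i: "auto-fixable" if "Docker" in i["title"] else "human-required",
    auto_fix=lambda i: None,
    escalate=lambda i: None,
)
print(result)  # {'auto-fixable': [101], 'human-required': [102]}
```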

Auto-fixable categories

| Category | Action |
|----------|--------|
| Docker container down/restarting (simple) | SSH, docker restart {container}, verify running |
| Prometheus config reload needed | curl -X POST http://localhost:9090/-/reload |
| Documentation compliance gap (compliance_workbook=no) | Update dev_r_services row, set workbook_url |
| Stale credential rotation reminder | Post reminder comment, update next_due in dev_r_services |
| Grafana dashboard error (simple config) | Fix config file, reload |

Safety rule: the agent auto-fixes only when it is confident, meaning a keyword match plus a known safe pattern. When in doubt, it escalates the issue as human-required.
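
One way to read the safety rule is as a classifier that defaults to escalation. The sketch below is an assumption about how "keyword match + known safe pattern" might be encoded. The keyword list and regular expressions are illustrative, not the agent's real rules.

```python
import re

# Any of these words forces escalation, even if a safe pattern also matches.
RISK_KEYWORDS = ("data loss", "corrupt", "reconfig", "quorum")

# Title patterns considered safe to auto-fix (hypothetical examples).
SAFE_PATTERNS = {
    "docker-restart": re.compile(r"^\[nightly-check\] Docker:", re.IGNORECASE),
    "prometheus-reload": re.compile(r"prometheus.*config reload", re.IGNORECASE),
    "docs-gap": re.compile(r"^\[nightly-check\] Docs:", re.IGNORECASE),
}


def classify(issue: dict) -> str:
    """Return 'auto-fixable' only for a known safe pattern with no risk keyword."""
    text = f"{issue.get('title', '')}\n{issue.get('body', '')}".lower()
    if any(keyword in text for keyword in RISK_KEYWORDS):
        return "human-required"
    if any(p.search(issue.get("title", "")) for p in SAFE_PATTERNS.values()):
        return "auto-fixable"
    return "human-required"  # unknown pattern: escalate rather than guess


print(classify({"title": "[nightly-check] Docker: grafana restarting", "body": ""}))
print(classify({"title": "[nightly-check] Docker: mongo down", "body": "possible data loss"}))
print(classify({"title": "Something odd on bms-4", "body": ""}))
```

Checking risk keywords first means a match on a safe pattern can never override a signal that a human should look.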

Human-required categories

| Category | Why human needed |
|----------|------------------|
| Data loss risk | Any action that could delete or corrupt data |
| Multi-step infra change | Requires coordination across multiple services |
| Secret/credential rotation | Must be done by a human with vault access |
| Production service restart | Potential downtime, needs explicit approval |
| MongoDB rs0 reconfiguration | Voting/quorum changes have outage risk |
| Unknown error pattern | Agent cannot safely determine cause |
| Disk full (> 95%) | Requires human to decide what to delete |

Telegram notifications

Bot credentials (already configured from IS-595):

  • Token: waha-telegram-bot-token (stored in .env.local and Infisical)
  • Chat ID: WAHA_TELEGRAM_CHAT_ID

Notification format:

P24 Infra — Human action required

Issue #NNN: {title}
Reason: {why human is needed}
Link: https://github.com/radieu/p24-infra/issues/NNN

API call:

POST https://api.telegram.org/bot{TOKEN}/sendMessage
{
  "chat_id": "{WAHA_TELEGRAM_CHAT_ID}",
  "text": "...",
  "parse_mode": "HTML"
}
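
A minimal sketch of the call above using only the standard library. The function names are hypothetical, and the token and chat ID are assumed to be supplied by the caller. Because `parse_mode` is `HTML`, the characters `&`, `<`, and `>` in issue titles or reasons must be escaped, otherwise Telegram may reject the message.

```python
import html
import json
import urllib.request

API_URL = "https://api.telegram.org/bot{token}/sendMessage"
ISSUE_URL = "https://github.com/radieu/p24-infra/issues/{number}"


def build_payload(chat_id: str, number: int, title: str, reason: str) -> dict:
    """Render the notification text and wrap it in a sendMessage payload."""
    text = (
        # \u2014 is the dash used in the notification format above.
        "P24 Infra \u2014 Human action required\n\n"
        f"Issue #{number}: {html.escape(title, quote=False)}\n"
        f"Reason: {html.escape(reason, quote=False)}\n"
        f"Link: {ISSUE_URL.format(number=number)}"
    )
    return {"chat_id": chat_id, "text": text, "parse_mode": "HTML"}


def send(token: str, payload: dict) -> dict:
    """POST the payload as JSON and return Telegram's decoded response."""
    request = urllib.request.Request(
        API_URL.format(token=token),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Hypothetical chat ID and issue, for illustration only.
payload = build_payload("123456", 612, "Disk <root> at 96% & rising", "Disk full (> 95%)")
print(payload["text"])
```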

Triage comment format

Posted on each human-required issue:

## Hourly triage — {UTC timestamp}
 
**Classification:** Human action required
**Reason:** {explanation}
**Urgency:** {P1 / P2 / P3}
 
Telegram notification sent to operator.

Error notification standard

Same as nightly check: Discord embed + GH issue on agent crash.


Activating the cron agents

Use the /schedule skill to register both commands:

/schedule nightly-infra-check  cron: 30 0 * * *   prompt-file: .claude/commands/nightly-infra-check.md
/schedule hourly-devops-triage cron: 0 * * * *    prompt-file: .claude/commands/hourly-devops-triage.md

Both agents run as cloud agents, not on a VPS. They open outbound SSH connections from the cloud agent to the VPS nodes.


Manually triggering either agent

# Nightly check — one-shot
claude --file .claude/commands/nightly-infra-check.md
 
# Hourly triage — one-shot
claude --file .claude/commands/hourly-devops-triage.md

Or from this repo on Windows:

claude --file "C:\code_2026\p24-infra\.claude\commands\nightly-infra-check.md"
claude --file "C:\code_2026\p24-infra\.claude\commands\hourly-devops-triage.md"

Interpreting generated issues

Issues created by these agents always carry the title prefix [nightly-check] or [triage].

| Prefix | Source |
|--------|--------|
| [nightly-check] Docker: | Container unhealthy or restarting |
| [nightly-check] MongoDB: | Replica set problem |
| [nightly-check] Disk: | Disk >= 85% on a server |
| [nightly-check] Prometheus: | Scrape target down or alert firing |
| [nightly-check] n8n: | Workflow execution error |
| [nightly-check] Backup: | Wasabi backup stale |
| [nightly-check] Infisical: | Secret sync problem |
| [nightly-check] Docs: | Documentation compliance gap |
| [nightly-check] Audit: | audit.actions failure |
| [triage] Auto-fixed: | Issue was fixed autonomously (body has PR link) |
| [triage] Human-required: | Telegram sent, awaiting human action |

SSH access used by the nightly agent

The agent uses the standard p24-infra SSH key (id_ed25519) and connects as:

| Server | User | IP |
|--------|------|----|
| vps-i1 | root | 217.154.82.162 |
| vps-h1 | root | 72.60.32.61 |
| bms-2 | ubuntu | 145.239.133.104 |
| bms-3 | ubuntu | 51.68.155.224 |
| bms-4 | ubuntu | 54.36.123.110 |

Commands used: docker ps --format json, df -h, mongosh --eval 'JSON.stringify(rs.status())' (via SSH exec — read-only).
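
The output of two of these commands could be interpreted roughly as follows. This is a sketch under assumptions: the function names and sample data are made up. It relies on `docker ps --format json` printing one JSON object per line with `Names` and `State` fields, and on `rs.status()` exposing `name` and `stateStr` for each member. Note that `docker ps` without `-a` does not list exited containers, so a stopped container shows up as missing rather than as a non-running state.

```python
import json

HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def unhealthy_containers(docker_ps_output: str) -> list[str]:
    """Names of listed containers whose State is not 'running'."""
    names = []
    for line in docker_ps_output.splitlines():
        if not line.strip():
            continue
        container = json.loads(line)  # one JSON object per line
        if container.get("State") != "running":
            names.append(container.get("Names", "?"))
    return names


def unhealthy_members(rs_status_json: str) -> list[tuple[str, str]]:
    """(member, state) pairs for members outside PRIMARY/SECONDARY/ARBITER."""
    status = json.loads(rs_status_json)
    return [
        (member["name"], member["stateStr"])
        for member in status.get("members", [])
        if member["stateStr"] not in HEALTHY_STATES
    ]


# Hypothetical sample output, not captured from the real hosts.
docker_sample = "\n".join([
    '{"Names": "n8n", "State": "running"}',
    '{"Names": "grafana", "State": "restarting"}',
])
rs_sample = json.dumps({"members": [
    {"name": "bms-2:27017", "stateStr": "PRIMARY"},
    {"name": "bms-3:27017", "stateStr": "RECOVERING"},
    {"name": "bms-4:27017", "stateStr": "SECONDARY"},
]})

print(unhealthy_containers(docker_sample))  # ['grafana']
print(unhealthy_members(rs_sample))         # [('bms-3:27017', 'RECOVERING')]
```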