Nightly Infra Checks + Hourly DevOps Triage
GitHub issue: #600
Status: Active
Registered in: dev_r_services as nightly-infra-check-agent and hourly-devops-triage-agent
Architecture overview
Two independent scheduled cloud agents, managed via the /schedule skill (CronCreate):
| Agent | Cron | Prompt |
|---|---|---|
| Nightly infra check | 30 0 * * * (00:30 UTC) | .claude/commands/nightly-infra-check.md |
| Hourly DevOps triage | 0 * * * * (every hour) | .claude/commands/hourly-devops-triage.md |
Neither agent runs as a persistent service. They are fire-and-forget cloud agent runs that terminate when their task is done.
Nightly infra check
What it checks
| # | Check | Target | Alert threshold |
|---|---|---|---|
| 1 | Docker container health | vps-i1, vps-h1, bms-3, bms-4 | Any container status != running, or restart count increased |
| 2 | MongoDB rs0 member states | bms-2, bms-3, bms-4 | Any member not PRIMARY/SECONDARY/ARBITER, replication lag > 30s |
| 3 | Disk usage | All 5 servers | >= 85% on any mount |
| 4 | Prometheus scrape targets | vps-i1 Prometheus API | Any target state=down or health=down |
| 5 | Prometheus active alerts | vps-i1 Prometheus API | Any alert state=firing |
| 6 | n8n workflow execution errors | bms-4 n8n API | Any error executions in last 24h |
| 7 | Wasabi backup freshness | backup-exporter endpoint | Backup older than 25h |
| 8 | Infisical CE health | Infisical health endpoint | Non-200 response |
| 9 | Documentation compliance gaps | Supabase dev_r_services | compliance_workbook = 'no' for active services |
| 10 | Audit action failures | Supabase audit.actions | last_run_status = 'failed' in last 24h |
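The numeric thresholds in the table can be expressed as small predicates. The following is a minimal sketch, not the agent's actual implementation: the function names and the input shapes (a mount-to-percent mapping, member dicts with `name`, `state`, and `lag_seconds` keys) are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the check table above.
DISK_ALERT_PERCENT = 85                  # check 3: >= 85% on any mount
BACKUP_MAX_AGE = timedelta(hours=25)     # check 7: backup older than 25h
REPLICATION_LAG_MAX_SECONDS = 30         # check 2: replication lag > 30s
HEALTHY_MEMBER_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def disk_alerts(usage_by_mount):
    """Return the mounts whose usage is at or above the alert threshold."""
    return sorted(m for m, pct in usage_by_mount.items() if pct >= DISK_ALERT_PERCENT)


def backup_is_stale(last_backup, now):
    """True when the newest backup is strictly older than 25 hours."""
    return now - last_backup > BACKUP_MAX_AGE


def mongo_member_alerts(members):
    """Flag members in an unexpected state or lagging by more than 30s.

    `members` is a hypothetical, pre-digested list of dicts with the keys
    name, state, and lag_seconds; it is not the raw rs.status() document.
    """
    alerts = []
    for member in members:
        if member["state"] not in HEALTHY_MEMBER_STATES:
            alerts.append(f"{member['name']}: state {member['state']}")
        elif member.get("lag_seconds", 0) > REPLICATION_LAG_MAX_SECONDS:
            alerts.append(f"{member['name']}: replication lag {member['lag_seconds']}s")
    return alerts
```

Note the boundary behaviour: disk usage of exactly 85% alerts (the table says >=), while a backup exactly 25h old or a lag of exactly 30s does not.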
Issue creation format
For any problem detected:
Title: [nightly-check] {component}: {short description}
Label: bug (infra errors) | human-action (needs manual review) | patch (doc gaps)
Milestone: Triage
Body:
## Problem detected by nightly infra check
**Component:** {component}
**Severity:** {P1-Critical / P2-Warning / P3-Info}
**Detected at:** {UTC timestamp}
### What was found
{description}
### Raw output
{relevant snippet}
### Suggested action
{what the agent recommends}
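As an illustration of the template above, the sketch below renders one issue from its fields. The function name, the returned dict shape, and the timestamp format are assumptions; in particular, the dict is not a GitHub API payload, and a real client would have to resolve the milestone name to whatever identifier its API expects.

```python
from datetime import datetime, timezone

ALLOWED_LABELS = {"bug", "human-action", "patch"}
ALLOWED_SEVERITIES = {"P1-Critical", "P2-Warning", "P3-Info"}


def build_nightly_issue(component, summary, severity, description,
                        raw_output, suggestion, label, detected_at):
    """Render the title, body, label, and milestone for one detected problem."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label}")
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unexpected severity: {severity}")
    # Timestamp layout is an assumption; the template only asks for a UTC timestamp.
    timestamp = detected_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    body = "\n".join([
        "## Problem detected by nightly infra check",
        f"**Component:** {component}",
        f"**Severity:** {severity}",
        f"**Detected at:** {timestamp}",
        "### What was found",
        description,
        "### Raw output",
        raw_output,
        "### Suggested action",
        suggestion,
    ])
    return {
        "title": f"[nightly-check] {component}: {summary}",
        "body": body,
        "labels": [label],
        "milestone": "Triage",  # milestone name, not an API identifier
    }
```

Validating the label and severity up front keeps malformed issues out of the Triage milestone, where the hourly agent would otherwise have to classify them.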
Clean-run summary
When all checks pass, the agent comments on the pinned “nightly health” issue (search for [nightly-health-status] in open issues). If no such issue exists, it creates one.
Comment format:
## Nightly check — {YYYY-MM-DD} UTC
All checks passed.
| Check | Status |
|-------|--------|
| Docker containers (all hosts) | OK |
| MongoDB rs0 | OK |
| Disk usage | OK |
| Prometheus targets | OK |
| Active alerts | OK |
| n8n executions | OK |
| Wasabi backup freshness | OK |
| Infisical CE | OK |
| Documentation compliance | OK |
| Audit action status | OK |
Error notification standard
If the agent itself crashes (unhandled exception):
- Discord embed via P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL
- GitHub issue: [nightly-check] Agent crash: {error message}, label=bug, milestone=Triage
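A crash notification of this kind could be posted to the Discord webhook roughly as follows. This is a sketch under assumptions: the embed title, colour, and User-Agent string are illustrative choices, and only the environment variable name comes from this document. The truncation limits follow Discord's documented embed limits (256 characters for the title, 4096 for the description).

```python
import json
import os
import urllib.request
from datetime import datetime, timezone

WEBHOOK_ENV = "P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL"


def build_crash_payload(agent, error_message, occurred_at):
    """Build a webhook body with a single embed describing the crash."""
    return {
        "embeds": [{
            "title": f"{agent} crashed"[:256],          # illustrative title
            "description": error_message[:4096],
            "color": 0xE74C3C,                           # red; illustrative
            "timestamp": occurred_at.astimezone(timezone.utc).isoformat(),
        }]
    }


def send_crash_embed(payload):
    """POST the payload to the webhook URL read from the environment."""
    request = urllib.request.Request(
        os.environ[WEBHOOK_ENV],
        data=json.dumps(payload).encode("utf-8"),
        # An explicit User-Agent is set because some endpoints reject
        # urllib's default one; the value here is hypothetical.
        headers={"Content-Type": "application/json", "User-Agent": "p24-infra-agent"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping payload construction separate from the network call makes the formatting testable without a live webhook.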
Hourly DevOps triage
Flow
1. Fetch open GH issues with milestone=Triage (radieu/p24-infra)
2. For each issue:
   a. Read title + body + labels
   b. Classify: auto-fixable or human-required
   c. Auto-fixable → fix → commit → PR → move issue to In Progress
   d. Human-required → add label human-action → post triage comment → Telegram alert
Auto-fixable categories
| Category | Action |
|---|---|
| Docker container down/restarting (simple) | SSH, docker restart {container}, verify running |
| Prometheus config reload needed | curl -X POST http://localhost:9090/-/reload |
| Documentation compliance gap (compliance_workbook=no) | Update dev_r_services row, set workbook_url |
| Stale credential rotation reminder | Post reminder comment, update next_due in dev_r_services |
| Grafana dashboard error (simple config) | Fix config file, reload |
Safety rule: the agent auto-fixes only when it is confident (keyword match plus a known safe pattern). When in doubt, it escalates the issue as human-required.
Human-required categories
| Category | Why human needed |
|---|---|
| Data loss risk | Any action that could delete or corrupt data |
| Multi-step infra change | Requires coordination across multiple services |
| Secret/credential rotation | Must be done by a human with vault access |
| Production service restart | Potential downtime — needs explicit approval |
| MongoDB rs0 reconfiguration | Voting/quorum changes have outage risk |
| Unknown error pattern | Agent cannot safely determine cause |
| Disk full (> 95%) | Requires human to decide what to delete |
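The classification step can be sketched as a conservative rule matcher: human-required rules are checked first, auto-fix rules second, and anything unmatched falls through to "Unknown error pattern". The regular expressions below are hypothetical examples chosen for illustration; the agent's real keyword lists are not given in this document.

```python
import re

# Hypothetical patterns; only the category names come from the tables above.
HUMAN_REQUIRED = [
    ("Data loss risk", r"data loss|corrupt"),
    ("MongoDB rs0 reconfiguration", r"\brs0\b.*reconfig|reconfig.*\brs0\b"),
]
AUTO_FIXABLE = [
    ("Docker container down/restarting", r"container .*(down|restarting|unhealthy)"),
    ("Prometheus config reload needed", r"prometheus.*reload"),
    ("Documentation compliance gap", r"compliance_workbook"),
]
DISK_PERCENT = re.compile(r"disk\b.*?(\d+(?:\.\d+)?)\s*%")


def classify(title, body):
    """Return (classification, category) for one Triage issue."""
    text = f"{title}\n{body}".lower()
    # Human-required rules win over any auto-fix match.
    for category, pattern in HUMAN_REQUIRED:
        if re.search(pattern, text):
            return ("human-required", category)
    disk = DISK_PERCENT.search(text)
    if disk and float(disk.group(1)) > 95:
        return ("human-required", "Disk full (> 95%)")
    for category, pattern in AUTO_FIXABLE:
        if re.search(pattern, text):
            return ("auto-fixable", category)
    # No known safe pattern matched: escalate rather than guess.
    return ("human-required", "Unknown error pattern")
```

Checking the human-required rules first means an issue that mentions both a restartable container and a data-loss risk is escalated, which matches the safety rule.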
Telegram notifications
Bot credentials (already configured from IS-595):
- Token: waha-telegram-bot-token (stored in .env.local and Infisical)
- Chat ID: WAHA_TELEGRAM_CHAT_ID
Notification format:
P24 Infra — Human action required
Issue #NNN: {title}
Reason: {why human is needed}
Link: https://github.com/radieu/p24-infra/issues/NNN
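The notification text and the corresponding Telegram Bot API sendMessage request might be assembled as in the sketch below. The function names are assumptions. Because the message is sent with parse_mode HTML, the issue title and reason are HTML-escaped so that characters such as `<` or `&` in an issue title do not break parsing.

```python
import html
import json
import urllib.request

REPO_URL = "https://github.com/radieu/p24-infra"


def build_message(issue_number, title, reason):
    """Render the four-line notification text described above."""
    return "\n".join([
        "P24 Infra \u2014 Human action required",
        f"Issue #{issue_number}: {html.escape(title, quote=False)}",
        f"Reason: {html.escape(reason, quote=False)}",
        f"Link: {REPO_URL}/issues/{issue_number}",
    ])


def build_request(token, chat_id, text):
    """Prepare (but do not send) the sendMessage POST request."""
    payload = {"chat_id": chat_id, "text": text, "parse_mode": "HTML"}
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a matter of passing the prepared request to `urllib.request.urlopen` with a timeout; the token and chat ID would come from the credentials listed above.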
API call:
POST https://api.telegram.org/bot{TOKEN}/sendMessage
{
"chat_id": "{WAHA_TELEGRAM_CHAT_ID}",
"text": "...",
"parse_mode": "HTML"
}
Triage comment format
Posted on each human-required issue:
## Hourly triage — {UTC timestamp}
**Classification:** Human action required
**Reason:** {explanation}
**Urgency:** {P1 / P2 / P3}
Telegram notification sent to operator.
Error notification standard
Same as nightly check: Discord embed + GH issue on agent crash.
Activating the cron agents
Use the /schedule skill to register both commands:
/schedule nightly-infra-check cron: 30 0 * * * prompt-file: .claude/commands/nightly-infra-check.md
/schedule hourly-devops-triage cron: 0 * * * * prompt-file: .claude/commands/hourly-devops-triage.md
Both agents run as cloud agents, not on any of the servers. Each run opens outbound SSH connections from the cloud agent to the servers it checks.
Manually triggering either agent
# Nightly check — one-shot
claude --file .claude/commands/nightly-infra-check.md
# Hourly triage — one-shot
claude --file .claude/commands/hourly-devops-triage.md
Or from this repo on Windows:
claude --file "C:\code_2026\p24-infra\.claude\commands\nightly-infra-check.md"
claude --file "C:\code_2026\p24-infra\.claude\commands\hourly-devops-triage.md"
Interpreting generated issues
Issues created by these agents always have title prefix [nightly-check] or [triage].
| Prefix | Source |
|---|---|
| [nightly-check] Docker: | Container unhealthy or restarting |
| [nightly-check] MongoDB: | Replica set problem |
| [nightly-check] Disk: | Disk >= 85% on a server |
| [nightly-check] Prometheus: | Scrape target down or alert firing |
| [nightly-check] n8n: | Workflow execution error |
| [nightly-check] Backup: | Wasabi backup stale |
| [nightly-check] Infisical: | Secret sync problem |
| [nightly-check] Docs: | Documentation compliance gap |
| [nightly-check] Audit: | audit.actions failure |
| [triage] Auto-fixed: | Issue was fixed autonomously (body has PR link) |
| [triage] Human-required: | Telegram sent, awaiting human action |
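For scripts or dashboards that consume these issues, the prefix convention can be parsed mechanically. The helper below is a hypothetical sketch; it assumes titles follow the `[agent] Component: summary` shape shown in the table and returns None for anything else.

```python
import re

TITLE_RE = re.compile(r"^\[(nightly-check|triage)\]\s+([^:]+):\s*(.*)$")


def parse_issue_title(title):
    """Split an agent-generated title into agent, component, and summary."""
    match = TITLE_RE.match(title)
    if match is None:
        return None  # not created by either agent
    return {
        "agent": match.group(1),
        "component": match.group(2).strip(),
        "summary": match.group(3),
    }
```

Only the first colon after the prefix separates component from summary, so a summary that itself contains colons (for example an exception message) is kept intact.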
SSH access used by the nightly agent
The agent uses the standard p24-infra SSH key (id_ed25519) and connects as:
| Server | User | IP |
|---|---|---|
| vps-i1 | root | 217.154.82.162 |
| vps-h1 | root | 72.60.32.61 |
| bms-2 | ubuntu | 145.239.133.104 |
| bms-3 | ubuntu | 51.68.155.224 |
| bms-4 | ubuntu | 54.36.123.110 |
Commands used: docker ps --format json, df -h, mongosh --eval 'JSON.stringify(rs.status())' (via SSH exec — read-only).
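The output of `docker ps --format json` is one JSON object per line, so check 1 can be approximated by the sketch below. The function name is an assumption, and the sketch covers only the container state: restart counts are not part of this output, and plain `docker ps` (without `-a`) omits exited containers, so a full check would need additional commands.

```python
import json


def unhealthy_containers(docker_ps_output):
    """Return (name, state) pairs for containers whose State is not 'running'."""
    problems = []
    for line in docker_ps_output.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in captured SSH output
        container = json.loads(line)
        state = container.get("State")
        if state != "running":
            problems.append((container.get("Names"), state))
    return problems
```

An empty result means every listed container reported the running state at the time of the SSH call.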
Related docs
- monitoring-stack-operations.md — Prometheus/Grafana/Alertmanager
- n8n-operations.md — n8n on bms-4
- alert-response-runbook.md — how to respond to P1 alerts
- servers — per-server operations docs