Nightly Infra Checks + Hourly DevOps Triage

GitHub issue: #600
Status: Active
Registered in: dev_r_services as nightly-infra-check-agent and hourly-devops-triage-agent

Architecture overview

Two independent scheduled cloud agents, both managed via the /schedule skill (CronCreate):

Agent	Cron	Prompt
Nightly infra check	`30 0 * * *` (00:30 UTC)	`.claude/commands/nightly-infra-check.md`
Hourly DevOps triage	`0 * * * *` (every hour)	`.claude/commands/hourly-devops-triage.md`

Neither agent runs as a persistent service. They are fire-and-forget cloud agent runs that terminate when their task is done.

Nightly infra check

What it checks

#	Check	Target	Alert threshold
1	Docker container health	vps-i1, vps-h1, bms-3, bms-4	Any container status != running, or restart count increased
2	MongoDB rs0 member states	bms-2, bms-3, bms-4	Any member not PRIMARY/SECONDARY/ARBITER, or replication lag > 30s
3	Disk usage	All 5 servers	>= 85% on any mount
4	Prometheus scrape targets	vps-i1 Prometheus API	Any active target with `health=down`
5	Prometheus active alerts	vps-i1 Prometheus API	Any alert `state=firing`
6	n8n workflow execution errors	bms-4 n8n API	Any error executions in last 24h
7	Wasabi backup freshness	backup-exporter endpoint	Backup older than 25h
8	Infisical CE health	Infisical health endpoint	Non-200 response
9	Documentation compliance gaps	Supabase `dev_r_services`	`compliance_workbook = 'no'` for active services
10	Audit action failures	Supabase `audit.actions`	`last_run_status = 'failed'` in last 24h
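Checks 4 and 5 amount to filtering two Prometheus HTTP API responses. The sketch below assumes the standard response shapes of `/api/v1/targets` (each active target carries a `health` field) and `/api/v1/alerts` (each alert carries a `state` field). The function names are illustrative and do not come from the agent prompt.

```python
def down_targets(targets_response: dict) -> list:
    """Return the scrape URLs of active targets whose health is "down"."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        t.get("scrapeUrl", "<unknown>")
        for t in active
        if t.get("health") == "down"
    ]


def firing_alerts(alerts_response: dict) -> list:
    """Return the alertname label of every alert in the "firing" state."""
    alerts = alerts_response.get("data", {}).get("alerts", [])
    return [
        a.get("labels", {}).get("alertname", "<unnamed>")
        for a in alerts
        if a.get("state") == "firing"
    ]
```

Either list being non-empty would correspond to one `[nightly-check] Prometheus:` issue per entry, or to a single grouped issue; the source does not say which.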

Issue creation format

For any problem detected:

Title: [nightly-check] {component}: {short description}
Label: bug (infra errors) | human-action (needs manual review) | patch (doc gaps)
Milestone: Triage
Body:
  ## Problem detected by nightly infra check

  **Component:** {component}
  **Severity:** {P1-Critical / P2-Warning / P3-Info}
  **Detected at:** {UTC timestamp}

  ### What was found
  {description}

  ### Raw output

  {relevant snippet}

  ### Suggested action
  {what the agent recommends}
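The template above can be filled mechanically from a structured finding. This is a minimal sketch: the `Finding` class, `render_issue`, and the timestamp format are assumptions made for illustration, since the source only specifies the placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

SEVERITIES = ("P1-Critical", "P2-Warning", "P3-Info")
LABELS = ("bug", "human-action", "patch")


@dataclass
class Finding:
    component: str         # e.g. "Disk"
    summary: str           # short description used in the title
    severity: str          # one of SEVERITIES
    description: str       # text for "What was found"
    raw_output: str        # relevant snippet of command or API output
    suggested_action: str  # what the agent recommends


def render_issue(finding: Finding, detected_at: datetime, label: str) -> dict:
    """Return title, label, milestone and body following the issue template."""
    if finding.severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {finding.severity}")
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    # Assumed timestamp layout; the template only says "{UTC timestamp}".
    stamp = detected_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    body = "\n".join([
        "## Problem detected by nightly infra check",
        "",
        f"**Component:** {finding.component}",
        f"**Severity:** {finding.severity}",
        f"**Detected at:** {stamp}",
        "",
        "### What was found",
        finding.description,
        "",
        "### Raw output",
        "",
        finding.raw_output,
        "",
        "### Suggested action",
        finding.suggested_action,
    ])
    return {
        "title": f"[nightly-check] {finding.component}: {finding.summary}",
        "label": label,
        "milestone": "Triage",
        "body": body,
    }
```

Rejecting unknown severities and labels up front keeps malformed issues out of the Triage milestone, which the hourly agent reads as its input.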

Clean-run summary

When all checks pass, the agent comments on the pinned “nightly health” issue (search for [nightly-health-status] in open issues). If no such issue exists, it creates one.

Comment format:

## Nightly check — {YYYY-MM-DD} UTC

All checks passed.

| Check | Status |
|-------|--------|
| Docker containers (all hosts) | OK |
| MongoDB rs0 | OK |
| Disk usage | OK |
| Prometheus targets | OK |
| Active alerts | OK |
| n8n executions | OK |
| Wasabi backup freshness | OK |
| Infisical CE | OK |
| Documentation compliance | OK |
| Audit action status | OK |

Error notification standard

If the agent itself crashes (unhandled exception):

Discord embed via P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL
GitHub issue: [nightly-check] Agent crash: {error message}, label=bug, milestone=Triage
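A crash handler along these lines would produce both notifications. It is a sketch under assumptions: the embed fields (`title`, `description`, `color`) follow the general Discord webhook payload shape, while the colour value, the `User-Agent` string, and the function names are illustrative choices that the source does not specify.

```python
import json
import os
import urllib.request

WEBHOOK_ENV = "P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL"


def crash_embed(agent: str, exc: BaseException) -> dict:
    """Build a Discord webhook payload with one embed describing the crash."""
    return {
        "embeds": [
            {
                "title": f"{agent} crashed",
                # Truncate defensively; long tracebacks belong in the GitHub issue.
                "description": f"{type(exc).__name__}: {exc}"[:4000],
                "color": 0xE74C3C,  # assumed colour, purely cosmetic
            }
        ]
    }


def crash_issue_title(exc: BaseException) -> str:
    """Title for the GitHub issue filed with label=bug, milestone=Triage."""
    return f"[nightly-check] Agent crash: {exc}"


def post_embed(payload: dict) -> None:
    """POST the payload to the webhook URL taken from the environment."""
    req = urllib.request.Request(
        os.environ[WEBHOOK_ENV],
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "User-Agent": "p24-infra-agent",  # hypothetical identifier
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()
```

In use, the agent's entry point would be wrapped in a `try`/`except Exception as exc` that calls `post_embed(crash_embed("nightly-infra-check", exc))` and then files the issue.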

Hourly DevOps triage

Flow

1. Fetch open GH issues with milestone=Triage (radieu/p24-infra)
2. For each issue:
   a. Read title + body + labels
   b. Classify: auto-fixable or human-required
   c. Auto-fixable → fix → commit → PR → move issue to In Progress
   d. Human-required → add label human-action → post triage comment → Telegram alert

Auto-fixable categories

Category	Action
Docker container down/restarting (simple)	SSH, `docker restart {container}`, verify running
Prometheus config reload needed	`curl -X POST http://localhost:9090/-/reload`
Documentation compliance gap (`compliance_workbook=no`)	Update `dev_r_services` row, set `workbook_url`
Stale credential rotation reminder	Post reminder comment, update `next_due` in `dev_r_services`
Grafana dashboard error (simple config)	Fix config file, reload

Safety rule: the agent auto-fixes an issue only when it is confident, meaning the issue matches both a keyword and a known safe pattern. In every other case it escalates the issue as human-required.
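The safety rule can be read as a conjunction: a keyword hit and a known safe pattern must both be present, and anything else falls through to escalation. The sketch below illustrates that shape only. The keywords, regular expressions, and category names are hypothetical and are not taken from the agent prompt.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SafePattern:
    category: str
    keywords: Tuple[str, ...]  # at least one must appear in the issue text
    pattern: "re.Pattern"      # must also match for the fix to count as safe


# Hypothetical examples for two of the auto-fixable categories.
SAFE_PATTERNS = (
    SafePattern(
        "docker-restart",
        ("docker", "container"),
        re.compile(r"\b(exited|restarting|not running)\b"),
    ),
    SafePattern(
        "docs-compliance",
        ("compliance_workbook",),
        re.compile(r"compliance_workbook\s*=\s*'?no'?"),
    ),
)


def classify(title: str, body: str) -> Tuple[str, Optional[str]]:
    """Return ("auto-fixable", category) or ("human-required", None)."""
    text = f"{title}\n{body}".lower()
    for sp in SAFE_PATTERNS:
        has_keyword = any(k in text for k in sp.keywords)
        if has_keyword and sp.pattern.search(text):
            return "auto-fixable", sp.category
    # Default is escalation: unknown patterns are never auto-fixed.
    return "human-required", None
```

Making escalation the fall-through result means a new, unanticipated error pattern lands in the human-required path without any extra rule.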

Human-required categories

Category	Why human needed
Data loss risk	Any action that could delete or corrupt data
Multi-step infra change	Requires coordination across multiple services
Secret/credential rotation	Must be done by a human with vault access
Production service restart	Potential downtime — needs explicit approval
MongoDB rs0 reconfiguration	Voting/quorum changes have outage risk
Unknown error pattern	Agent cannot safely determine cause
Disk full (> 95%)	Requires human to decide what to delete

Telegram notifications

Bot credentials (already configured from IS-595):

Token: waha-telegram-bot-token (stored in .env.local and Infisical)
Chat ID: WAHA_TELEGRAM_CHAT_ID

Notification format:

P24 Infra — Human action required

Issue #NNN: {title}
Reason: {why human is needed}
Link: https://github.com/radieu/p24-infra/issues/NNN

API call:

POST https://api.telegram.org/bot{TOKEN}/sendMessage
{
  "chat_id": "{WAHA_TELEGRAM_CHAT_ID}",
  "text": "...",
  "parse_mode": "HTML"
}
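The call above can be assembled with the standard library alone. Because the request sets `parse_mode` to `HTML`, characters such as `<`, `>` and `&` in an issue title need escaping before they are sent; the sketch does this with `html.escape`. The function names and the default repository argument are illustrative.

```python
import html
import json
import urllib.request


def notification_text(number: int, title: str, reason: str,
                      repo: str = "radieu/p24-infra") -> str:
    """Render the notification body, escaping user-supplied text for HTML mode."""
    return (
        "P24 Infra \u2014 Human action required\n\n"
        f"Issue #{number}: {html.escape(title, quote=False)}\n"
        f"Reason: {html.escape(reason, quote=False)}\n"
        f"Link: https://github.com/{repo}/issues/{number}"
    )


def build_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the sendMessage request."""
    payload = {"chat_id": chat_id, "text": text, "parse_mode": "HTML"}
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a single `urllib.request.urlopen(build_request(token, chat_id, text), timeout=10)`, with the token and chat ID read from the locations listed above.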

Triage comment format

Posted on each human-required issue:

## Hourly triage — {UTC timestamp}
 
**Classification:** Human action required
**Reason:** {explanation}
**Urgency:** {P1 / P2 / P3}
 
Telegram notification sent to operator.

Error notification standard

Same as nightly check: Discord embed + GH issue on agent crash.

Activating the cron agents

Use the /schedule skill to register both commands:

/schedule nightly-infra-check  cron: 30 0 * * *   prompt-file: .claude/commands/nightly-infra-check.md
/schedule hourly-devops-triage cron: 0 * * * *    prompt-file: .claude/commands/hourly-devops-triage.md

Both agents run as cloud agents, not on any of the servers. Each run opens outbound SSH connections from the cloud agent to the servers it needs to reach.
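To make the two schedules concrete, the following sketch computes the next firing time of each cron expression. It assumes the scheduler evaluates both expressions in UTC, as the 00:30 UTC annotation suggests; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_nightly(now: datetime) -> datetime:
    """Next firing of `30 0 * * *` strictly after `now`."""
    candidate = now.replace(hour=0, minute=30, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def next_hourly(now: datetime) -> datetime:
    """Next firing of `0 * * * *` strictly after `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("next nightly check:", next_nightly(now).isoformat())
    print("next hourly triage:", next_hourly(now).isoformat())
```

Under these schedules, an issue filed by the nightly run at 00:30 is first picked up by the triage run at 01:00.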

Manually triggering either agent

# Nightly check — one-shot
claude --file .claude/commands/nightly-infra-check.md
 
# Hourly triage — one-shot
claude --file .claude/commands/hourly-devops-triage.md

Or from this repo on Windows:

claude --file "C:\code_2026\p24-infra\.claude\commands\nightly-infra-check.md"
claude --file "C:\code_2026\p24-infra\.claude\commands\hourly-devops-triage.md"

Interpreting generated issues

Every issue created by these agents has a title that starts with [nightly-check] or [triage].

Prefix	Meaning
`[nightly-check] Docker:`	Container unhealthy or restarting
`[nightly-check] MongoDB:`	Replica set problem
`[nightly-check] Disk:`	Disk >= 85% on a server
`[nightly-check] Prometheus:`	Scrape target down or alert firing
`[nightly-check] n8n:`	Workflow execution error
`[nightly-check] Backup:`	Wasabi backup stale
`[nightly-check] Infisical:`	Infisical CE health endpoint returned a non-200 response
`[nightly-check] Docs:`	Documentation compliance gap
`[nightly-check] Audit:`	audit.actions failure
`[triage] Auto-fixed:`	Issue was fixed autonomously (body has PR link)
`[triage] Human-required:`	Telegram sent, awaiting human action
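When filtering or reporting on these issues, the prefix convention can be parsed with a single regular expression. This is a sketch that assumes titles always follow the `[agent] Component: detail` layout shown in the table; `parse_title` and `ParsedTitle` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

TITLE_RE = re.compile(r"^\[(nightly-check|triage)\]\s+([^:]+):\s*(.*)$")


class ParsedTitle(NamedTuple):
    agent: str      # "nightly-check" or "triage"
    component: str  # e.g. "Docker", "Disk", "Auto-fixed"
    detail: str     # remainder of the title


def parse_title(title: str) -> Optional[ParsedTitle]:
    """Split an agent-generated issue title; return None for other issues."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    return ParsedTitle(*match.groups())
```

For example, `parse_title("[nightly-check] Disk: bms-3 / at 91%")` yields the agent `nightly-check`, the component `Disk`, and the detail `bms-3 / at 91%`.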

SSH access used by the nightly agent

The agent uses the standard p24-infra SSH key (id_ed25519) and connects as:

Server	User	IP
vps-i1	root	217.154.82.162
vps-h1	root	72.60.32.61
bms-2	ubuntu	145.239.133.104
bms-3	ubuntu	51.68.155.224
bms-4	ubuntu	54.36.123.110

Commands used: docker ps --format json, df -h, mongosh --eval 'JSON.stringify(rs.status())' (via SSH exec — read-only).
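The output of these three commands can be evaluated against the thresholds from the checks table roughly as follows. The sketch assumes that `docker ps --format json` emits one JSON object per line with `Names` and `State` keys, that `df -h` uses the usual six-column layout, and that `JSON.stringify(rs.status())` serialises `optimeDate` as an ISO 8601 string. All function names are illustrative.

```python
import json
from datetime import datetime


def not_running(docker_ps_output: str) -> list:
    """Names of listed containers whose State is not "running"."""
    names = []
    for line in docker_ps_output.splitlines():
        if line.strip():
            container = json.loads(line)
            if container.get("State") != "running":
                names.append(container.get("Names", "<unknown>"))
    return names


def full_mounts(df_output: str, threshold: int = 85) -> list:
    """(mount point, use %) pairs at or above the threshold."""
    hits = []
    for line in df_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 6 and parts[4].endswith("%"):
            used = int(parts[4].rstrip("%"))
            if used >= threshold:
                hits.append((" ".join(parts[5:]), used))
    return hits


def _parse_ts(value: str) -> datetime:
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def unhealthy_members(rs_status_json: str, max_lag_s: float = 30.0) -> list:
    """Members in an unexpected state, plus secondaries lagging the primary."""
    members = json.loads(rs_status_json).get("members", [])
    ok_states = {"PRIMARY", "SECONDARY", "ARBITER"}
    problems = [
        f"{m['name']}: state {m.get('stateStr')}"
        for m in members
        if m.get("stateStr") not in ok_states
    ]
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is not None:
        primary_time = _parse_ts(primary["optimeDate"])
        for m in members:
            if m.get("stateStr") == "SECONDARY":
                lag = (primary_time - _parse_ts(m["optimeDate"])).total_seconds()
                if lag > max_lag_s:
                    problems.append(f"{m['name']}: lag {lag:.0f}s")
    return problems
```

One caveat worth checking against the agent prompt: without `-a`, `docker ps` does not list stopped containers, so an exited container would be absent from the output instead of appearing with a non-running state.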

Related docs

monitoring-stack-operations.md: Prometheus/Grafana/Alertmanager
n8n-operations.md: n8n on bms-4
alert-response-runbook.md: how to respond to P1 alerts
servers: per-server operations docs
