03 — Nightly Operations & MongoDB rs0 Maintenance

Status: Design document — 2026-06-14
Scope: AI-Dev-BMS4 agent setup · nightly operations checklist · MongoDB rs0 maintenance plan
Servers covered: bms-4 (54.36.123.110) · bms-3 (51.68.155.224) · bms-2 (145.239.133.104)
Related docs: p4-ovh-bms-4-ns3101999-operations.md · p4-ovh-bms-3-ns3129867-operations.md · p4-ovh-bms-2-ns3087638-operations.md

Part 1: AI-Dev-BMS4 Agent Design

1. Agent Overview
2. Installation Checklist
3. Issue Pickup Logic
4. Escalation Rules
5. Agent Configuration

Part 2: Nightly Operations

6. Nightly Operations Schedule
7. Tier 1 Critical Service Checks
8. Tier 2 Platform Checks
9. Tier 3 Quality Checks
10. Supabase Maintenance
11. n8n Workflow Health
12. Disk Usage Monitoring
13. Security Nightly Checks

Part 3: MongoDB rs0 Maintenance

14. Replica Set Health Dashboard
15. Regular Maintenance Schedule
16. Maintenance Procedures
17. Backup Strategy
18. Failover Runbook
19. Capacity Planning Guide
20. Alert Definitions

Part 1: AI-Dev-BMS4 Agent Design

1. Agent Overview

Role

AI-Dev-BMS4 is the autonomous Claude Code agent deployed on bms-4 (54.36.123.110). Its primary responsibility is nightly infrastructure issue processing — it picks up GitHub issues created by GitHub Actions health checks, attempts automated remediation, and escalates unresolvable problems to human operators.

Position in the Agent Fleet

Agent	Host	Role	Max Parallel
AI-Dev-IO1	vps-i1 (IONOS)	et-operational-platform issue processing	2–3
AI-Dev-HS1	vps-h1 (Hostinger)	p24-infra issue pipeline + claude-proxy	1–2
AI-Dev-OV1	bms-2 (OVH)	dev/test workloads	4
AI-Dev-BMS4	bms-4 (OVH)	nightly p24-infra ops + MongoDB maintenance	4

AI-Dev-BMS4 is specifically designed for the 02:00–06:00 UTC nightly window when GitHub Actions have generated issues from health checks and infrastructure scans. It runs on the server with the most free RAM (30+ GB free) and disk space (1.7 TB free).

Capabilities

Clone and operate on radieu/p24-infra repository (dedicated clone at /home/claude-runner/p24-infra)
Run diagnostic commands: curl, docker, mongosh, ssh (read-only diagnostics)
Create GitHub issues, add comments, apply labels, and open PRs to dev
Query Prometheus metrics API for service health
Send Discord notifications for immediate alerts
Restart failed Docker containers on bms-4 only (owns its own host)
SSH read-only access to vps-i1, vps-h1, bms-2, bms-3 for diagnostics

Constraints

Never write to production databases or modify MongoDB data
Never restart containers on remote servers (vps-h1, bms-3) — SSH is read-only for diagnostics
Never push directly to main — all changes via PR to dev
Never expose secret values in GitHub issue comments
Always create a recovery path before any action with data loss risk

2. Installation Checklist

bms-4 currently runs as root (OVH bare metal default). The following steps bring it into the same pattern as other agent VPSes.

Step 1 — Create claude-runner user

ssh root@54.36.123.110
 
# Create dedicated user (no password, no sudo by default)
useradd -m -s /bin/bash claude-runner
usermod -aG docker claude-runner   # allow docker commands on this host only
 
# Create SSH directory for agent access
mkdir -p /home/claude-runner/.ssh
chmod 700 /home/claude-runner/.ssh
chown -R claude-runner:claude-runner /home/claude-runner/.ssh
 
# Create .claude directory for credentials
mkdir -p /home/claude-runner/.claude
chown -R claude-runner:claude-runner /home/claude-runner/.claude

Step 2 — Install Claude Code

# Install Node.js 22.x (required by Claude Code)
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt-get install -y nodejs
 
# Install Claude Code globally
npm install -g @anthropic-ai/claude-code
 
# Verify
claude --version
which claude   # expect /usr/bin/claude or /usr/local/bin/claude

Step 3 — Copy OAuth credentials

From local workstation, copy a valid .credentials.json that has both accessToken and refreshToken:

# From local Windows workstation
scp C:\Users\konar\.claude\.credentials.json root@54.36.123.110:/home/claude-runner/.claude/.credentials.json
ssh root@54.36.123.110 "chown claude-runner:claude-runner /home/claude-runner/.claude/.credentials.json && chmod 600 /home/claude-runner/.claude/.credentials.json"

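Verify that Claude Code authenticates. claude --version only confirms the install, so send a one-line prompt in print mode, which fails if the OAuth token is rejected:

su - claude-runner -c 'claude -p "Reply with OK"'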

Step 4 — Create SSH key for remote diagnostics

The agent needs read-only SSH access to other servers for diagnostic commands.

# As root on bms-4 — generate key for claude-runner
su - claude-runner -c "ssh-keygen -t ed25519 -f /home/claude-runner/.ssh/id_bms4_agent -C 'ai-dev-bms4@bms-4' -N ''"
 
# Display public key — copy this to authorized_keys on other servers
cat /home/claude-runner/.ssh/id_bms4_agent.pub

Then on each target server, add the public key to the read-only diagnostic user:

# On vps-i1 (IONOS) — add to claude-admin for diagnostic SSH
ssh root@217.154.82.162 "echo '<bms4_agent_pubkey>' >> /home/claude-admin/.ssh/authorized_keys"
 
# On vps-h1 (Hostinger) — same
ssh root@72.60.32.61 "echo '<bms4_agent_pubkey>' >> /home/claude-admin/.ssh/authorized_keys"
 
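# On bms-3: same
ssh ubuntu@51.68.155.224 "echo '<bms4_agent_pubkey>' >> /home/ubuntu/.ssh/authorized_keys"

# On bms-2: same (user ubuntu, matching the SSH client config for this host)
ssh ubuntu@145.239.133.104 "echo '<bms4_agent_pubkey>' >> /home/ubuntu/.ssh/authorized_keys"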

Configure SSH client to use the correct key per host:

cat > /home/claude-runner/.ssh/config << 'EOF'
Host vps-i1 217.154.82.162
    HostName 217.154.82.162
    User claude-admin
    IdentityFile ~/.ssh/id_bms4_agent
    # accept-new trusts a host key on first contact but rejects a changed key
    StrictHostKeyChecking accept-new
    ConnectTimeout 10

Host vps-h1 72.60.32.61
    HostName 72.60.32.61
    # claude-admin is the account that received the agent public key on this host
    User claude-admin
    IdentityFile ~/.ssh/id_bms4_agent
    StrictHostKeyChecking accept-new
    ConnectTimeout 10

Host bms-3 51.68.155.224
    HostName 51.68.155.224
    User ubuntu
    IdentityFile ~/.ssh/id_bms4_agent
    StrictHostKeyChecking accept-new
    ConnectTimeout 10

Host bms-2 145.239.133.104
    HostName 145.239.133.104
    User ubuntu
    IdentityFile ~/.ssh/id_bms4_agent
    StrictHostKeyChecking accept-new
    ConnectTimeout 10
EOF
chown claude-runner:claude-runner /home/claude-runner/.ssh/config
chmod 600 /home/claude-runner/.ssh/config

Step 5 — Set up GitHub credentials

# Install gh CLI
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg \
  | dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" \
  | tee /etc/apt/sources.list.d/github-cli.list
apt-get update && apt-get install -y gh
 
# Create env file with the GitHub token and the other agent secrets (values come from .env.local)
cat > /home/claude-runner/.claude-env << 'EOF'
export GITHUB_TOKEN=<value from .env.local P24_INFRA_GH_TOKEN>
export GH_TOKEN=$GITHUB_TOKEN
export DISCORD_WEBHOOK_URL=<value from .env.local P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL>
export SUPABASE_URL=<value from .env.local SUPABASE_URL>
export SUPABASE_SERVICE_KEY=<value from .env.local SUPABASE_SERVICE_KEY>
export PROMETHEUS_URL=http://217.154.82.162:9090
EOF
chmod 600 /home/claude-runner/.claude-env
chown claude-runner:claude-runner /home/claude-runner/.claude-env

Never store the actual token values in this document. Populate .claude-env from .env.local on the local workstation.
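
A quick sanity check after populating the file can catch the two most likely mistakes: a placeholder left unreplaced, or permissions wider than 600. The sketch below is illustrative only; check_env_file is a hypothetical helper name, and it assumes GNU stat as shipped on Ubuntu. It never prints the file contents.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Fails if the env file is missing, still contains a "<value from ...>"
# placeholder, or has a mode other than 600. Never echoes secret values.
check_env_file() {
  local file="$1" mode
  [ -f "$file" ] || { echo "missing: $file"; return 1; }
  if grep -q '<value from' "$file"; then
    echo "placeholder left in $file"
    return 1
  fi
  mode=$(stat -c '%a' "$file")   # GNU stat (assumption: Linux host)
  [ "$mode" = "600" ] || { echo "mode is $mode, expected 600"; return 1; }
  echo "ok"
}

# Usage (as root on bms-4):
#   check_env_file /home/claude-runner/.claude-env
```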

Step 6 — Clone p24-infra repository

# As claude-runner — dedicated clone (NOT /opt/p24-infra which is the deployment copy)
su - claude-runner
 
source ~/.claude-env
git clone https://${GITHUB_TOKEN}@github.com/radieu/p24-infra.git ~/p24-infra
cd ~/p24-infra
git checkout dev
git remote set-url origin https://github.com/radieu/p24-infra.git  # remove token from remote URL

The wrapper script injects the token at runtime.

Step 7 — Create wrapper script

cat > /root/bms4-nightly.sh << 'SCRIPT'
#!/usr/bin/env bash
set -euo pipefail
LOG="/var/log/bms4-nightly.log"
echo "=== bms4-nightly START $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$LOG"

# Run as claude-runner; capture the exit code so the END marker is always written
RC=0
runuser -u claude-runner -- bash -c '
  source ~/.claude-env
  cd ~/p24-infra || exit 1
  git remote set-url origin "https://${GITHUB_TOKEN}@github.com/radieu/p24-infra.git"
  git fetch origin dev && git reset --hard origin/dev
  SYNC_RC=$?
  # Strip the token from the remote URL even when the sync failed
  git remote set-url origin "https://github.com/radieu/p24-infra.git"
  [ "$SYNC_RC" -eq 0 ] || { echo "git sync failed"; exit "$SYNC_RC"; }
  claude --dangerously-skip-permissions -p "/process-issues" \
    --allowedTools "Bash,Read,Edit,Write,Glob,Grep,PowerShell"
' >> "$LOG" 2>&1 || RC=$?

echo "=== bms4-nightly END rc=$RC $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$LOG"
exit "$RC"
SCRIPT
chmod +x /root/bms4-nightly.sh

Step 8 — Create cron job

# Nightly at 02:05 UTC (5 min after the GitHub Actions health check creates issues).
# cron uses the system timezone, so this assumes bms-4 is set to UTC.
cat > /etc/cron.d/bms4-nightly << 'EOF'
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
 
5 2 * * * root /root/bms4-nightly.sh >> /var/log/bms4-nightly.log 2>&1
EOF
chmod 644 /etc/cron.d/bms4-nightly

Step 9 — Create CLAUDE.md for the agent

See Section 5 — Agent Configuration for the full CLAUDE.md content.

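# Place the CLAUDE.md in the clone (create .claude/ if missing; keep ownership with claude-runner)
install -d -o claude-runner -g claude-runner /home/claude-runner/p24-infra/.claude
install -o claude-runner -g claude-runner -m 644 \
   /opt/p24-infra/docs/evaluation/bms4-agent-CLAUDE.md \
   /home/claude-runner/p24-infra/.claude/CLAUDE.md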

Step 10 — Register in GitHub as AI-Dev-BMS4

Following the AI runner provisioning procedure from CLAUDE.md:

Create email routing rule in Cloudflare for ai-dev-bms4@zintegrowana.online
Sign up GitHub account with that email
Add as collaborator to radieu/p24-infra with write permission
Create human-action issue: “Confirm GitHub invitation for AI-Dev-BMS4”

Step 11 — Verify

# Test Claude Code auth (a prompt in print mode exercises the OAuth token; --version alone does not)
su - claude-runner -c 'claude -p "Reply with OK" && echo AUTH_OK'
 
# Test GitHub auth
su - claude-runner -c "source ~/.claude-env && gh auth status"
 
# Test repo access
su - claude-runner -c "source ~/.claude-env && cd ~/p24-infra && gh issue list --repo radieu/p24-infra --limit 5"
 
# Test SSH diagnostics
su - claude-runner -c "ssh vps-i1 'docker ps --format table 2>/dev/null | head -5'"
 
# Test Discord notification
su - claude-runner -c "source ~/.claude-env && curl -s -X POST \"\$DISCORD_WEBHOOK_URL\" \
  -H 'Content-Type: application/json' \
  -d '{\"content\":\"AI-Dev-BMS4 setup verified — nightly agent ready\"}'"

3. Issue Pickup Logic

Triage Decision Tree

New GitHub issue in radieu/p24-infra
  │
  ├── Labels include "atrax-stale"
  │     → check n8n workflow status on bms-4
  │     → if n8n healthy, attempt workflow restart via API
  │     → if n8n down, escalate immediately (Tier 1)
  │
  ├── Labels include "server-down" / "infra-check-fail"
  │     → identify failed component from issue title
  │     → run targeted diagnostic (curl, ssh, docker ps)
  │     → if a container restart would fix it and the host is bms-4: restart it
  │     → if the host is remote, or the fix needs human SSH/physical access: escalate
  │
  ├── Labels include "failed-gh-actions"
  │     → check workflow logs via gh API
  │     → if transient (timeout, network): re-trigger workflow
  │     → if code/config bug: open fix PR
  │     → if secret expired / runner down: escalate
  │
  ├── Labels include "security"
  │     → never auto-remediate; always escalate to human
  │     → add "human-action" label immediately
  │
  ├── Labels include "triage" only (no routing label)
  │     → apply /process-issues skill for standard triage
  │
  └── All other issues
        → apply /process-issues skill (Design → In Progress → PR)
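
The routing in the tree can be sketched as a small label matcher. This is an illustration, not the agent's actual implementation: route_issue and its output keywords are hypothetical names. The tree does not say which branch wins when an issue carries several routing labels, so the sketch assumes that security takes precedence because it must never be auto-remediated.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Input:  comma-separated label list, e.g. "triage,server-down"
# Output: one routing keyword (keywords are illustrative)
route_issue() {
  local labels=",$1,"   # pad with commas so only whole labels match
  case "$labels" in
    *,security,*)                          echo "escalate-human" ;;   # assumed precedence
    *,atrax-stale,*)                       echo "check-n8n" ;;
    *,server-down,*|*,infra-check-fail,*)  echo "diagnose-host" ;;
    *,failed-gh-actions,*)                 echo "inspect-workflow" ;;
    *)                                     echo "process-issues" ;;   # "triage" only, or anything else
  esac
}

# Usage:
#   route_issue "triage,infra-check-fail"
```

Padding the list with commas keeps a label such as security-review (hypothetical) from matching the security branch by accident.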

Issue Resolution Flow

1. CLAIM  — gh issue edit #N --add-label "in-progress"
           — add comment: "AI-Dev-BMS4 picking up this issue [timestamp]"

2. DIAGNOSE — run relevant check commands (see Part 2)
            — record findings in a comment

3. ACT
   ├── Resolvable automatically:
   │     — implement fix (code change, config update, container restart)
   │     — open PR to dev
   │     — comment: "Fix in PR #NN"
   │
   └── Not resolvable:
         — add comment: "Diagnostics complete. Root cause: <description>"
         — add label "human-action"
         — send Discord alert (see Section 4)

4. CLOSE (if fully resolved by PR merge, or label human-action and move on)

Capacity Limit

AI-Dev-BMS4 runs up to 4 parallel Claude Code processes during the nightly window (02:00–06:00 UTC). Use Supabase agent_sessions table with worker_env = 'bms4' to prevent over-claiming:

# Check active sessions before spawning
ACTIVE=$(curl -sf \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  "$SUPABASE_URL/rest/v1/agent_sessions?status=eq.active&worker_env=eq.bms4&select=count" \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(d[0]['count'])")
if [ "$ACTIVE" -ge 4 ]; then
  echo "At capacity ($ACTIVE/4 agents). Backing off."
  exit 0
fi
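
If the Supabase request fails, ACTIVE ends up empty and the numeric comparison errors out instead of backing off. One way to make the decision explicit is a helper that fails closed. This is a sketch: slots_available is a hypothetical name, and treating an unreadable count as "at capacity" is an assumption about the desired behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Prints how many more agents may be started given the active-session count.
# A missing or non-numeric count is treated as "at capacity" (assumption).
MAX_AGENTS=4

slots_available() {
  local active="$1" free
  case "$active" in
    ''|*[!0-9]*) echo 0; return 1 ;;        # empty or non-numeric: fail closed
  esac
  free=$(( MAX_AGENTS - 10#$active ))       # 10# avoids octal parsing of e.g. "08"
  if [ "$free" -gt 0 ]; then echo "$free"; else echo 0; fi
}

# Usage:
#   SLOTS=$(slots_available "$ACTIVE") || echo "Could not read agent_sessions; not spawning."
```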

4. Escalation Rules

Escalation Triggers

Condition	Response window	Action
Tier 1 service down (WAHA, Atrax GPS, Pinbox24 API, Supabase)	Immediate	Discord + GitHub issue comment + `human-action` label
MongoDB rs0 has no PRIMARY	Immediate	Discord + GitHub issue
bms-1 disk > 95%	Immediate	Discord + GitHub issue
Security issue (CVE, auth anomaly, credential expiry)	Immediate	Discord + GitHub issue + `human-action`
Issue cannot be diagnosed (SSH unreachable, auth failure)	15 min	Discord warning + GitHub comment
Auto-fix attempted but failed twice	30 min	Discord + `human-action` label
Agent OAuth expired	N/A	System cannot self-notify — GitHub Actions health-check backstop handles this
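
The "failed twice" rule can be expressed as a wrapper that runs a fix command at most two times and signals the caller to escalate afterwards. A minimal sketch; attempt_fix is a hypothetical name and the messages are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Runs the given command up to two times.
# Returns 0 as soon as it succeeds, 1 after the second failure.
attempt_fix() {
  local attempt
  for attempt in 1 2; do
    if "$@"; then
      echo "fixed on attempt $attempt"
      return 0
    fi
  done
  echo "failed twice: escalate"
  return 1
}

# Usage (on bms-4 only, where container restarts are allowed):
#   attempt_fix docker restart root-n8n-1 || echo "add human-action label and alert Discord"
```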

Discord Notification Format

# Discord alert function — call this from within the agent or nightly script
send_discord_alert() {
  local SEVERITY="$1"   # "CRITICAL" | "WARNING" | "INFO"
  local TITLE="$2"
  local DESCRIPTION="$3"
  local ISSUE_URL="${4:-}"
  local COLOR PAYLOAD

  case "$SEVERITY" in
    CRITICAL) COLOR=15158332 ;;  # red
    WARNING)  COLOR=16776960 ;;  # yellow
    INFO)     COLOR=3066993  ;;  # green
    *)        COLOR=9807270  ;;  # grey
  esac

  # Build the JSON with jq so quotes and newlines are escaped correctly;
  # the url field is left out when no issue URL was passed
  PAYLOAD=$(jq -nc \
    --arg title "[$SEVERITY] $TITLE" \
    --arg desc "$DESCRIPTION" \
    --arg url "$ISSUE_URL" \
    --argjson color "$COLOR" \
    '{embeds: [{title: $title, description: $desc, color: $color}
               + (if $url == "" then {} else {url: $url} end)]}')

  # Alerting is best effort: a failed webhook call must not abort the caller
  curl -s -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" || true
}
 
# Usage examples ($'...' quoting turns \n into a real newline; inside plain double quotes it is sent as a literal backslash-n):
send_discord_alert "CRITICAL" \
  "AI-Dev-BMS4: WAHA session FAILED" \
  $'WhatsApp gateway session not WORKING. GPS incidents cannot be received. Manual restart required.\nHost: waha2.vps-h1.infra.zintegrowana.online' \
  "https://github.com/radieu/p24-infra/issues/123"

send_discord_alert "WARNING" \
  "AI-Dev-BMS4: bms-3 RAM at 87%" \
  $'MongoDB on bms-3 consuming 21.7 GB. Available RAM below safe threshold.\nConsider planned mongod restart during low-traffic window.'

GitHub Issue Labeling on Escalation

# When escalating an issue to human:
gh issue edit "$ISSUE_NUMBER" \
  --repo radieu/p24-infra \
  --add-label "human-action"
 
# The heredoc delimiter is unquoted so that $(date ...) in the body is expanded
gh issue comment "$ISSUE_NUMBER" \
  --repo radieu/p24-infra \
  --body "$(cat << EOF
## AI-Dev-BMS4 Escalation Report
 
**Timestamp:** $(date -u +%Y-%m-%dT%H:%M:%SZ)
**Agent:** AI-Dev-BMS4 (bms-4, 54.36.123.110)
**Reason for escalation:** <specific reason>
 
### Diagnostics performed
<list of commands run and their outputs>
 
### Root cause assessment
<what the agent determined>
 
### Recommended human action
<specific steps for the human operator>
 
### Cannot proceed because
<specific blocker — e.g., "Requires MongoDB admin password", "Requires physical server access">
EOF
)"

5. Agent Configuration

CLAUDE.md for bms-4 agent

Place at /home/claude-runner/p24-infra/.claude/CLAUDE.md (overrides repo-level for this agent instance):

# CLAUDE.md — AI-Dev-BMS4 Nightly Agent
 
## Agent Identity
- **Label:** AI-Dev-BMS4
- **Host:** bms-4 (54.36.123.110, Ubuntu 22.04)
- **Role:** Nightly p24-infra issue processing + MongoDB maintenance
- **Max parallel agents:** 4
- **Active window:** 02:00–06:00 UTC
 
## Primary Tasks
1. Pick up GitHub issues in `radieu/p24-infra` with labels: `triage`, `server-down`,
   `infra-check-fail`, `atrax-stale`, `failed-gh-actions`
2. Run /process-issues skill for triage/design/implementation
3. MongoDB rs0 health check commands (read-only mongosh)
4. Disk usage checks on all servers via SSH
5. Escalate unsolvable issues via Discord + human-action label
 
## Permissions
- Docker commands: ALLOWED on bms-4 only (this server)
- SSH diagnostics: ALLOWED to vps-i1, vps-h1, bms-2, bms-3 (read-only)
- GitHub PRs: ALWAYS target `dev` branch
- MongoDB: READ-ONLY — `rs.status()`, `db.serverStatus()`, profiler queries only
- NEVER: write to MongoDB, restart containers on remote servers, push to main
 
## Environment
Source `~/.claude-env` before any command requiring GITHUB_TOKEN or DISCORD_WEBHOOK_URL.
 
## Error Reporting
All errors → Discord via DISCORD_WEBHOOK_URL + GitHub issue comment

.env.local secrets required on bms-4

The agent’s .claude-env (at /home/claude-runner/.claude-env) must contain:

Variable	Source in .env.local	Purpose
`GITHUB_TOKEN`	`P24_INFRA_GH_TOKEN`	gh CLI authentication
`GH_TOKEN`	same	alias for gh
`DISCORD_WEBHOOK_URL`	`P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL`	alerts
`SUPABASE_URL`	`SUPABASE_URL`	agent_sessions coordination
`SUPABASE_SERVICE_KEY`	`SUPABASE_SERVICE_KEY`	insert/read agent_sessions
`PROMETHEUS_URL`	`http://217.154.82.162:9090`	metrics queries

The MongoDB admin password is NOT stored on bms-4 — the agent runs only read-only mongosh commands using the keyFile (internal cluster auth), not admin credentials.

systemd service file (alternative to cron)

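For more reliable execution than cron (Persistent=true catches up a run that was missed while the server was down). Use either the cron job from Step 8 or this timer, not both, otherwise the script runs twice each night: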

# /etc/systemd/system/bms4-nightly.service
[Unit]
Description=AI-Dev-BMS4 Nightly p24-infra Issue Processing
After=network.target
 
[Service]
Type=oneshot
User=root
ExecStart=/root/bms4-nightly.sh
StandardOutput=append:/var/log/bms4-nightly.log
StandardError=append:/var/log/bms4-nightly.log
TimeoutStartSec=3600

# /etc/systemd/system/bms4-nightly.timer
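[Unit]
Description=AI-Dev-BMS4 Nightly Timer
# No Requires=bms4-nightly.service: that would start the service as soon as the timer is started.
# A timer activates the service with the same name by default.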
 
[Timer]
OnCalendar=*-*-* 02:05:00 UTC
Persistent=true
 
[Install]
WantedBy=timers.target

systemctl daemon-reload
systemctl enable bms4-nightly.timer
systemctl start bms4-nightly.timer
systemctl list-timers bms4-nightly.timer

Part 2: Nightly Operations

6. Nightly Operations Schedule

All times UTC. AI-Dev-BMS4 orchestrates the sequence starting at 02:05 UTC after the initial GitHub Actions health checks at 02:00.

Time (UTC)	Task	Responsible	Method
02:00	Health check — all Tier 1 services	GitHub Actions `health-check.yml` (every 2h)	GH Actions on ionos runner
02:00	Atrax GPS freshness check	GitHub Actions `atrax-data-freshness.yml` (every 10 min)	GH Actions on ionos runner
02:05	AI-Dev-BMS4 wakes up — picks up any open issues	AI-Dev-BMS4 cron	bms-4
02:10	Disk usage check — all 5 servers	AI-Dev-BMS4	SSH + df -h
02:30	MongoDB rs0 health check + replication metrics	AI-Dev-BMS4	mongosh on bms-3 via SSH
02:30	DB maintenance VACUUM (Mon–Sat)	GitHub Actions `db-maintenance.yml`	GH Actions on ionos runner
03:00	Docker container audit — all hosts	AI-Dev-BMS4	SSH + docker ps
03:00	DB weekly maintenance (Sunday only — REINDEX)	GitHub Actions `db-maintenance.yml`	GH Actions on ionos runner
03:30	SSL certificate expiry check	AI-Dev-BMS4	curl + openssl
04:00	Trivy CVE scan	GitHub Actions `trivy-scan.yml`	GH Actions on ionos runner
04:00	n8n SQLite maintenance (Sunday only)	GitHub Actions `n8n-maintenance.yml`	GH Actions on ionos runner
04:30	Security log review (fail2ban, auth attempts)	AI-Dev-BMS4	SSH + log analysis
05:00	Morning summary report — Discord	AI-Dev-BMS4	Discord webhook
05:00	p24-infra issue pipeline (triage + design)	IONOS cron `p24-infra-nightly.sh`	vps-i1
05:30	Close resolved issues, tag remaining	AI-Dev-BMS4	gh CLI

Time Budget (bms-4 resources, 32 GB RAM)

Phase	Duration	Claude agents	RAM usage
Issue pickup + triage	02:05–03:30	1–4	1.4–5.6 GB
MongoDB + disk checks	02:10–02:40	0 (scripts)	<100 MB
Security + SSL	03:30–04:30	1–2	1.4–2.8 GB
Report generation	04:30–05:30	1	700 MB

Total RAM budget: the peak of about 5.6 GB with 4 agents is well within the roughly 30 GB free on bms-4.

7. Tier 1 Critical Service Checks

health-check.yml runs these checks every 2 hours (including 02:00 UTC); the Atrax freshness check runs every 10 minutes. AI-Dev-BMS4 supplements them with deeper diagnostics when an alert is raised.

7.1 Atrax GPS Sync (n8n workflow)

What it is: n8n workflow ID AJ1px9uHIfbsriof syncs GPS tracking data from Atrax API to Supabase p24_gps_current_state table every 5 minutes.

Check command (run by GitHub Actions every 10 min):

# Check freshness: stale if last sync > 10 minutes ago
RESPONSE=$(curl -sf \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  "$SUPABASE_URL/rest/v1/p24_gps_current_state?select=n8n_synced_at&order=n8n_synced_at.desc&limit=1")
LAST=$(echo "$RESPONSE" | jq -r '.[0].n8n_synced_at // empty')
# An empty timestamp would make date -d "" resolve to midnight today, so treat it as a failure
if [ -z "$LAST" ]; then
  echo "FAIL: no n8n_synced_at returned"
else
  AGE=$(( $(date -u +%s) - $(date -u -d "$LAST" +%s) ))
  [ "$AGE" -lt 600 ] && echo "OK: ${AGE}s" || echo "STALE: ${AGE}s"
fi

Success criterion: n8n_synced_at within last 600 seconds (10 min)

Failure action (AI-Dev-BMS4):

# 1. Check n8n container health. Post-migration the container is local to bms-4, so no SSH is needed
#    (pre-migration, run the same command via ssh on vps-h1)
docker inspect root-n8n-1 --format '{{.State.Status}}'
 
# 2. Check the last executions via the n8n public API (the executions endpoint is filtered by workflowId)
curl -s -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?workflowId=AJ1px9uHIfbsriof&limit=3" \
  | jq '.data[] | {finished, status, startedAt}'
 
# 3. If container down: alert immediately (cannot auto-restart — requires n8n workflow logic)
# 4. If workflow stuck: post GitHub issue with label "atrax-stale"

SLA: Data must not be stale > 30 minutes. Page immediately at 30 min stale.
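
The two limits (600 s for the success criterion, 1800 s for the SLA) give three states. A minimal sketch of that mapping; freshness_status and the state names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Maps the sync age in seconds to a status:
#   under 600 s        -> OK
#   600 s to 1799 s    -> STALE (open an atrax-stale issue)
#   1800 s and above   -> PAGE  (SLA breached, page immediately)
freshness_status() {
  local age="$1"
  if   [ "$age" -lt 600 ];  then echo "OK"
  elif [ "$age" -lt 1800 ]; then echo "STALE"
  else                           echo "PAGE"
  fi
}

# Usage:
#   freshness_status "$AGE"
```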

7.2 Docker Daemon Health — bms-1 and bms-3

What it is: Pinbox24 production (bms-1) and staging (bms-3) run on Docker. Daemon down = all containers down.

Check commands:

# bms-1 (Pinbox24 production) — SSH as root
ssh -i ~/.ssh/id_bms4_agent -o ConnectTimeout=10 root@94.23.26.113 \
  "systemctl is-active docker && docker ps --format '{{.Names}}' | wc -l"
 
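# bms-3 (Pinbox24 staging + MongoDB primary)
# docker ps lists only running containers, so count the lines (grep 'Up' would never match a name-only format)
ssh ubuntu@51.68.155.224 \
  "systemctl is-active docker && docker ps --format '{{.Names}}' | wc -l"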

Success criterion: docker service = active, container count matches expected (bms-1: ~24, bms-3: ~11)

Failure action: Immediate Discord CRITICAL alert. Cannot auto-remediate — escalate with human-action label.

7.3 WAHA WhatsApp Gateway

What it is: WAHA container on vps-h1 receives WhatsApp messages for incident management. Session must be in WORKING state (not just container running).

Check command:

# Server liveness + session state (both required)
SERVER=$(curl -s -o /dev/null -w "%{http_code}" --max-time 15 \
  -H "X-Api-Key: $WAHA_API_KEY" \
  "https://waha2.vps-h1.infra.zintegrowana.online/api/server/status")
 
SESS_STATUS=$(curl -s --max-time 15 \
  -H "X-Api-Key: $WAHA_API_KEY" \
  "https://waha2.vps-h1.infra.zintegrowana.online/api/sessions/default" \
  | python3 -c 'import sys,json; print(json.load(sys.stdin).get("status","?"))' 2>/dev/null)
 
echo "server=$SERVER session=$SESS_STATUS"
[ "$SERVER" = "200" ] && [ "$SESS_STATUS" = "WORKING" ] && echo "OK" || echo "FAIL"

Success criterion: HTTP 200 AND session status = WORKING

Important: Server returning HTTP 200 does NOT mean session is healthy (proven during 2026-05-23 blackout). Always check session status separately.

Failure action:

If container down: check ssh root@72.60.32.61 "docker ps | grep waha"
If session FAILED/STOPPED: attempt session restart via WAHA API: POST /api/sessions/default/restart
If restart fails: trigger waha-session-restart.yml via gh workflow run
If persistent: escalate with human-action — may require re-scan of QR code

7.4 MongoDB rs0 Replica Set Health

What it is: Three-member replica set (bms-3 PRIMARY, bms-2 SECONDARY observer, bms-4 ARBITER). Requires at least 2 of 3 members for election quorum.

Check command:

# Run via SSH to bms-3 (which has cluster auth access)
# Uses keyFile internal auth — no admin password needed for rs.status()
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const s = rs.status();
  const members = s.members.map(m => ({name: m.name, state: m.stateStr, health: m.health}));
  printjson({set: s.set, ok: s.ok, members: members});
'"

Success criteria:

Exactly 1 member in PRIMARY state
health: 1 for all members
Replication lag on bms-2 < 60 seconds
No member in RECOVERING, DOWN, or UNKNOWN state
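
The check command prints state and health but does not measure replication lag. One possible way to evaluate the 60-second criterion is to compare each SECONDARY's optimeDate with the PRIMARY's, as sketched below. Both function names are hypothetical, and the sketch assumes the same SSH access to bms-3 as the check command. Nothing runs until the functions are called.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository).

# Prints "<member> <lag_seconds>" for every SECONDARY, or "NO_PRIMARY".
# Needs SSH access to bms-3; read-only (rs.status() only).
fetch_secondary_lag() {
  ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
    const s = rs.status();
    const p = s.members.find(m => m.stateStr === \"PRIMARY\");
    if (!p) { print(\"NO_PRIMARY\"); quit(1); }
    s.members.filter(m => m.stateStr === \"SECONDARY\").forEach(m =>
      print(m.name, Math.round((p.optimeDate - m.optimeDate) / 1000)));
  '"
}

# Reads "<member> <lag_seconds>" lines on stdin.
# Returns non-zero if there is no PRIMARY or any lag is 60 s or more.
check_lag() {
  local name lag rc=0
  while read -r name lag; do
    if [ "$name" = "NO_PRIMARY" ]; then
      echo "CRITICAL: no PRIMARY"
      rc=1
      continue
    fi
    if [ "$lag" -lt 60 ]; then
      echo "OK: $name lag ${lag}s"
    else
      echo "LAG: $name lag ${lag}s"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   fetch_secondary_lag | check_lag
```

The arbiter holds no data, so it is excluded by filtering on SECONDARY.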

Failure action: See Section 18 — Failover Runbook

7.5 Pinbox24 Production API

What it is: The production fleet management API endpoint.

Check command:

STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  --connect-timeout 10 --max-time 20 \
  "https://api.w4.pinbox24.com/api/")
[ "$STATUS" = "200" ] || [ "$STATUS" = "401" ] \
  && echo "API reachable (HTTP $STATUS)" \
  || echo "API FAIL (HTTP $STATUS)"

Success criterion: HTTP 200 or 401 (401 = API running but unauthenticated, which is correct)

Failure action: Immediate Discord CRITICAL. Check bms-1 Docker daemon. Cannot auto-remediate.

7.6 Supabase Connectivity and Queue Depths

What it is: Supabase is the primary database and coordination hub. Queue depth metrics reveal processing backlogs.

Check command:

# Connectivity
HTTP=$(curl -s -o /dev/null -w "%{http_code}" \
  "$SUPABASE_URL/rest/v1/agent_sessions?select=count&limit=0" \
  -H "apikey: $SUPABASE_SERVICE_KEY" --max-time 10)
echo "Supabase HTTP: $HTTP"
 
# Queue depths — check for backlogs
curl -sf \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  "$SUPABASE_URL/rest/v1/pending_transcriptions?select=count" \
  | jq '.[0].count // 0' | xargs -I{} echo "pending_transcriptions: {}"
 
curl -sf \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  "$SUPABASE_URL/rest/v1/pending_pdf_processing?select=count" \
  | jq '.[0].count // 0' | xargs -I{} echo "pending_pdf_processing: {}"

Thresholds: pending_transcriptions > 100 or pending_pdf_processing > 50 = warning alert.
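
The check command prints the depths but leaves the comparison to the reader. A small sketch that applies the two thresholds; queue_status is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Compares a queue depth with its warning threshold:
#   pending_transcriptions > 100, pending_pdf_processing > 50
queue_status() {
  local queue="$1" depth="$2" limit
  case "$queue" in
    pending_transcriptions) limit=100 ;;
    pending_pdf_processing) limit=50 ;;
    *) echo "UNKNOWN: $queue"; return 2 ;;
  esac
  if [ "$depth" -gt "$limit" ]; then
    echo "WARNING: $queue=$depth (limit $limit)"
    return 1
  fi
  echo "OK: $queue=$depth"
}

# Usage:
#   queue_status pending_transcriptions 137
```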

7.7 Disk Usage — Critical Servers

What it is: bms-1 is already at 100% disk capacity. All servers need monitoring.

Check commands:

# bms-1 (Pinbox24 production) — CRITICAL (already at 100%)
ssh root@94.23.26.113 "df -h / | tail -1"
 
# bms-3 (staging + MongoDB primary)
ssh ubuntu@51.68.155.224 "df -h / | tail -1"
 
# vps-i1 (monitoring)
ssh claude-admin@217.154.82.162 "df -h / | tail -1"
 
# vps-h1 (n8n + WAHA)
ssh root@72.60.32.61 "df -h / | tail -1"
 
# bms-4 (self — this server)
df -h / | tail -1

Thresholds:

Server	Warning	Critical	Action
bms-1	90%	95%	Immediate escalation — disk already full
bms-3	70%	80%	Docker prune + old log cleanup
vps-i1	80%	90%	Docker image prune
vps-h1	75%	85%	Docker prune
bms-4	60%	75%	n8n data + Docker prune
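
The table can be applied mechanically once the usage percentage is known. In the sketch below, disk_status is a hypothetical name, and treating a value equal to the threshold as already reached (greater-or-equal) is an assumption, since the table does not say whether the limits are inclusive.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Usage: disk_status <server> <used-percent, with or without trailing %>
# Thresholds are copied from the table; ">=" is an assumption.
disk_status() {
  local server="$1" pct="${2%\%}" warn crit
  case "$server" in
    bms-1)  warn=90 crit=95 ;;
    bms-3)  warn=70 crit=80 ;;
    vps-i1) warn=80 crit=90 ;;
    vps-h1) warn=75 crit=85 ;;
    bms-4)  warn=60 crit=75 ;;
    *) echo "UNKNOWN"; return 2 ;;
  esac
  if   [ "$pct" -ge "$crit" ]; then echo "CRITICAL"
  elif [ "$pct" -ge "$warn" ]; then echo "WARNING"
  else                              echo "OK"
  fi
}

# Usage on the local host (GNU df):
#   disk_status bms-4 "$(df --output=pcent / | tail -1 | tr -d ' ')"
```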

Cleanup commands for bms-3 (auto-applicable):

ssh ubuntu@51.68.155.224 "
  # Remove dangling Docker images (images used by containers are kept)
  docker image prune -f
  # Remove stopped containers
  docker container prune -f
  # Cap oversized logs at 100 MB (truncate keeps the first 100 MB of each file; needs write access to /var/log)
  find /var/log -name '*.log' -size +100M -exec truncate -s 100M {} \;
"

8. Tier 2 Platform Checks

AI-Dev-BMS4 runs these checks nightly at 03:00–04:00 UTC. A failure opens a 30-minute response window.

8.1 et-operational-platform Vercel Health

# Check production deployment (curl itself prints 000 when no HTTP response is received,
# so an extra fallback echo would produce "000000")
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  --max-time 15 "${VERCEL_ETOP_PROD_URL}/api/health" 2>/dev/null)
echo "Vercel production: HTTP $STATUS"
 
# Check for recent failed deployments via Vercel API (createdAt is in milliseconds; todate expects seconds)
curl -sf "https://api.vercel.com/v6/deployments?limit=5&target=production" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  | jq '.deployments[] | {url, state, created: (.createdAt / 1000 | floor | todate)}'

Thresholds: Any ERROR state deployment in last 24h → warning issue.

8.2 Grafana + Prometheus on vps-i1

# Grafana
curl -sf --max-time 10 \
  "https://grafana.vps-i1.infra.zintegrowana.online/api/health" \
  | jq .database
 
# Prometheus
curl -sf --max-time 10 \
  "http://217.154.82.162:9090/-/healthy" && echo "Prometheus healthy"
 
# Check Prometheus targets (any DOWN?)
curl -sf "http://217.154.82.162:9090/api/v1/targets" \
  | jq '[.data.activeTargets[] | select(.health != "up")] | length' \
  | xargs -I{} echo "Unhealthy targets: {}"

8.3 n8n Workflow Execution Failures

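# Count failed executions started in the last 24h (the request has no time filter, so filter on startedAt)
SINCE=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
curl -sf \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?status=error&limit=100" \
  | jq --arg since "$SINCE" '[.data[] | select(.startedAt >= $since)] | length' \
  | xargs -I{} echo "Failed executions (24h): {}"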
 
# Get workflow names of failures
curl -sf \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?status=error&limit=10" \
  | jq '.data[] | {workflow: .workflowData.name, startedAt}'

Threshold: > 5 failed executions in 24h → GitHub issue with label n8n-errors.

8.4 SSL Certificate Expiry

# Check all public-facing domains — warn if < 14 days
DOMAINS=(
  "grafana.vps-i1.infra.zintegrowana.online"
  "n8n.bms-4.infra.zintegrowana.online"
  "waha2.vps-h1.infra.zintegrowana.online"
  "traccar.vps-i1.infra.zintegrowana.online"
)
 
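for domain in "${DOMAINS[@]}"; do
  EXPIRY=$(echo | openssl s_client -servername "$domain" -connect "${domain}:443" 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null \
    | cut -d= -f2)
  # Skip hosts whose certificate could not be read; date -d "" would otherwise yield a bogus value
  if [ -z "$EXPIRY" ]; then
    echo "UNKNOWN: could not read certificate for $domain"
    continue
  fi
  DAYS=$(( ( $(date -d "$EXPIRY" +%s) - $(date +%s) ) / 86400 ))
  if [ "$DAYS" -lt 7 ]; then
    echo "CRITICAL: $domain cert expires in $DAYS days"
  elif [ "$DAYS" -lt 14 ]; then
    echo "WARNING: $domain cert expires in $DAYS days"
  else
    echo "OK: $domain expires in $DAYS days"
  fi
done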

Threshold: < 14 days → warning issue. < 7 days → CRITICAL, immediate Discord alert.

8.5 Memory Usage — bms-3 MongoDB Risk

# bms-3 is most at risk: MongoDB using 21.7 GB of 32 GB total
ssh ubuntu@51.68.155.224 "
  echo '--- Memory ---'
  free -h
  echo '--- MongoDB process ---'
  ps aux --sort=-%mem | grep mongod | head -3
  echo '--- Docker containers ---'
  docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}' | head -15
"

Threshold: Available memory < 4 GB on bms-3 → WARNING (MongoDB OOM risk).
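
The 4 GB threshold can be checked by reading the "available" column of free -m. A sketch under the assumption that the remote host uses the procps-ng column layout (available is the seventh field of the Mem: line); bms3_memory_status is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Reads `free -m` output on stdin and applies the 4 GB (4096 MB) threshold.
# Assumes the procps-ng layout: Mem: total used free shared buff/cache available
bms3_memory_status() {
  local avail
  avail=$(awk '/^Mem:/ {print $7}')
  [ -n "$avail" ] || { echo "UNKNOWN"; return 2; }
  if [ "$avail" -lt 4096 ]; then
    echo "WARNING: ${avail} MB available"
    return 1
  fi
  echo "OK: ${avail} MB available"
}

# Usage:
#   ssh ubuntu@51.68.155.224 "free -m" | bms3_memory_status
```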

8.6 Traefik Routing Health on bms-4

# Check Traefik dashboard API (not public). Runs locally: the agent lives on bms-4 and claude-runner is in the docker group
docker exec root-traefik-1 wget -qO- http://localhost:8080/api/rawdata 2>/dev/null \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('routers:', len(d.get('routers', {})))"
docker inspect root-traefik-1 --format '{{.State.Status}}'

9. Tier 3 Quality Checks

These run nightly and generate items for the morning report. No immediate alerting — create GitHub issues for tracking.

9.1 Supabase Slow Query Report

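# Top 10 slowest queries by mean execution time (same connection settings as the maintenance commands)
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \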
  -c "SELECT
    LEFT(query, 80) AS query_snippet,
    calls,
    ROUND(mean_exec_time::numeric, 2) AS mean_ms,
    ROUND(total_exec_time::numeric, 2) AS total_ms
  FROM pg_stat_statements
  WHERE mean_exec_time > 100
  ORDER BY mean_exec_time DESC
  LIMIT 10;"

9.2 Docker Image CVE Summary

Handled by trivy-scan.yml at 04:00 UTC. AI-Dev-BMS4 checks for open Trivy issues:

gh issue list --repo radieu/p24-infra --label security --state open \
  --json number,title,createdAt | jq '.[] | select(.title | contains("Trivy"))'

9.3 Backup Freshness Verification

# Check Wasabi backup status JSON (written by backup-exporter)
curl -sf "http://217.154.82.162:9220/metrics" \
  | grep "backup_last_successful_timestamp" \
  | awk '{print $1, strftime("%Y-%m-%d %H:%M", $2)}'
 
# Check n8n backup freshness (last Sunday)
python3 - << 'EOF'
import boto3, certifi, datetime, os
s3 = boto3.client("s3",
    endpoint_url="https://s3.eu-central-2.wasabisys.com",
    region_name="eu-central-2",
    verify=certifi.where()
)
objs = s3.list_objects_v2(Bucket="p24-infra", Prefix="n8n/")
if objs.get("Contents"):
    latest = max(objs["Contents"], key=lambda x: x["LastModified"])
    age = (datetime.datetime.now(datetime.timezone.utc) - latest["LastModified"]).days
    print(f"n8n backup: {latest['Key']} ({age} days ago)")
EOF

9.4 GitHub Actions CI/CD Pipeline Health

# Check failed workflow runs in the last 24h (--created needs a ">=" range; a bare timestamp matches only that instant)
gh run list --repo radieu/p24-infra \
  --status failure \
  --created ">=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --json workflowName,createdAt,url \
  | jq '.[] | {workflow: .workflowName, time: .createdAt, url}'

9.5 MongoDB Slow Query Log

# Check the MongoDB profiler on bms-3 for operations slower than 100 ms (last 24h).
# system.profile is a per-database collection, so each user database is queried in turn.
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const since = new Date(Date.now() - 86400000);
  db.adminCommand({ listDatabases: 1 }).databases
    .filter(d => [\"admin\", \"local\", \"config\"].indexOf(d.name) === -1)
    .forEach(d => {
      const pdb = db.getSiblingDB(d.name);
      // Profiler level: 0 = off, 1 = slow ops only, 2 = all ops
      print(d.name, \"profiler:\", JSON.stringify(pdb.getProfilingStatus()));
      pdb.system.profile.find(
        { millis: { \$gte: 100 }, ts: { \$gte: since } },
        { ns: 1, millis: 1, ts: 1, op: 1 }
      ).sort({ millis: -1 }).limit(10).forEach(printjson);
    });
'"

10. Supabase Maintenance

Nightly VACUUM (02:30 UTC, Mon–Sat)

Handled by db-maintenance.yml. This workflow connects via the Supabase pooler (IPv4-compatible) on port 5432 (session mode — required for VACUUM).

# Command reference for manual execution:
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "VACUUM ANALYZE;"

Weekly REINDEX (Sunday 03:00 UTC)

# REINDEX CONCURRENTLY — does not block reads/writes
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "REINDEX DATABASE CONCURRENTLY postgres;"

Monthly Stats Reset

# Reset pg_stat_statements counters monthly to avoid stale data
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "SELECT pg_stat_statements_reset();"

Table Bloat Check

# Note: this reports the 20 largest tables by size, not dead-tuple bloat directly.
# Total size minus heap size covers indexes plus TOAST data.
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" << 'SQL'
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS total_size,
  pg_size_pretty(pg_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS table_size,
  pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)
    - pg_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS index_and_toast_size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC
LIMIT 20;
SQL

11. n8n Workflow Health

Key Workflows to Monitor

Workflow	ID	Schedule	Criticality
Atrax GPS Sync	`AJ1px9uHIfbsriof`	Every 5 min	Tier 1
WAHA incident routing	(varies)	Event-triggered	Tier 1
Daily GPS report	(varies)	06:00 UTC	Tier 2
Supabase queue drain	(varies)	Continuous	Tier 2

n8n Health Check Script

N8N_BASE="https://n8n.bms-4.infra.zintegrowana.online"
 
# Container health
ssh root@54.36.123.110 "docker inspect --format '{{.State.Health.Status}}' root-n8n-1"
 
# API health endpoint
curl -sf "${N8N_BASE}/healthz" && echo "n8n healthy"
 
# Active workflows count
curl -sf -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/workflows?active=true" \
  | jq '.data | length' | xargs -I{} echo "Active workflows: {}"
 
# Recent failures (list responses carry workflowId; workflowData is only returned with includeData=true)
curl -sf -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/executions?status=error&limit=5" \
  | jq '.data[] | {id, workflowId, startedAt, stoppedAt}'

Atrax Workflow Restart Procedure

If Atrax GPS sync is stale but n8n is healthy:

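# Make sure the Atrax GPS sync workflow is active. Activation registers the schedule
# trigger; it does not start an execution by itself.
# N8N_BASE is defined in the health check script above.
curl -X POST \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/workflows/AJ1px9uHIfbsriof/activate"
 
# If the workflow is already active but not executing, cycle it to re-register the trigger.
# The n8n public API has no documented "run now" endpoint; an on-demand run needs the
# n8n UI or a webhook trigger added to the workflow.
curl -X POST \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/workflows/AJ1px9uHIfbsriof/deactivate"
curl -X POST \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/workflows/AJ1px9uHIfbsriof/activate"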

12. Disk Usage Monitoring

Per-Server Thresholds and Cleanup Procedures

bms-1 (Pinbox24 production — CRITICAL, already at 100%)

ssh root@94.23.26.113 "
  echo '=== Disk usage ==='
  df -h /
  echo '=== Top space consumers ==='
  du -sh /var/lib/docker /var/log /tmp 2>/dev/null | sort -rh | head -10
  echo '=== Docker disk usage ==='
  docker system df
"

Auto-cleanup (safe on bms-1):

ssh root@94.23.26.113 "
  # Remove stopped containers (running Pinbox24 containers are not touched)
  docker container prune -f
  # Remove dangling images only (NOT all unused images, to avoid removing p24 images)
  docker image prune -f
  # Cut log files larger than 50 MB down to their first 10 MB (the newest entries are discarded)
  find /var/log -type f -name '*.log' -size +50M -exec truncate -s 10M {} \;
  # Delete files in /tmp that have not been modified for more than 7 days
  find /tmp -type f -mtime +7 -delete 2>/dev/null || true
"

Alert threshold: bms-1 is already at 100%, where any failed write could take down production. Alert whenever usage exceeds 95%.

bms-3 (staging + MongoDB)

ssh ubuntu@51.68.155.224 "
  df -h /
  du -sh /var/lib/mongodb /var/lib/docker 2>/dev/null
"

Auto-cleanup:

ssh ubuntu@51.68.155.224 "
  docker image prune -f
  docker container prune -f
  sudo journalctl --vacuum-time=7d
"

Alert threshold: > 70% warning, > 80% critical.

bms-4 (this server)

df -h /
du -sh /var/lib/docker /var/lib/mongodb /home/claude-runner 2>/dev/null
docker system df

Alert threshold: > 60% warning (currently at 1% — 1.7 TB free).
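
The per-server thresholds above could be encoded in one lookup so the nightly run classifies every host the same way. The following is a minimal sketch; the function name `classify_disk` and the table layout are illustrative assumptions, and only the percentages come from this section.

```python
# Thresholds copied from the per-server notes in this section.
# bms-1 defines only a critical level and bms-4 only a warning level.
THRESHOLDS = {
    "bms-1": {"critical": 95.0},
    "bms-3": {"warning": 70.0, "critical": 80.0},
    "bms-4": {"warning": 60.0},
}


def classify_disk(server: str, used_percent: float) -> str:
    """Return 'critical', 'warning' or 'ok' for a disk usage percentage."""
    levels = THRESHOLDS[server]
    if "critical" in levels and used_percent > levels["critical"]:
        return "critical"
    if "warning" in levels and used_percent > levels["warning"]:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for server, used in [("bms-1", 100.0), ("bms-3", 44.0), ("bms-4", 1.0)]:
        print(server, used, classify_disk(server, used))
```

The usage percentage itself would come from `df` output or from the node exporter; that wiring is left out here.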

13. Security Nightly Checks

13.1 fail2ban Status

# Check banned IPs on all servers
for host in root@94.23.26.113 ubuntu@51.68.155.224 root@54.36.123.110 root@72.60.32.61; do
  echo "=== $host ==="
  ssh -o ConnectTimeout=10 "$host" \
    "fail2ban-client status sshd 2>/dev/null | grep -E 'Total banned|Currently banned'" \
    2>/dev/null || echo "fail2ban not running or unreachable"
done

Alert threshold: > 50 new bans in 24h = potential brute-force attack in progress.

13.2 SSH Auth Log Review

# Failed SSH login attempts on bms-3 and bms-4, counted over yesterday's and today's entries (UTC)
# (%e matches the space-padded day of month used by the traditional syslog timestamp format)
for host in ubuntu@51.68.155.224 root@54.36.123.110; do
  echo "=== $host failed SSH ==="
  ssh -o ConnectTimeout=10 "$host" \
    "grep 'Failed password\|Invalid user\|Authentication failure' /var/log/auth.log 2>/dev/null \
     | grep \"$(date -u --date='24 hours ago' '+%b %e')\\|$(date -u '+%b %e')\" \
     | wc -l" 2>/dev/null | xargs -I{} echo "Failed attempts: {}"
done

Alert threshold: > 200 failed attempts/24h on any single server.

13.3 SSL Certificate Expiry

Covered in Section 8.4. The check runs nightly and auto-creates a GitHub issue when a certificate expires in fewer than 14 days.

13.4 Credential Rotation Check

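# Check credential-exporter metrics for rotation age, oldest first
# (the sample value is the last field of each metric line, so it is moved to the front for sorting)
curl -sf "http://217.154.82.162:9210/metrics" \
  | grep "^credential_age_days" \
  | awk '{print $NF, $0}' \
  | sort -rn \
  | head -10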

Alert threshold: Any credential older than 80 days (rotation should be every 90 days).
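
The 80-day rule can also be applied directly to the exporter output. The sketch below parses Prometheus text exposition lines and returns the credentials over the limit. It assumes the metric is exposed as `credential_age_days{...} <value>`; the label set shown in the sample (`name="..."`) is hypothetical, since this document does not show the exporter's labels.

```python
import re

# Matches "credential_age_days{labels} value" or "credential_age_days value".
# Comment lines ("# HELP", "# TYPE") and other metric names do not match.
LINE = re.compile(r"^credential_age_days(\{[^}]*\})?\s+(\S+)")


def stale_credentials(metrics_text: str, max_age_days: float = 80.0):
    """Return (labels, age_days) pairs older than max_age_days, oldest first."""
    stale = []
    for line in metrics_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        labels = match.group(1) or ""
        age = float(match.group(2))
        if age > max_age_days:
            stale.append((labels, age))
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = "\n".join([
        "# HELP credential_age_days Days since last rotation",
        "# TYPE credential_age_days gauge",
        'credential_age_days{name="wasabi"} 85',
        'credential_age_days{name="github"} 12',
        'credential_age_days{name="n8n"} 91.5',
    ])
    for labels, age in stale_credentials(sample):
        print(f"ROTATE {labels}: {age} days")
```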

13.5 MongoDB Unauthorized Access Attempts

# Check MongoDB logs for today's auth failures (bms-3, likely PRIMARY).
# Filter by date first, then by failure messages only, so successful logins are not counted.
ssh ubuntu@51.68.155.224 "
  grep \"$(date -u '+%Y-%m-%d')\" /var/log/mongodb/mongod.log \
    | grep -i 'authentication failed\|unauthorized\|not authorized' \
    | tail -50
"

Part 3: MongoDB rs0 Maintenance

14. Replica Set Health Dashboard

Key Prometheus Metrics (PromQL)

The following metrics will be available once the planned mongodb-exporter integration is deployed. Until then, use the mongosh commands below.

Replication lag (secondary behind primary):

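# Once mongodb-exporter is deployed (exact metric names depend on the exporter version).
# PRIMARY optime minus SECONDARY optime, so lag is reported as a positive number:
scalar(max(mongodb_rs_member_optime_date{state="PRIMARY"}))
  - mongodb_rs_member_optime_date{state="SECONDARY"}
 
# Threshold alert: lag > 60s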

Oplog window (how long before secondary falls off the oplog):

mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp
# Healthy: > 24h (86400s)
# Warning: < 4h
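
Until the exporter exists, the same oplog window thresholds can be applied to the head and tail timestamps read via mongosh. A minimal sketch follows; the function name and the intermediate "watch" label for the 4h to 24h range are assumptions, since the text above only names the healthy and warning bounds.

```python
def oplog_window_status(head_ts: int, tail_ts: int) -> tuple[float, str]:
    """Classify the oplog window from newest (head) and oldest (tail) entry times.

    Both arguments are Unix timestamps in seconds.
    """
    hours = (head_ts - tail_ts) / 3600
    if hours > 24:
        status = "healthy"
    elif hours < 4:
        status = "warning"
    else:
        status = "watch"  # assumption: range not classified in the text above
    return hours, status


if __name__ == "__main__":
    print(oplog_window_status(head_ts=1_750_000_000, tail_ts=1_750_000_000 - 72 * 3600))
```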

Active connections:

mongodb_connections{state="current"}
# Alert if > 80% of maxConnections

RAM usage (node-level proxy metric):

# Until mongodb-exporter available, use node_memory_MemAvailable_bytes on bms-3
node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3
# Alert: available < 4 GB

mongosh Health Commands

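Run these on bms-3 (PRIMARY) via SSH. Note that keyFile authentication only covers traffic between replica set members. With access control enabled, a mongosh session still needs credentials for a user with at least the clusterMonitor role before status queries succeed:

// Run: ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '...'"
// Add -u / -p and --authenticationDatabase admin if the server rejects unauthenticated commands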
 
// === FULL REPLICA SET STATUS ===
rs.status()
 
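// === CONCISE MEMBER STATUS ===
rs.status().members.forEach(m => {
  // optimeDate is a plain Date; arbiters may not report one
  const optime = m.optimeDate ? m.optimeDate.toISOString() : 'N/A';
  const note = m.lastHeartbeatMessage ? ` | ${m.lastHeartbeatMessage}` : '';
  print(`${m.name} | ${m.stateStr} | health:${m.health} | ${optime}${note}`);
});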
 
// === REPLICATION LAG ===
// Get PRIMARY and SECONDARY optimes, compute lag in seconds
const status = rs.status();
const primary = status.members.find(m => m.stateStr === 'PRIMARY');
const secondaries = status.members.filter(m => m.stateStr === 'SECONDARY');
secondaries.forEach(s => {
  const lag = primary.optimeDate - s.optimeDate;
  print(`${s.name} lag: ${lag / 1000}s`);
});
 
// === OPLOG WINDOW ===
// getSiblingDB() works in scripts and --eval, where the "use <db>" helper may not
const oplog = db.getSiblingDB('local').oplog.rs;
const head = oplog.find().sort({$natural: -1}).limit(1).next();
const tail = oplog.find().sort({$natural: 1}).limit(1).next();
const windowHours = (head.ts.t - tail.ts.t) / 3600;
print(`Oplog window: ${windowHours.toFixed(1)} hours`);
// maxSize is the configured oplog cap; dbStats on "local" would report the whole database
print(`Oplog max size: ${(oplog.stats().maxSize / 1024 / 1024 / 1024).toFixed(2)} GB`);
 
// === CONNECTION STATS ===
const cs = db.getSiblingDB('admin').serverStatus().connections;
print(`Connections: current=${cs.current} available=${cs.available} totalCreated=${cs.totalCreated}`);
 
// === SLOW QUERIES (profiler, 20 most recent per database) ===
// system.profile lives in each profiled database, not in "local".
// If profiling is off, enable it per database first:
//   db.getSiblingDB(name).setProfilingLevel(1, { slowms: 100 })
db.adminCommand({ listDatabases: 1 }).databases
  .filter(d => !['admin', 'local', 'config'].includes(d.name))
  .forEach(d => {
    db.getSiblingDB(d.name).system.profile.find(
      { millis: { $gte: 100 } }
    ).sort({ ts: -1 }).limit(20).forEach(p => {
      print(`${p.ts.toISOString()} | ${p.op} | ${p.ns} | ${p.millis}ms`);
    });
  });

Grafana Dashboard Panels (to create)

Panel	Metric/Query	Visualization
rs0 member states	`rs.status()` via scraper	Status map (3 nodes)
Replication lag	`mongodb_rs_member_optime_date` diff	Time series
Oplog window	head - tail timestamp	Stat panel
bms-3 RAM available	`node_memory_MemAvailable_bytes`	Gauge
bms-3 disk usage	`node_filesystem_avail_bytes`	Gauge
MongoDB connections	`mongodb_connections{state="current"}`	Time series

15. Regular Maintenance Schedule

Every 15 Minutes (continuous monitoring)

Handled by Prometheus + Alertmanager (automated). No agent action needed unless alert fires.

rs.status() health check (via future mongodb-exporter scrape)
Replication lag check
Connection count check

Daily (02:30 UTC — nightly ops window)

AI-Dev-BMS4 executes:

Quick rs.status() check via SSH
Check replication lag < 60 seconds
Verify all 3 members visible and healthy
Check MongoDB log for errors in last 24h
Verify arbiter (bms-4) is in ARBITER state (not DOWN)

# Daily MongoDB health check script
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const s = rs.status();
  const healthy = s.members.filter(m => m.health === 1).length;
  const primary = s.members.find(m => m.stateStr === \"PRIMARY\");
  if (!primary) { print(\"CRITICAL: No PRIMARY\"); quit(1); }
  if (healthy < 2) { print(\"CRITICAL: Only \" + healthy + \" healthy members\"); quit(1); }
  print(\"rs0 OK: PRIMARY=\" + primary.name + \" healthy=\" + healthy + \"/\" + s.members.length);
'"
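
The daily checklist also calls for lag and arbiter checks. One way to cover them is to export `rs.status()` as JSON and evaluate it off-host. The sketch below assumes a dict shaped like `rs.status()` with `optimeDate` already converted to Unix seconds; the function name `check_rs0` is illustrative, and the MongoDB log review from the checklist is not covered.

```python
def check_rs0(status: dict, max_lag_s: float = 60.0, expected_members: int = 3) -> list[str]:
    """Return a list of problems found in an rs.status()-like dict (empty = healthy)."""
    problems = []
    members = status.get("members", [])

    # All members visible and healthy
    if len(members) != expected_members:
        problems.append(f"expected {expected_members} members, saw {len(members)}")
    for m in members:
        if m.get("health") != 1:
            problems.append(f"{m['name']} unhealthy ({m.get('stateStr')})")

    # A PRIMARY exists and every SECONDARY is within the lag budget
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        problems.append("no PRIMARY")
    else:
        for m in members:
            if m.get("stateStr") == "SECONDARY":
                lag = primary["optimeDate"] - m["optimeDate"]
                if lag >= max_lag_s:
                    problems.append(f"{m['name']} lag {lag:.0f}s")

    # The arbiter reports ARBITER state (a down arbiter shows a different stateStr)
    if not any(m.get("stateStr") == "ARBITER" for m in members):
        problems.append("arbiter not in ARBITER state")
    return problems


if __name__ == "__main__":
    sample = {"members": [
        {"name": "bms-3", "stateStr": "PRIMARY", "health": 1, "optimeDate": 1000},
        {"name": "bms-2", "stateStr": "SECONDARY", "health": 1, "optimeDate": 998},
        {"name": "bms-4", "stateStr": "ARBITER", "health": 1},
    ]}
    print(check_rs0(sample) or "rs0 OK")
```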

Weekly (Sunday 03:00 UTC)

Oplog size and window analysis
Index statistics — identify unused indexes
Collection statistics — size, count, fragmentation
Profiler slow query report
Review MongoDB error log summary

# Weekly MongoDB analysis, run on bms-3 as PRIMARY
# (\$natural is escaped so the local shell does not expand it inside the double-quoted ssh command)
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const oplog = db.getSiblingDB(\"local\").oplog.rs;
  const head = oplog.find().sort({\$natural: -1}).limit(1).next();
  const tail = oplog.find().sort({\$natural: 1}).limit(1).next();
  const windowHours = (head.ts.t - tail.ts.t) / 3600;
  print(\"=== OPLOG ===\");
  print(\"Window: \" + windowHours.toFixed(1) + \" hours\");
  print(\"Oplog max size: \" + (oplog.stats().maxSize / 1024/1024/1024).toFixed(2) + \" GB\");
  print(\"\");
  
  print(\"=== DATABASES ===\");
  db.adminCommand({listDatabases: 1}).databases.forEach(d => {
    print(d.name + \": \" + (d.sizeOnDisk / 1024/1024).toFixed(0) + \" MB\");
  });
'"

Monthly (1st Sunday of month, 03:00 UTC)

Review replica set configuration — member priorities and votes
Check keyFile md5 consistency across all members
Rolling restart assessment (needed if mongod version update pending)
Index optimization — drop unused indexes, rebuild fragmented ones
Capacity forecast — disk and RAM trend analysis
Review and rotate monitoring credentials

16. Maintenance Procedures

16.1 MongoDB Version Check and Update Assessment

# Check versions on all members
for host in "ubuntu@51.68.155.224" "ubuntu@145.239.133.104" "root@54.36.123.110"; do
  echo "=== $host ==="
  ssh -o ConnectTimeout=10 "$host" "mongod --version 2>/dev/null | head -1" 2>/dev/null
done

Current versions:

bms-3: 7.0.26
bms-2: 7.0.25
bms-4: 7.0.37

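Patch-level skew is acceptable within the 7.0.x series. Plan a rolling update when any member falls 2+ patch releases behind the current 7.0.x stable release. By that rule, bms-3 and bms-2 (11 and 12 patch releases behind bms-4) are already due for an update.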
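
The skew rule can be checked mechanically from the version strings collected by the loop above. This is a sketch under one stated assumption: the newest patch level seen in the replica set stands in for "current 7.0.x stable", because this document does not name the upstream release. The function name `patch_skew` is illustrative.

```python
def patch_skew(versions: dict[str, str], max_behind: int = 2) -> dict[str, int]:
    """Return members that are max_behind or more patch releases behind the newest member."""
    parsed = {host: tuple(int(part) for part in v.split(".")) for host, v in versions.items()}

    # The rule only applies within one major.minor series (7.0.x here)
    series = {p[:2] for p in parsed.values()}
    if len(series) != 1:
        raise ValueError(f"members span several release series: {sorted(series)}")

    newest = max(p[2] for p in parsed.values())
    return {
        host: newest - p[2]
        for host, p in parsed.items()
        if newest - p[2] >= max_behind
    }


if __name__ == "__main__":
    current = {"bms-3": "7.0.26", "bms-2": "7.0.25", "bms-4": "7.0.37"}
    print(patch_skew(current))
```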

16.2 Rolling Member Restart (for mongod updates)

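Never restart all members simultaneously. Follow this order to limit disruption:

Restart the SECONDARY member first (bms-2 observer, least impact)
Wait for the SECONDARY to rejoin and catch up (lag = 0)
Restart the ARBITER (bms-4). It holds no data and restarts quickly, but bms-3 loses its voting majority while the arbiter is down (see Section 18.1), so expect a brief write interruption
Step down the PRIMARY (bms-3). With bms-2 at priority 0 there is no electable secondary, so rs.stepDown() returns an error and no election takes place unless bms-2 is temporarily promoted as described in Section 18.2
Wait for a new PRIMARY election (only applies if bms-2 was promoted)
Restart the former PRIMARY (bms-3). Without a promoted bms-2, writes are unavailable until it returns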

# Step 1: Restart bms-2 (SECONDARY observer)
ssh ubuntu@145.239.133.104 "sudo systemctl restart mongod"
# Wait for it to rejoin
sleep 30
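ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const members = rs.status().members;
  const primary = members.find(m => m.stateStr === \"PRIMARY\");
  const bms2 = members.find(m => m.name.includes(\"145.239\"));
  // Lag is measured against the PRIMARY optime, not against the wall clock
  print(bms2.name, bms2.stateStr, \"lag:\", (primary.optimeDate - bms2.optimeDate) / 1000 + \"s\");
'"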
 
# Step 2: Restart bms-4 (ARBITER — only if joined)
ssh root@54.36.123.110 "systemctl restart mongod"
sleep 10
 
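# Step 3: Step down bms-3 PRIMARY
# With the current config (bms-2 priority 0) there is no electable secondary, so this call
# is expected to return an error unless bms-2 was promoted first (Section 18.2).
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval 'rs.stepDown(60)'"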
# Wait for election (typically < 12s)
sleep 15
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const p = rs.status().members.find(m => m.stateStr === \"PRIMARY\");
  if (p) print(\"New PRIMARY:\", p.name); else print(\"No PRIMARY yet — wait\");
'"
 
# Step 4: Restart former PRIMARY (bms-3)
ssh ubuntu@51.68.155.224 "sudo systemctl restart mongod"
sleep 30
 
# Step 5: Verify rs0 fully healthy
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  rs.status().members.forEach(m => print(m.name, m.stateStr, \"health:\", m.health));
'"

16.3 Index Optimization

// Identify unused indexes (run on PRIMARY during weekly maintenance)
// Run: mongosh --quiet --eval '...' on bms-3
 
// For each database, check index usage stats
db.adminCommand({listDatabases: 1}).databases
  .filter(d => !['admin', 'local', 'config'].includes(d.name))
  .forEach(d => {
    const db2 = db.getSiblingDB(d.name);
    db2.getCollectionNames().forEach(coll => {
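      db2.runCommand({aggregate: coll, pipeline: [
        {$indexStats: {}},
        // Used < 5 times since accesses.since (counters reset when mongod restarts).
        // The mandatory _id index is skipped because it cannot be dropped.
        {$match: {name: {$ne: '_id_'}, 'accesses.ops': {$lt: 5}}}
      ], cursor: {}}).cursor.firstBatch.forEach(idx => {
        print(`UNUSED: ${d.name}.${coll} index: ${JSON.stringify(idx.key)} ops:${idx.accesses.ops}`);
      });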
    });
  });

16.4 Oplog Size Adjustment

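The default oplog size for WiredTiger on MongoDB 7.0 is 5% of free disk space, with a 990 MB minimum and a 50 GB maximum. For bms-3 (410 GB disk, ~170 GB used), this works out to roughly 10 to 12 GB, depending on how much disk was free when the replica set was initiated.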

// Check current oplog configuration
use local;
db.oplog.rs.stats().maxSize / 1024 / 1024 / 1024 + " GB"
 
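// Resize the oplog if needed (MongoDB 3.6+). replSetResizeOplog takes effect online,
// without a restart, and must be run on each data-bearing member separately (size is in MB):
db.adminCommand({ replSetResizeOplog: 1, size: Double(10240) })  // 10 GB
// Setting replication.oplogSizeMB in /etc/mongod.conf only affects the initial oplog
// creation; it does not resize an existing oplog.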

Recommended oplog size: 10 GB on bms-3 (covers ~72h of operations at current write volume).
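
The relationship between oplog size, write volume, and window length is a simple ratio, which makes it easy to re-check the recommendation as write volume changes. The sketch below derives the write rate implied by the figures above (10 GB covering about 72 h); the function names are illustrative and the rate is an inference from those two numbers, not a measured value.

```python
def oplog_window_hours(oplog_size_gb: float, write_gb_per_hour: float) -> float:
    """Hours of operations an oplog of the given size can hold at a steady write rate."""
    return oplog_size_gb / write_gb_per_hour


def required_oplog_gb(target_hours: float, write_gb_per_hour: float) -> float:
    """Oplog size needed to retain target_hours of operations."""
    return target_hours * write_gb_per_hour


if __name__ == "__main__":
    # Rate implied by "10 GB covers ~72h": about 0.14 GB of oplog per hour
    implied_rate = 10 / 72
    print(f"implied write rate: {implied_rate:.3f} GB/h")
    print(f"size for a 24h window: {required_oplog_gb(24, implied_rate):.2f} GB")
    # If write volume doubled, the same 10 GB would cover half the time
    print(f"window at 2x writes: {oplog_window_hours(10, 2 * implied_rate):.0f} h")
```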

17. Backup Strategy

Backup Architecture

Backup type	Source	Destination	Schedule	Retention
mongodump (full)	bms-3 (PRIMARY)	Wasabi `p24-infra/mongodb/`	Weekly, Sunday 01:00 UTC	4 weeks
mongodump (incremental/oplog)	bms-3 (PRIMARY)	Wasabi `p24-infra/mongodb/oplog/`	Daily, 01:00 UTC	7 days
bms-2 disk snapshot	bms-2 (observer)	OVH snapshot API	Monthly	2 snapshots

Why back up from PRIMARY (bms-3) and not from SECONDARY? bms-2 is designated as observer and dev environment host. Its MongoDB data is kept current but is not treated as the backup source. Backups run from PRIMARY to ensure the most up-to-date data.

Why not back up from the arbiter (bms-4)? Arbiters store no data.
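
The retention column can be enforced by comparing the date embedded in each object key with a cutoff. The sketch below assumes keys end in `YYYY-MM-DD.tar.gz`, as in the backup scripts in this section. How retention is actually enforced (a script like this or a bucket lifecycle rule) is not stated in this document, so treat the function as an illustration only.

```python
import re
from datetime import date, timedelta

# Date suffix used by the backup scripts, e.g. mongodb-full-2026-06-14.tar.gz
KEY_DATE = re.compile(r"(\d{4}-\d{2}-\d{2})\.tar\.gz$")


def expired_keys(keys: list[str], today: date, retention_days: int) -> list[str]:
    """Return keys whose embedded date is older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for key in keys:
        match = KEY_DATE.search(key)
        if match and date.fromisoformat(match.group(1)) < cutoff:
            expired.append(key)
    return expired


if __name__ == "__main__":
    full = [
        "mongodb/full/mongodb-full-2026-06-07.tar.gz",
        "mongodb/full/mongodb-full-2026-05-10.tar.gz",
    ]
    # 4 weeks for full dumps, 7 days for oplog dumps (from the table above)
    print(expired_keys(full, date(2026, 6, 14), retention_days=28))
```

Keys without a recognizable date are never reported as expired, which errs on the side of keeping data.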

Daily Oplog Backup Script

#!/usr/bin/env bash
# /root/mongodb-oplog-backup.sh on bms-3
set -euo pipefail
DATE=$(date -u +%Y-%m-%d)
BACKUP_DIR="/tmp/mongodb-oplog-${DATE}"
S3_KEY="mongodb/oplog/oplog-${DATE}.tar.gz"
SINCE=$(date -u --date='25 hours ago' +%s)
 
# Dump oplog entries from the last 25h (one hour of overlap with the previous daily run).
# mongodump expects --query in Extended JSON v2, so the timestamp uses the $timestamp form.
mongodump \
  --host 127.0.0.1:27017 \
  --authenticationDatabase admin \
  -u admin -p "$MONGODB_ADMIN_PASSWORD" \
  --db local \
  --collection oplog.rs \
  --query '{"ts": {"$gte": {"$timestamp": {"t": '"$SINCE"', "i": 0}}}}' \
  --out "$BACKUP_DIR"
 
# Compress
tar czf "/tmp/${DATE}-oplog.tar.gz" -C "$BACKUP_DIR" .
rm -rf "$BACKUP_DIR"
 
# Upload to Wasabi
python3 - << PYEOF
import boto3, certifi, os
s3 = boto3.client("s3",
    endpoint_url="https://s3.eu-central-2.wasabisys.com",
    region_name="eu-central-2",
    verify=certifi.where()
)
with open(f"/tmp/${DATE}-oplog.tar.gz", "rb") as f:
    s3.upload_fileobj(f, "p24-infra", "${S3_KEY}")
print(f"Uploaded: s3://p24-infra/${S3_KEY}")
PYEOF
 
rm -f "/tmp/${DATE}-oplog.tar.gz"

Weekly Full Backup Script

#!/usr/bin/env bash
# /root/mongodb-full-backup.sh on bms-3
set -euo pipefail
DATE=$(date -u +%Y-%m-%d)
BACKUP_DIR="/tmp/mongodb-full-${DATE}"
S3_KEY="mongodb/full/mongodb-full-${DATE}.tar.gz"
 
# Full dump of all databases
mongodump \
  --host 127.0.0.1:27017 \
  --authenticationDatabase admin \
  -u admin -p "$MONGODB_ADMIN_PASSWORD" \
  --oplog \
  --out "$BACKUP_DIR"
 
# Compress
tar czf "/tmp/${DATE}-full.tar.gz" -C "$BACKUP_DIR" .
BACKUP_SIZE=$(du -sh "/tmp/${DATE}-full.tar.gz" | cut -f1)
rm -rf "$BACKUP_DIR"
 
# Upload to Wasabi
python3 - << PYEOF
import boto3, certifi, os, datetime
s3 = boto3.client("s3",
    endpoint_url="https://s3.eu-central-2.wasabisys.com",
    region_name="eu-central-2",
    verify=certifi.where()
)
with open(f"/tmp/${DATE}-full.tar.gz", "rb") as f:
    s3.upload_fileobj(f, "p24-infra", "${S3_KEY}")
print(f"Uploaded: s3://p24-infra/${S3_KEY} (${BACKUP_SIZE})")
PYEOF
 
rm -f "/tmp/${DATE}-full.tar.gz"
echo "Weekly backup complete: $S3_KEY ($BACKUP_SIZE)"

Backup Verification (Weekly)

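# Verify that the last weekly backup can be downloaded and its archive listed. This checks
# structure only; proving restorability would need a mongorestore into a scratch mongod.
# When run on a Sunday, 'last Sunday' resolves to the previous week, so use today's date instead.
if [ "$(date -u +%u)" -eq 7 ]; then
  DATE=$(date -u +%Y-%m-%d)
else
  DATE=$(date -u --date='last Sunday' +%Y-%m-%d)
fi
S3_KEY="mongodb/full/mongodb-full-${DATE}.tar.gz"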
 
# Download and verify (bms-2 has spare disk and RAM)
ssh ubuntu@145.239.133.104 "
  python3 -c \"
import boto3, certifi
s3 = boto3.client('s3', endpoint_url='https://s3.eu-central-2.wasabisys.com',
    region_name='eu-central-2', verify=certifi.where())
s3.download_file('p24-infra', '${S3_KEY}', '/tmp/backup-verify.tar.gz')
print('Download OK:', s3.head_object(Bucket='p24-infra', Key='${S3_KEY}')['ContentLength'], 'bytes')
\"
  # Extract and verify structure
  tar tzf /tmp/backup-verify.tar.gz | head -20
  rm -f /tmp/backup-verify.tar.gz
  echo 'Backup structure verified'
"

18. Failover Runbook

18.1 Planned Failover (Maintenance Step-down)

Use this when taking bms-3 (PRIMARY) offline for maintenance.

Prerequisites:

Confirm bms-2 (SECONDARY) is caught up (lag = 0)
Confirm bms-4 (ARBITER) is healthy
Notify team via Discord before starting

// 1. Check pre-conditions
rs.status().members.forEach(m => print(m.name, m.stateStr, "health:", m.health));
 
// 2. Step down bms-3 for 120s
// Run on bms-3. In the current config bms-2 has priority 0 and votes 0, so it is not
// electable: rs.stepDown() returns an error unless bms-2 has first been given
// priority 1 and a vote (see Section 18.2).
rs.stepDown(120);
 
// 3. Verify the outcome
// Run on bms-2. It becomes PRIMARY only if it was promoted beforehand; the bms-4 ARBITER
// then supplies the second vote needed for a majority.
rs.status();

Important quorum consideration: rs0 has:

bms-3: 1 vote (PRIMARY or SECONDARY)
bms-2: 0 votes (observer — non-voting)
bms-4: 1 vote (ARBITER)

Total votes: 2. Majority needed: 2. If arbiter (bms-4) is DOWN, bms-3 cannot hold PRIMARY (loses majority = only 1/2 votes). The arbiter’s presence is essential for quorum.
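
The quorum arithmetic can be written down explicitly: a majority is floor(total votes / 2) + 1, counted over voting members only. The sketch below uses the vote assignment listed above; the function name `has_majority` is illustrative.

```python
# Vote assignment from the list above: bms-2 is a non-voting observer
VOTES = {"bms-3": 1, "bms-2": 0, "bms-4": 1}


def has_majority(up_members: set[str], votes: dict[str, int] = VOTES) -> bool:
    """True if the reachable members hold a strict majority of all configured votes."""
    total = sum(votes.values())
    needed = total // 2 + 1
    reachable = sum(votes[m] for m in up_members)
    return reachable >= needed


if __name__ == "__main__":
    print("all up:      ", has_majority({"bms-3", "bms-2", "bms-4"}))
    print("arbiter down:", has_majority({"bms-3", "bms-2"}))
    print("bms-3 down:  ", has_majority({"bms-2", "bms-4"}))
```

With two voting members, losing either one removes the majority, which is why both an arbiter outage and a bms-3 outage stop writes in this configuration.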

18.2 Emergency Failover (Primary Fails Unexpectedly)

# Step 1: Verify the situation
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval 'rs.status()'" 2>/dev/null \
  || echo "bms-3 unreachable"
 
# Step 2: Check from bms-2 perspective
ssh ubuntu@145.239.133.104 "mongosh --quiet --eval '
  const s = rs.status();
  s.members.forEach(m => print(m.name, m.stateStr, m.health));
  print(\"myState:\", s.myState);
'"
 
# Step 3: If arbiter (bms-4) is healthy and bms-3 is truly down:
# The replica set CANNOT elect bms-2 as PRIMARY because bms-2 has priority:0
# The set will be in READ-ONLY state (only 1 vote: arbiter, but arbiter cannot become primary)
# 
# RECOVERY OPTIONS:
# A) Restore bms-3 and let it rejoin as PRIMARY (preferred)
# B) Change bms-2 priority to 1 to allow it to become PRIMARY (emergency only)
#    Requires admin credentials — this is a HUMAN ACTION
 
# Step 4 (Human Action): Temporarily promote bms-2
# Run on bms-2 with admin credentials:
# cfg = rs.conf();
# cfg.members[0].priority = 1;  // adjust index for bms-2 member
# cfg.members[0].votes = 1;
# rs.reconfig(cfg, { force: true });  // force is required: there is no PRIMARY to accept a normal reconfig
# rs.status();  // should elect bms-2 as PRIMARY
 
# Step 5: Alert human operator immediately
source ~/.claude-env
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content":"CRITICAL: MongoDB rs0 has no PRIMARY. bms-3 may be down. HUMAN ACTION REQUIRED immediately. See runbook section 18.2."}'

18.3 Arbiter Loss Recovery

If bms-4 (ARBITER) is down:

Immediate impact: bms-2 is non-voting, so bms-3 holds only 1 of the 2 votes (2 are needed). rs0 loses quorum and bms-3 steps down to SECONDARY
Critical impact: All writes to MongoDB stop until the arbiter is restored
Resolution: Restart mongod on bms-4 first, as it is the quickest recovery path

ssh root@54.36.123.110 "
  systemctl status mongod
  systemctl restart mongod
  sleep 10
  systemctl status mongod
"
 
# Verify arbiter rejoined
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const arb = rs.status().members.find(m => m.stateStr === \"ARBITER\");
  if (arb) print(\"Arbiter OK:\", arb.name); else print(\"Arbiter NOT found\");
'"

19. Capacity Planning Guide

Current Baseline (2026-06-14)

Server	RAM Total	RAM Used	Disk Total	Disk Used	Growth Risk
bms-3 (PRIMARY)	32 GB	21.7 GB (MongoDB) + ~4 GB Docker	410 GB	170 GB (44%)	HIGH — OOM risk
bms-2 (observer)	32 GB	~2 GB (MongoDB replica)	410 GB	62 GB (16%)	LOW
bms-4 (arbiter)	32 GB	~75 MB (mongod arbiter)	1.8 TB	8.3 GB (1%)	VERY LOW

RAM Capacity Thresholds — bms-3

Available RAM	Action
> 8 GB	Normal
4–8 GB	Warning: monitor hourly
2–4 GB	Alert: plan immediate maintenance window
< 2 GB	Critical: initiate emergency mongod restart or workload migration

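bms-3 RAM forecast: MongoDB 7.0 sizes the WiredTiger cache at 50% of (RAM − 1 GB) by default, which is about 15.5 GB on a 32 GB host. The current data set appears to fill this. mongod also uses memory outside the cache (connections, sorts, aggregation buffers), which would explain a total footprint above the cache size. As data grows, MongoDB may attempt to use more. Watch the wiredTigerCacheBytesInUse metric.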

// Check WiredTiger cache usage
use admin;
const st = db.serverStatus().wiredTiger.cache;
print("Cache target:", st["maximum bytes configured"] / 1024/1024/1024 + " GB");
print("Cache used:", st["bytes currently in the cache"] / 1024/1024/1024 + " GB");
print("Evictions:", st["pages evicted by application threads"]);
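
For comparison with the "Cache target" value printed above, the documented default can be computed from host RAM: the larger of 50% of (RAM − 1 GB) and 256 MB. The function name is illustrative, and an explicit `storage.wiredTiger.engineConfig.cacheSizeGB` setting in mongod.conf would override this default.

```python
def default_wiredtiger_cache_gb(ram_gb: float) -> float:
    """Default WiredTiger cache size in GB for a host with ram_gb of RAM."""
    return max(0.5 * (ram_gb - 1), 0.25)


if __name__ == "__main__":
    for ram in (32, 64):
        print(f"{ram} GB RAM -> {default_wiredtiger_cache_gb(ram)} GB cache")
```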

Disk Capacity Thresholds — bms-3

MongoDB data directory + Docker images + logs on 410 GB:

MongoDB: ~150 GB estimated (oplog + data)
Docker: ~20 GB (staging images)
Available headroom: ~200 GB at 44% used

Action triggers:

60% disk (246 GB used) → run docker image prune
70% disk (287 GB used) → evaluate moving old staging versions to archive
80% disk (328 GB used) → emergency cleanup or MongoDB oplog resize

Key Metrics to Track Weekly

db.serverStatus().opcounters — query/insert/update/delete rate
Oplog window (hours) — must stay > 24h for safe secondary operations
Replication lag trend — should stay near 0
db.serverStatus().connections.current — connection count trend
WiredTiger cache eviction rate — high eviction = memory pressure

Capacity Forecast Queries

// Database size growth (run weekly, track over time)
use admin;
db.adminCommand({listDatabases: 1}).databases
  .filter(d => d.name !== 'local')
  .sort((a, b) => b.sizeOnDisk - a.sizeOnDisk)
  .forEach(d => {
    print(`${d.name}: ${(d.sizeOnDisk / 1024/1024/1024).toFixed(3)} GB`);
  });
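
Once weekly size samples accumulate, a linear extrapolation gives a rough time-to-threshold. This is a sketch only: the sample values are hypothetical, growth is assumed to be linear, and the 410 GB total and 80% trigger are taken from the bms-3 figures above.

```python
from typing import Optional


def weeks_until(samples_gb: list[float],
                disk_total_gb: float = 410.0,
                threshold: float = 0.80) -> Optional[float]:
    """Estimate weeks until disk usage crosses threshold, from weekly usage samples.

    Returns 0.0 if already over the threshold and None if usage is flat or shrinking.
    """
    if len(samples_gb) < 2:
        raise ValueError("need at least two weekly samples")
    # Average weekly growth between the first and last sample
    growth = (samples_gb[-1] - samples_gb[0]) / (len(samples_gb) - 1)
    limit = disk_total_gb * threshold
    if samples_gb[-1] >= limit:
        return 0.0
    if growth <= 0:
        return None
    return (limit - samples_gb[-1]) / growth


if __name__ == "__main__":
    # Hypothetical weekly "disk used" readings in GB
    readings = [160.0, 165.0, 170.0]
    print(f"weeks until 80%: {weeks_until(readings):.1f}")
```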

20. Alert Definitions

Prometheus Alert Rules for MongoDB rs0

Add to monitoring/prometheus/rules/mongodb.yml:

groups:
  - name: mongodb_rs0
    interval: 60s
    rules:
 
      # ─── Connectivity ──────────────────────────────────────────────────────────
 
      - alert: MongoDBMemberDown
        expr: |
          # Port probe — until mongodb-exporter available
          probe_success{job="blackbox_tcp", instance=~".*27017.*"} == 0
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "MongoDB member unreachable: {{ $labels.instance }}"
          description: "MongoDB port 27017 not responding on {{ $labels.instance }}. Check if mongod is running."
 
      # ─── RAM pressure on bms-3 ─────────────────────────────────────────────────
 
      - alert: BMS3MemoryCritical
        expr: |
          node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3 < 2
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "bms-3 available RAM < 2 GB"
          description: "MongoDB PRIMARY (bms-3) has < 2 GB available RAM. OOM risk is high. MongoDB is using ~21.7 GB. Immediate action required."
 
      - alert: BMS3MemoryWarning
        expr: |
          node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3 < 4
        for: 10m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "bms-3 available RAM < 4 GB"
          description: "MongoDB PRIMARY (bms-3) memory is getting low. Available: {{ $value | humanize }} GiB. Plan maintenance window."
 
      # ─── Disk usage ────────────────────────────────────────────────────────────
 
      - alert: BMS3DiskWarning
        expr: |
          (node_filesystem_size_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"} -
           node_filesystem_avail_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"}) /
           node_filesystem_size_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"} > 0.70
        for: 15m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "bms-3 disk usage > 70%"
          description: "Disk on bms-3 is {{ $value | humanizePercentage }} full. Run docker image prune and check MongoDB oplog size."
 
      - alert: BMS1DiskCritical
        expr: |
          (node_filesystem_size_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"} -
           node_filesystem_avail_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"}) /
           node_filesystem_size_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"} > 0.95
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "bms-1 (Pinbox24 PRODUCTION) disk > 95%"
          description: "Pinbox24 production server disk at {{ $value | humanizePercentage }}. Writes may fail. EMERGENCY: immediate cleanup required."
 
      # ─── Replication (once mongodb-exporter is deployed) ──────────────────────
 
      - alert: MongoDBReplicationLagHigh
        expr: |
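          # Placeholder: replace with the actual mongodb-exporter metric once deployed.
          # Lag is PRIMARY optime minus SECONDARY optime, so the value is positive:
          # scalar(max(mongodb_rs_member_optime_date{state="PRIMARY"})) - mongodb_rs_member_optime_date{state="SECONDARY"} > 60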
          absent(up{job="mongodb"}) == 1
        for: 5m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "MongoDB exporter not yet deployed"
          description: "Deploy mongodb-exporter to get replication lag metrics. Manual check: ssh ubuntu@51.68.155.224 mongosh --eval 'rs.status()'"
 
      # ─── Blackbox TCP probes (to create in prometheus.yml) ───────────────────
 
      # Add to monitoring/prometheus/prometheus.yml blackbox job:
      # - targets:
      #   - 51.68.155.224:27017   # bms-3 MongoDB
      #   - 145.239.133.104:27017  # bms-2 MongoDB
      #   - 54.36.123.110:27017   # bms-4 MongoDB arbiter
      #   labels: { job: blackbox_tcp, module: tcp_connect }

Alertmanager Routing for MongoDB

Add under the top-level route: block in monitoring/alertmanager/config.yml:

routes:
  - match:
      team: infra
      severity: critical
    receiver: discord-critical
    repeat_interval: 30m
    continue: true
  - match:
      team: infra
      severity: warning
    receiver: discord-warning
    repeat_interval: 4h

Manual Alert Test

# Send test alert via the Alertmanager v2 API (the v1 API was removed in Alertmanager 0.27)
curl -X POST http://217.154.82.162:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "MongoDBTestAlert",
      "severity": "warning",
      "team": "infra",
      "instance": "54.36.123.110:27017"
    },
    "annotations": {
      "summary": "Test alert from runbook",
      "description": "Manual test of MongoDB alert routing"
    },
    "endsAt": "'"$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)"'"
  }]'

Appendix: Quick Reference Card

rs0 Member Summary

Server	IP	Port	Role	Votes	Priority
bms-3 (ns3129867)	`51.68.155.224`	27017	PRIMARY/SECONDARY	1	1
bms-2 (ns3087638)	`145.239.133.104`	27017	SECONDARY observer	0	0
bms-4 (ns3101999)	`54.36.123.110`	27017	ARBITER	1	0

Quorum: 2 votes required. bms-3 (1) + bms-4 (1) = majority. If arbiter down, no quorum.

Emergency Contacts

Issue	First action	Escalate if
No MongoDB PRIMARY	Check arbiter health first	Arbiter healthy but still no PRIMARY
bms-1 disk 100%	Run docker prune	Disk still 100% after cleanup
WAHA session down	POST /api/sessions/default/restart	Session fails to restart
Atrax GPS stale > 30 min	Check n8n container, trigger workflow	n8n healthy but workflow still fails

AI-Dev-BMS4 Status Check

# From local workstation — check if nightly ran successfully
ssh root@54.36.123.110 "tail -20 /var/log/bms4-nightly.log"
 
# Check last cron execution time
ssh root@54.36.123.110 "ls -la /var/log/bms4-nightly.log && grep 'END' /var/log/bms4-nightly.log | tail -3"
 
# Check active agent sessions on bms-4
# (requires SUPABASE credentials)
