03 — Nightly Operations & MongoDB rs0 Maintenance
Status: Design document — 2026-06-14
Scope: AI-Dev-BMS4 agent setup · nightly operations checklist · MongoDB rs0 maintenance plan
Servers covered: bms-4 (54.36.123.110) · bms-3 (51.68.155.224) · bms-2 (145.239.133.104)
Related docs: p4-ovh-bms-4-ns3101999-operations.md · p4-ovh-bms-3-ns3129867-operations.md · p4-ovh-bms-2-ns3087638-operations.md
Table of Contents
Part 1: AI-Dev-BMS4 Agent Design
1. Agent Overview
2. Installation Checklist
3. Issue Pickup Logic
4. Escalation Rules
5. Agent Configuration
Part 2: Nightly Operations
6. Nightly Operations Schedule
7. Tier 1 Critical Service Checks
8. Tier 2 Platform Checks
9. Tier 3 Quality Checks
10. Supabase Maintenance
11. n8n Workflow Health
12. Disk Usage Monitoring
13. Security Nightly Checks
Part 3: MongoDB rs0 Maintenance
14. Replica Set Health Dashboard
15. Regular Maintenance Schedule
16. Maintenance Procedures
17. Backup Strategy
18. Failover Runbook
19. Capacity Planning Guide
20. Alert Definitions
Part 1: AI-Dev-BMS4 Agent Design
1. Agent Overview
Role
AI-Dev-BMS4 is the autonomous Claude Code agent deployed on bms-4 (54.36.123.110). Its primary responsibility is nightly infrastructure issue processing — it picks up GitHub issues created by GitHub Actions health checks, attempts automated remediation, and escalates unresolvable problems to human operators.
Position in the Agent Fleet
| Agent | Host | Role | Max Parallel |
|---|---|---|---|
| AI-Dev-IO1 | vps-i1 (IONOS) | et-operational-platform issue processing | 2–3 |
| AI-Dev-HS1 | vps-h1 (Hostinger) | p24-infra issue pipeline + claude-proxy | 1–2 |
| AI-Dev-OV1 | bms-2 (OVH) | dev/test workloads | 4 |
| AI-Dev-BMS4 | bms-4 (OVH) | nightly p24-infra ops + MongoDB maintenance | 4 |
AI-Dev-BMS4 is specifically designed for the 02:00–06:00 UTC nightly window when GitHub Actions have generated issues from health checks and infrastructure scans. It runs on the server with the most free RAM (30+ GB free) and disk space (1.7 TB free).
Capabilities
- Clone and operate on the radieu/p24-infra repository (dedicated clone at /home/claude-runner/p24-infra)
- Run diagnostic commands: curl, docker, mongosh, ssh (read-only diagnostics)
- Create GitHub issues, add comments, apply labels, and open PRs to dev
- Query Prometheus metrics API for service health
- Send Discord notifications for immediate alerts
- Restart failed Docker containers on bms-4 only (owns its own host)
- SSH read-only access to vps-i1, vps-h1, bms-2, bms-3 for diagnostics
Constraints
- Never write to production databases or modify MongoDB data
- Never restart containers on remote servers (vps-h1, bms-3) — SSH is read-only for diagnostics
- Never push directly to main; all changes go via PR to dev
- Never expose secret values in GitHub issue comments
- Always create a recovery path before any action with data loss risk
2. Installation Checklist
bms-4 currently runs as root (OVH bare metal default). The following steps bring it into the same pattern as other agent VPSes.
Step 1 — Create claude-runner user
ssh root@54.36.123.110
# Create dedicated user (no password, no sudo by default)
useradd -m -s /bin/bash claude-runner
usermod -aG docker claude-runner # allow docker commands on this host only
# Create SSH directory for agent access
mkdir -p /home/claude-runner/.ssh
chmod 700 /home/claude-runner/.ssh
chown -R claude-runner:claude-runner /home/claude-runner/.ssh
# Create .claude directory for credentials
mkdir -p /home/claude-runner/.claude
chown -R claude-runner:claude-runner /home/claude-runner/.claude
Step 2 — Install Claude Code
# Install Node.js 22.x (required by Claude Code)
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt-get install -y nodejs
# Install Claude Code globally
npm install -g @anthropic-ai/claude-code
# Verify
claude --version
which claude # expect /usr/bin/claude or /usr/local/bin/claude
Step 3 — Copy OAuth credentials
From local workstation, copy a valid .credentials.json that has both accessToken and refreshToken:
# From local Windows workstation
scp C:\Users\konar\.claude\.credentials.json root@54.36.123.110:/home/claude-runner/.claude/.credentials.json
ssh root@54.36.123.110 "chown claude-runner:claude-runner /home/claude-runner/.claude/.credentials.json && chmod 600 /home/claude-runner/.claude/.credentials.json"
Verify that Claude Code runs as claude-runner (--version confirms the install only; the OAuth credentials are first exercised when the agent runs a prompt):
su - claude-runner -c "claude --version"
Step 4 — Create SSH key for remote diagnostics
The agent needs read-only SSH access to other servers for diagnostic commands.
# As root on bms-4 — generate key for claude-runner
su - claude-runner -c "ssh-keygen -t ed25519 -f /home/claude-runner/.ssh/id_bms4_agent -C 'ai-dev-bms4@bms-4' -N ''"
# Display public key — copy this to authorized_keys on other servers
cat /home/claude-runner/.ssh/id_bms4_agent.pub
Then on each target server, add the public key to the read-only diagnostic user:
# On vps-i1 (IONOS) — add to claude-admin for diagnostic SSH
ssh root@217.154.82.162 "echo '<bms4_agent_pubkey>' >> /home/claude-admin/.ssh/authorized_keys"
# On vps-h1 (Hostinger) — same
ssh root@72.60.32.61 "echo '<bms4_agent_pubkey>' >> /home/claude-admin/.ssh/authorized_keys"
# On bms-3 — same
ssh ubuntu@51.68.155.224 "echo '<bms4_agent_pubkey>' >> /home/ubuntu/.ssh/authorized_keys"
Configure SSH client to use the correct key per host:
cat > /home/claude-runner/.ssh/config << 'EOF'
Host vps-i1 217.154.82.162
HostName 217.154.82.162
User claude-admin
IdentityFile ~/.ssh/id_bms4_agent
StrictHostKeyChecking no
ConnectTimeout 10
Host vps-h1 72.60.32.61
HostName 72.60.32.61
User claude-admin
IdentityFile ~/.ssh/id_bms4_agent
StrictHostKeyChecking no
ConnectTimeout 10
Host bms-3 51.68.155.224
HostName 51.68.155.224
User ubuntu
IdentityFile ~/.ssh/id_bms4_agent
StrictHostKeyChecking no
ConnectTimeout 10
Host bms-2 145.239.133.104
HostName 145.239.133.104
User ubuntu
IdentityFile ~/.ssh/id_bms4_agent
StrictHostKeyChecking no
ConnectTimeout 10
EOF
chown claude-runner:claude-runner /home/claude-runner/.ssh/config
chmod 600 /home/claude-runner/.ssh/config
Step 5 — Set up GitHub credentials
# Install gh CLI
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg \
| dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" \
| tee /etc/apt/sources.list.d/github-cli.list
apt-get update && apt-get install -y gh
# Create env file with GitHub token — populated from .env.local
cat > /home/claude-runner/.claude-env << 'EOF'
export GITHUB_TOKEN=<value from .env.local P24_INFRA_GH_TOKEN>
export GH_TOKEN=$GITHUB_TOKEN
export DISCORD_WEBHOOK_URL=<value from .env.local P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL>
export SUPABASE_URL=<value from .env.local SUPABASE_URL>
export SUPABASE_SERVICE_KEY=<value from .env.local SUPABASE_SERVICE_KEY>
export PROMETHEUS_URL=http://217.154.82.162:9090
EOF
chmod 600 /home/claude-runner/.claude-env
chown claude-runner:claude-runner /home/claude-runner/.claude-env
Never store the actual token values in this document. Populate .claude-env from .env.local on the local workstation.
Step 6 — Clone p24-infra repository
# As claude-runner — dedicated clone (NOT /opt/p24-infra which is the deployment copy)
su - claude-runner
source ~/.claude-env
git clone https://${GITHUB_TOKEN}@github.com/radieu/p24-infra.git ~/p24-infra
cd ~/p24-infra
git checkout dev
git remote set-url origin https://github.com/radieu/p24-infra.git # remove token from remote URL
The wrapper script injects the token at runtime.
Step 7 — Create wrapper script
cat > /root/bms4-nightly.sh << 'SCRIPT'
#!/usr/bin/env bash
set -euo pipefail
LOG="/var/log/bms4-nightly.log"
echo "=== bms4-nightly START $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$LOG"
# Run as claude-runner
runuser -u claude-runner -- bash -c '
source ~/.claude-env
cd ~/p24-infra
git remote set-url origin "https://${GITHUB_TOKEN}@github.com/radieu/p24-infra.git"
git fetch origin dev
git reset --hard origin/dev
git remote set-url origin "https://github.com/radieu/p24-infra.git"
claude --dangerously-skip-permissions -p "/process-issues" \
--allowedTools "Bash,Read,Edit,Write,Glob,Grep,PowerShell"
' >> "$LOG" 2>&1
echo "=== bms4-nightly END $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$LOG"
SCRIPT
chmod +x /root/bms4-nightly.sh
Step 8 — Create cron job
# Nightly at 02:05 UTC (5 min after GitHub Actions health-check creates issues)
cat > /etc/cron.d/bms4-nightly << 'EOF'
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
5 2 * * * root /root/bms4-nightly.sh >> /var/log/bms4-nightly.log 2>&1
EOF
chmod 644 /etc/cron.d/bms4-nightly
Step 9 — Create CLAUDE.md for the agent
See Section 5 — Agent Configuration for the full CLAUDE.md content.
# Place the CLAUDE.md in the clone
cp /opt/p24-infra/docs/evaluation/bms4-agent-CLAUDE.md \
/home/claude-runner/p24-infra/.claude/CLAUDE.md
Step 10 — Register in GitHub as AI-Dev-BMS4
Following the AI runner provisioning procedure from CLAUDE.md:
- Create email routing rule in Cloudflare for ai-dev-bms4@zintegrowana.online
- Sign up GitHub account with that email
- Add as collaborator to radieu/p24-infra with write permission
- Create human-action issue: “Confirm GitHub invitation for AI-Dev-BMS4”
Step 11 — Verify
# Test Claude Code auth
su - claude-runner -c "claude --version && echo AUTH_OK"
# Test GitHub auth
su - claude-runner -c "source ~/.claude-env && gh auth status"
# Test repo access
su - claude-runner -c "source ~/.claude-env && cd ~/p24-infra && gh issue list --repo radieu/p24-infra --limit 5"
# Test SSH diagnostics
su - claude-runner -c "ssh vps-i1 'docker ps --format table 2>/dev/null | head -5'"
# Test Discord notification
su - claude-runner -c "source ~/.claude-env && curl -s -X POST \"\$DISCORD_WEBHOOK_URL\" \
-H 'Content-Type: application/json' \
-d '{\"content\":\"AI-Dev-BMS4 setup verified — nightly agent ready\"}'"
3. Issue Pickup Logic
Triage Decision Tree
New GitHub issue in radieu/p24-infra
│
├── Labels include "atrax-stale"
│ → check n8n workflow status on bms-4
│ → if n8n healthy, attempt workflow restart via API
│ → if n8n down, escalate immediately (Tier 1)
│
├── Labels include "server-down" / "infra-check-fail"
│ → identify failed component from issue title
│ → run targeted diagnostic (curl, ssh, docker ps)
│ → if container restart would fix: restart it
│ → if requires human SSH/physical: escalate
│
├── Labels include "failed-gh-actions"
│ → check workflow logs via gh API
│ → if transient (timeout, network): re-trigger workflow
│ → if code/config bug: open fix PR
│ → if secret expired / runner down: escalate
│
├── Labels include "security"
│ → never auto-remediate; always escalate to human
│ → add "human-action" label immediately
│
├── Labels include "triage" only (no routing label)
│ → apply /process-issues skill for standard triage
│
└── All other issues
→ apply /process-issues skill (Design → In Progress → PR)
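The decision tree above can be sketched as a small routing function. This is a minimal illustration, not the agent's actual implementation: the function name route_issue and the action names it prints are hypothetical, and checking the security label first is an assumption based on the "never auto-remediate" rule.

```bash
#!/usr/bin/env bash
# route_issue LABELS: map a comma-separated label list to a routing action.
# Hypothetical helper; action names are illustrative, not taken from the repo.
route_issue() {
  # Wrap in commas so every label can be matched as ",label,"
  local labels=",$1,"
  case "$labels" in
    *,security,*)                          echo "escalate-human" ;;        # assumed to win over all others
    *,atrax-stale,*)                       echo "check-n8n" ;;
    *,server-down,*|*,infra-check-fail,*)  echo "run-diagnostics" ;;
    *,failed-gh-actions,*)                 echo "inspect-workflow-logs" ;;
    *)                                     echo "process-issues" ;;        # triage-only and everything else
  esac
}

route_issue "triage,atrax-stale"
```

Wrapping the list in commas avoids partial matches, so a label such as security-review would not be treated as security.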
Issue Resolution Flow
1. CLAIM — gh issue edit <N> --add-label "in-progress"
— add comment: "AI-Dev-BMS4 picking up this issue [timestamp]"
2. DIAGNOSE — run relevant check commands (see Part 2)
— record findings in a comment
3. ACT
├── Resolvable automatically:
│ — implement fix (code change, config update, container restart)
│ — open PR to dev
│ — comment: "Fix in PR #NN"
│
└── Not resolvable:
— add comment: "Diagnostics complete. Root cause: <description>"
— add label "human-action"
— send Discord alert (see Section 4)
4. CLOSE (if fully resolved by PR merge, or label human-action and move on)
Capacity Limit
AI-Dev-BMS4 runs up to 4 parallel Claude Code processes during the nightly window (02:00–06:00 UTC). Use Supabase agent_sessions table with worker_env = 'bms4' to prevent over-claiming:
# Check active sessions before spawning
ACTIVE=$(curl -sf \
-H "apikey: $SUPABASE_SERVICE_KEY" \
"$SUPABASE_URL/rest/v1/agent_sessions?status=eq.active&worker_env=eq.bms4&select=count" \
| python3 -c "import sys,json; d=json.load(sys.stdin); print(d[0]['count'])")
if [ "$ACTIVE" -ge 4 ]; then
echo "At capacity ($ACTIVE/4 agents). Backing off."
exit 0
fi
4. Escalation Rules
Escalation Triggers
| Condition | Tier | Action |
|---|---|---|
| Tier 1 service down (WAHA, Atrax GPS, Pinbox24 API, Supabase) | Immediate | Discord + GitHub issue comment + human-action label |
| MongoDB rs0 has no PRIMARY | Immediate | Discord + GitHub issue |
| bms-1 disk > 95% | Immediate | Discord + GitHub issue |
| Security issue (CVE, auth anomaly, credential expiry) | Immediate | Discord + GitHub issue + human-action |
| Issue cannot be diagnosed (SSH unreachable, auth failure) | 15 min | Discord warning + GitHub comment |
| Auto-fix attempted but failed twice | 30 min | Discord + human-action label |
| Agent OAuth expired | N/A | System cannot self-notify — GitHub Actions health-check backstop handles this |
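The "failed twice" rule in the table can be illustrated with a small wrapper. This is a sketch under assumptions: try_fix_or_escalate is a hypothetical name, the fix command is whatever the caller passes in, and the 30-minute timing from the table is not modelled.

```bash
#!/usr/bin/env bash
# try_fix_or_escalate CMD [ARGS...]: run a fix command at most twice.
# Prints FIXED on success; prints ESCALATE and returns 1 if both attempts fail.
# Hypothetical helper, shown only to illustrate the escalation rule.
try_fix_or_escalate() {
  local attempt
  for attempt in 1 2; do
    if "$@"; then
      echo "FIXED on attempt $attempt"
      return 0
    fi
  done
  echo "ESCALATE"
  return 1
}

try_fix_or_escalate true
```

A caller would react to the non-zero return by sending the Discord alert and applying the human-action label.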
Discord Notification Format
# Discord alert function — call this from within the agent or nightly script
send_discord_alert() {
local SEVERITY="$1" # "CRITICAL" | "WARNING" | "INFO"
local TITLE="$2"
local DESCRIPTION="$3"
local ISSUE_URL="${4:-}"
local COLOR PAYLOAD
case "$SEVERITY" in
CRITICAL) COLOR=15158332 ;; # red
WARNING) COLOR=16776960 ;; # yellow
INFO) COLOR=3066993 ;; # green
*) COLOR=9807270 ;; # grey
esac
# printf %b turns a literal \n in the description into a real newline.
# The url key is added only when an issue URL was passed.
PAYLOAD=$(jq -nc \
--arg title "[$SEVERITY] $TITLE" \
--arg desc "$(printf '%b' "$DESCRIPTION")" \
--arg url "$ISSUE_URL" \
--argjson color "$COLOR" \
'{embeds: [{title: $title, description: $desc, color: $color} + (if $url == "" then {} else {url: $url} end)]}')
curl -s -X POST "$DISCORD_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d "$PAYLOAD" || true
}
# Usage examples:
send_discord_alert "CRITICAL" \
"AI-Dev-BMS4: WAHA session FAILED" \
"WhatsApp gateway session not WORKING. GPS incidents cannot be received. Manual restart required.\nHost: waha2.vps-h1.infra.zintegrowana.online" \
"https://github.com/radieu/p24-infra/issues/123"
send_discord_alert "WARNING" \
"AI-Dev-BMS4: bms-3 RAM at 87%" \
"MongoDB on bms-3 consuming 21.7 GB. Available RAM below safe threshold.\nConsider planned mongod restart during low-traffic window."
GitHub Issue Labeling on Escalation
# When escalating an issue to human:
gh issue edit "$ISSUE_NUMBER" \
--repo radieu/p24-infra \
--add-label "human-action"
gh issue comment "$ISSUE_NUMBER" \
--repo radieu/p24-infra \
--body "$(cat << EOF
## AI-Dev-BMS4 Escalation Report
**Timestamp:** $(date -u +%Y-%m-%dT%H:%M:%SZ)
**Agent:** AI-Dev-BMS4 (bms-4, 54.36.123.110)
**Reason for escalation:** <specific reason>
### Diagnostics performed
<list of commands run and their outputs>
### Root cause assessment
<what the agent determined>
### Recommended human action
<specific steps for the human operator>
### Cannot proceed because
<specific blocker — e.g., "Requires MongoDB admin password", "Requires physical server access">
EOF
)"
5. Agent Configuration
CLAUDE.md for bms-4 agent
Place at /home/claude-runner/p24-infra/.claude/CLAUDE.md (overrides repo-level for this agent instance):
# CLAUDE.md — AI-Dev-BMS4 Nightly Agent
## Agent Identity
- **Label:** AI-Dev-BMS4
- **Host:** bms-4 (54.36.123.110, Ubuntu 22.04)
- **Role:** Nightly p24-infra issue processing + MongoDB maintenance
- **Max parallel agents:** 4
- **Active window:** 02:00–06:00 UTC
## Primary Tasks
1. Pick up GitHub issues in `radieu/p24-infra` with labels: `triage`, `server-down`,
`infra-check-fail`, `atrax-stale`, `failed-gh-actions`
2. Run /process-issues skill for triage/design/implementation
3. MongoDB rs0 health check commands (read-only mongosh)
4. Disk usage checks on all servers via SSH
5. Escalate unsolvable issues via Discord + human-action label
## Permissions
- Docker commands: ALLOWED on bms-4 only (this server)
- SSH diagnostics: ALLOWED to vps-i1, vps-h1, bms-2, bms-3 (read-only)
- GitHub PRs: ALWAYS target `dev` branch
- MongoDB: READ-ONLY — `rs.status()`, `db.serverStatus()`, profiler queries only
- NEVER: write to MongoDB, restart containers on remote servers, push to main
## Environment
Source `~/.claude-env` before any command requiring GITHUB_TOKEN or DISCORD_WEBHOOK_URL.
## Error Reporting
All errors → Discord via DISCORD_WEBHOOK_URL + GitHub issue comment
.env.local secrets required on bms-4
The agent’s .claude-env (at /home/claude-runner/.claude-env) must contain:
| Variable | Source in .env.local | Purpose |
|---|---|---|
| GITHUB_TOKEN | P24_INFRA_GH_TOKEN | gh CLI authentication |
| GH_TOKEN | same | alias for gh |
| DISCORD_WEBHOOK_URL | P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL | alerts |
| SUPABASE_URL | SUPABASE_URL | agent_sessions coordination |
| SUPABASE_SERVICE_KEY | SUPABASE_SERVICE_KEY | insert/read agent_sessions |
| PROMETHEUS_URL | http://217.154.82.162:9090 | metrics queries |
The MongoDB admin password is NOT stored on bms-4 — the agent runs only read-only mongosh commands using the keyFile (internal cluster auth), not admin credentials.
systemd service file (alternative to cron)
For more reliable execution than cron (use this instead of the Step 8 cron entry, not alongside it, so the job does not run twice):
# /etc/systemd/system/bms4-nightly.service
[Unit]
Description=AI-Dev-BMS4 Nightly p24-infra Issue Processing
After=network.target
[Service]
Type=oneshot
User=root
ExecStart=/root/bms4-nightly.sh
StandardOutput=append:/var/log/bms4-nightly.log
StandardError=append:/var/log/bms4-nightly.log
TimeoutStartSec=3600
# /etc/systemd/system/bms4-nightly.timer
# No Requires= on the service here: that would start the job whenever the timer itself is started.
[Unit]
Description=AI-Dev-BMS4 Nightly Timer
[Timer]
OnCalendar=*-*-* 02:05:00 UTC
Persistent=true
[Install]
WantedBy=timers.target
systemctl daemon-reload
systemctl enable bms4-nightly.timer
systemctl start bms4-nightly.timer
systemctl list-timers bms4-nightly.timer
Part 2: Nightly Operations
6. Nightly Operations Schedule
All times UTC. AI-Dev-BMS4 orchestrates the sequence starting at 02:05 UTC after the initial GitHub Actions health checks at 02:00.
| Time (UTC) | Task | Responsible | Method |
|---|---|---|---|
| 02:00 | Health check — all Tier 1 services | GitHub Actions health-check.yml (every 2h) | GH Actions on ionos runner |
| 02:00 | Atrax GPS freshness check | GitHub Actions atrax-data-freshness.yml (every 10 min) | GH Actions on ionos runner |
| 02:05 | AI-Dev-BMS4 wakes up — picks up any open issues | AI-Dev-BMS4 cron | bms-4 |
| 02:10 | Disk usage check — all 5 servers | AI-Dev-BMS4 | SSH + df -h |
| 02:30 | MongoDB rs0 health check + replication metrics | AI-Dev-BMS4 | mongosh on bms-3 via SSH |
| 02:30 | DB maintenance VACUUM (Mon–Sat) | GitHub Actions db-maintenance.yml | GH Actions on ionos runner |
| 03:00 | Docker container audit — all hosts | AI-Dev-BMS4 | SSH + docker ps |
| 03:00 | DB weekly maintenance (Sunday only — REINDEX) | GitHub Actions db-maintenance.yml | GH Actions on ionos runner |
| 03:30 | SSL certificate expiry check | AI-Dev-BMS4 | curl + openssl |
| 04:00 | Trivy CVE scan | GitHub Actions trivy-scan.yml | GH Actions on ionos runner |
| 04:00 | n8n SQLite maintenance (Sunday only) | GitHub Actions n8n-maintenance.yml | GH Actions on ionos runner |
| 04:30 | Security log review (fail2ban, auth attempts) | AI-Dev-BMS4 | SSH + log analysis |
| 05:00 | Morning summary report — Discord | AI-Dev-BMS4 | Discord webhook |
| 05:00 | p24-infra issue pipeline (triage + design) | IONOS cron p24-infra-nightly.sh | vps-i1 |
| 05:30 | Close resolved issues, tag remaining | AI-Dev-BMS4 | gh CLI |
Time Budget (bms-4 resources, 32 GB RAM)
| Phase | Duration | Claude agents | RAM usage |
|---|---|---|---|
| Issue pickup + triage | 02:05–03:30 | 1–4 | 2.8–5.6 GB |
| MongoDB + disk checks | 02:10–02:40 | 0 (scripts) | <100 MB |
| Security + SSL | 03:30–04:30 | 1–2 | 1.4–2.8 GB |
| Report generation | 04:30–05:30 | 1 | 700 MB |
Total RAM budget: 30 GB free → well within limits even with 4 agents.
7. Tier 1 Critical Service Checks
These checks run at 02:00 UTC via health-check.yml and every 10 minutes for Atrax. AI-Dev-BMS4 supplements with deeper diagnostics when alerts are raised.
7.1 Atrax GPS Sync (n8n workflow)
What it is: n8n workflow ID AJ1px9uHIfbsriof syncs GPS tracking data from Atrax API to Supabase p24_gps_current_state table every 5 minutes.
Check command (run by GitHub Actions every 10 min):
# Check freshness — stale if last sync > 10 minutes ago
RESPONSE=$(curl -sf \
-H "apikey: $SUPABASE_SERVICE_KEY" \
"$SUPABASE_URL/rest/v1/p24_gps_current_state?select=n8n_synced_at&order=n8n_synced_at.desc&limit=1")
LAST=$(echo "$RESPONSE" | jq -r '.[0].n8n_synced_at // empty')
# An empty timestamp would make date -d resolve to midnight today, so treat it as a failure
[ -n "$LAST" ] || { echo "STALE: no n8n_synced_at returned"; exit 1; }
AGE=$(( $(date -u +%s) - $(date -u -d "$LAST" +%s) ))
[ "$AGE" -lt 600 ] && echo "OK: ${AGE}s" || echo "STALE: ${AGE}s"
Success criterion: n8n_synced_at within last 600 seconds (10 min)
Failure action (AI-Dev-BMS4):
# 1. Check n8n container health on bms-4 (post-migration) or vps-h1 (pre-migration)
ssh root@54.36.123.110 "docker inspect root-n8n-1 --format '{{.State.Status}}'"
# 2. Check the most recent executions of the workflow via the n8n API
curl -s -H "X-N8N-API-KEY: $N8N_API_KEY" \
"https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?workflowId=AJ1px9uHIfbsriof&limit=3" \
| jq '.data[] | {finished, status, startedAt}'
# 3. If container down: alert immediately (cannot auto-restart — requires n8n workflow logic)
# 4. If workflow stuck: post GitHub issue with label "atrax-stale"
SLA: Data must not be stale > 30 minutes. Page immediately at 30 min stale.
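The 10-minute freshness criterion and the 30-minute SLA together give three states. A minimal sketch, assuming the age in seconds has already been computed as in the check command; the function name classify_gps_staleness and the state names are hypothetical:

```bash
#!/usr/bin/env bash
# classify_gps_staleness AGE_SECONDS
#   < 600 s      -> OK     (within the success criterion)
#   600..1799 s  -> STALE  (raise an atrax-stale issue)
#   >= 1800 s    -> PAGE   (SLA breached, page immediately)
classify_gps_staleness() {
  local age="$1"
  if [ "$age" -lt 600 ]; then
    echo "OK"
  elif [ "$age" -lt 1800 ]; then
    echo "STALE"
  else
    echo "PAGE"
  fi
}

classify_gps_staleness 720
```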
7.2 Docker Daemon Health — bms-1 and bms-3
What it is: Pinbox24 production (bms-1) and staging (bms-3) run on Docker. Daemon down = all containers down.
Check commands:
# bms-1 (Pinbox24 production) — SSH as root
ssh -i ~/.ssh/id_bms4_agent -o ConnectTimeout=10 root@94.23.26.113 \
"systemctl is-active docker && docker ps --format '{{.Names}}' | wc -l"
# bms-3 (Pinbox24 staging + MongoDB primary)
ssh ubuntu@51.68.155.224 \
"systemctl is-active docker && docker ps --format '{{.Names}}' | wc -l"
Success criterion: docker service = active, container count matches expected (bms-1: ~24, bms-3: ~11)
Failure action: Immediate Discord CRITICAL alert. Cannot auto-remediate — escalate with human-action label.
7.3 WAHA WhatsApp Gateway
What it is: WAHA container on vps-h1 receives WhatsApp messages for incident management. Session must be in WORKING state (not just container running).
Check command:
# Server liveness + session state (both required)
SERVER=$(curl -s -o /dev/null -w "%{http_code}" --max-time 15 \
-H "X-Api-Key: $WAHA_API_KEY" \
"https://waha2.vps-h1.infra.zintegrowana.online/api/server/status")
SESS_STATUS=$(curl -s --max-time 15 \
-H "X-Api-Key: $WAHA_API_KEY" \
"https://waha2.vps-h1.infra.zintegrowana.online/api/sessions/default" \
| python3 -c 'import sys,json; print(json.load(sys.stdin).get("status","?"))' 2>/dev/null)
echo "server=$SERVER session=$SESS_STATUS"
[ "$SERVER" = "200" ] && [ "$SESS_STATUS" = "WORKING" ] && echo "OK" || echo "FAIL"
Success criterion: HTTP 200 AND session status = WORKING
Important: Server returning HTTP 200 does NOT mean session is healthy (proven during 2026-05-23 blackout). Always check session status separately.
Failure action:
- If container down: check ssh root@72.60.32.61 "docker ps | grep waha"
- If session FAILED/STOPPED: attempt session restart via WAHA API: POST /api/sessions/default/restart
- If restart fails: trigger waha-session-restart.yml via gh workflow run
- If persistent: escalate with human-action, as the session may require a re-scan of the QR code
7.4 MongoDB rs0 Replica Set Health
What it is: Three-member replica set (bms-3 PRIMARY, bms-2 SECONDARY observer, bms-4 ARBITER). Requires at least 2 of 3 members for election quorum.
Check command:
# Run via SSH to bms-3 (which has cluster auth access)
# Uses keyFile internal auth — no admin password needed for rs.status()
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
const s = rs.status();
const members = s.members.map(m => ({name: m.name, state: m.stateStr, health: m.health}));
printjson({set: s.set, ok: s.ok, members: members});
'"
Success criteria:
- Exactly 1 member in PRIMARY state
- health: 1 for all members
- Replication lag on bms-2 < 60 seconds
- No member in RECOVERING, DOWN, or UNKNOWN state
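The success criteria can be evaluated mechanically once the member list is flattened to text. The sketch below assumes a hypothetical input format of one member per line as name|state|health|lagSeconds, which the check command above does not produce as written. It is also stricter than the list: any state other than PRIMARY, SECONDARY, or ARBITER is flagged, and the lag limit is applied to every SECONDARY rather than to bms-2 only.

```bash
#!/usr/bin/env bash
# evaluate_rs: read "name|state|health|lagSeconds" lines on stdin and
# print OK, or one FAIL line per violated criterion.
# Input format and function name are assumptions made for this sketch.
evaluate_rs() {
  local primaries=0 problems=0 name state health lag
  while IFS='|' read -r name state health lag || [ -n "$name" ]; do
    [ -z "$name" ] && continue
    if [ "$state" = "PRIMARY" ]; then
      primaries=$((primaries + 1))
    fi
    if [ "$health" != "1" ]; then
      echo "FAIL: $name health=$health"
      problems=$((problems + 1))
    fi
    case "$state" in
      PRIMARY|SECONDARY|ARBITER) ;;
      *) echo "FAIL: $name state=$state"
         problems=$((problems + 1)) ;;
    esac
    if [ "$state" = "SECONDARY" ] && [ "${lag:-0}" -ge 60 ]; then
      echo "FAIL: $name lag=${lag}s"
      problems=$((problems + 1))
    fi
  done
  if [ "$primaries" -ne 1 ]; then
    echo "FAIL: expected 1 PRIMARY, found $primaries"
    problems=$((problems + 1))
  fi
  if [ "$problems" -eq 0 ]; then
    echo "OK"
  fi
  return 0
}

printf '%s\n' 'bms-3|PRIMARY|1|0' 'bms-2|SECONDARY|1|12' 'bms-4|ARBITER|1|0' | evaluate_rs
```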
Failure action: See Section 18 — Failover Runbook
7.5 Pinbox24 Production API
What it is: The production fleet management API endpoint.
Check command:
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
--connect-timeout 10 --max-time 20 \
"https://api.w4.pinbox24.com/api/")
[ "$STATUS" = "200" ] || [ "$STATUS" = "401" ] \
&& echo "API reachable (HTTP $STATUS)" \
|| echo "API FAIL (HTTP $STATUS)"
Success criterion: HTTP 200 or 401 (401 = API running but unauthenticated, which is correct)
Failure action: Immediate Discord CRITICAL. Check bms-1 Docker daemon. Cannot auto-remediate.
7.6 Supabase Connectivity and Queue Depths
What it is: Supabase is the primary database and coordination hub. Queue depth metrics reveal processing backlogs.
Check command:
# Connectivity
HTTP=$(curl -s -o /dev/null -w "%{http_code}" \
"$SUPABASE_URL/rest/v1/agent_sessions?select=count&limit=0" \
-H "apikey: $SUPABASE_SERVICE_KEY" --max-time 10)
echo "Supabase HTTP: $HTTP"
# Queue depths — check for backlogs
curl -sf \
-H "apikey: $SUPABASE_SERVICE_KEY" \
"$SUPABASE_URL/rest/v1/pending_transcriptions?select=count" \
| jq '.[0].count // 0' | xargs -I{} echo "pending_transcriptions: {}"
curl -sf \
-H "apikey: $SUPABASE_SERVICE_KEY" \
"$SUPABASE_URL/rest/v1/pending_pdf_processing?select=count" \
| jq '.[0].count // 0' | xargs -I{} echo "pending_pdf_processing: {}"
Thresholds: pending_transcriptions > 100 or pending_pdf_processing > 50 = warning alert.
7.7 Disk Usage — Critical Servers
What it is: bms-1 is already at 100% disk capacity. All servers need monitoring.
Check commands:
# bms-1 (Pinbox24 production) — CRITICAL (already at 100%)
ssh root@94.23.26.113 "df -h / | tail -1"
# bms-3 (staging + MongoDB primary)
ssh ubuntu@51.68.155.224 "df -h / | tail -1"
# vps-i1 (monitoring)
ssh claude-admin@217.154.82.162 "df -h / | tail -1"
# vps-h1 (n8n + WAHA)
ssh root@72.60.32.61 "df -h / | tail -1"
# bms-4 (self — this server)
df -h / | tail -1
Thresholds:
| Server | Warning | Critical | Action |
|---|---|---|---|
| bms-1 | 90% | 95% | Immediate escalation — disk already full |
| bms-3 | 70% | 80% | Docker prune + old log cleanup |
| vps-i1 | 80% | 90% | Docker image prune |
| vps-h1 | 75% | 85% | Docker prune |
| bms-4 | 60% | 75% | n8n data + Docker prune |
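The threshold table can be applied to the df output with a per-server lookup. A minimal sketch: disk_severity is a hypothetical helper, and treating a value equal to the threshold as a breach is an assumption, since the table does not say whether the limits are inclusive.

```bash
#!/usr/bin/env bash
# disk_severity SERVER USED_PCT: classify disk usage against the table above.
# USED_PCT may carry a trailing % as printed by df.
disk_severity() {
  local server="$1" used="${2%\%}" warn crit
  case "$server" in
    bms-1)  warn=90; crit=95 ;;
    bms-3)  warn=70; crit=80 ;;
    vps-i1) warn=80; crit=90 ;;
    vps-h1) warn=75; crit=85 ;;
    bms-4)  warn=60; crit=75 ;;
    *)      echo "UNKNOWN"; return 1 ;;
  esac
  if [ "$used" -ge "$crit" ]; then
    echo "CRITICAL"
  elif [ "$used" -ge "$warn" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

disk_severity bms-3 "72%"
```

The percentage itself can be taken from the fifth column of the df -h / | tail -1 line used in the check commands.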
Cleanup commands for bms-3 (auto-applicable):
ssh ubuntu@51.68.155.224 "
# Remove dangling Docker images (running containers are unaffected)
docker image prune -f
# Remove stopped containers
docker container prune -f
# Cap oversized logs at 100 MB (truncate keeps the first 100 MB of each file, i.e. the oldest entries)
find /var/log -name '*.log' -size +100M -exec truncate -s 100M {} \;
"
8. Tier 2 Platform Checks
These run nightly by AI-Dev-BMS4 at 03:00–04:00 UTC. Failure triggers a 30-minute response window.
8.1 et-operational-platform Vercel Health
# Check production deployment
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
--max-time 15 "${VERCEL_ETOP_PROD_URL}/api/health" 2>/dev/null \
|| true) # curl already prints 000 when the connection fails
echo "Vercel production: HTTP $STATUS"
# Check for recent failed deployments via Vercel API (createdAt is in milliseconds)
curl -sf "https://api.vercel.com/v6/deployments?limit=5&target=production" \
-H "Authorization: Bearer $VERCEL_TOKEN" \
| jq '.deployments[] | {url, state, created: (.createdAt / 1000 | floor | todate)}'
Thresholds: Any ERROR state deployment in last 24h → warning issue.
8.2 Grafana + Prometheus on vps-i1
# Grafana
curl -sf --max-time 10 \
"https://grafana.vps-i1.infra.zintegrowana.online/api/health" \
| jq .database
# Prometheus
curl -sf --max-time 10 \
"http://217.154.82.162:9090/-/healthy" && echo "Prometheus healthy"
# Check Prometheus targets (any DOWN?)
curl -sf "http://217.154.82.162:9090/api/v1/targets" \
| jq '[.data.activeTargets[] | select(.health != "up")] | length' \
| xargs -I{} echo "Unhealthy targets: {}"
8.3 n8n Workflow Execution Failures
# Check recent executions — failures in last 24h (counted among the 20 most recent errors)
curl -sf \
-H "X-N8N-API-KEY: $N8N_API_KEY" \
"https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?status=error&limit=20" \
| jq '[.data[] | select((.startedAt | sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601) > (now - 86400))] | length' \
| xargs -I{} echo "Failed executions (24h): {}"
# Get workflow names of failures
curl -sf \
-H "X-N8N-API-KEY: $N8N_API_KEY" \
"https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?status=error&limit=10" \
| jq '.data[] | {workflow: .workflowData.name, startedAt}'
Threshold: > 5 failed executions in 24h → GitHub issue with label n8n-errors.
8.4 SSL Certificate Expiry
# Check all public-facing domains — warn if < 14 days
DOMAINS=(
"grafana.vps-i1.infra.zintegrowana.online"
"n8n.bms-4.infra.zintegrowana.online"
"waha2.vps-h1.infra.zintegrowana.online"
"traccar.vps-i1.infra.zintegrowana.online"
)
for domain in "${DOMAINS[@]}"; do
EXPIRY=$(echo | openssl s_client -servername "$domain" -connect "${domain}:443" 2>/dev/null \
| openssl x509 -noout -enddate 2>/dev/null \
| cut -d= -f2)
DAYS=$(( ( $(date -d "$EXPIRY" +%s) - $(date +%s) ) / 86400 ))
if [ "$DAYS" -lt 14 ]; then
echo "WARNING: $domain cert expires in $DAYS days"
else
echo "OK: $domain expires in $DAYS days"
fi
done
Threshold: < 14 days → warning issue. < 7 days → CRITICAL, immediate Discord alert.
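The loop above only distinguishes the 14-day warning. A minimal sketch of the full two-level threshold, assuming the number of remaining days has already been computed; cert_severity is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# cert_severity DAYS_LEFT: map days until expiry to an alert level.
#   < 7  -> CRITICAL (immediate Discord alert)
#   < 14 -> WARNING  (warning issue)
#   else -> OK
cert_severity() {
  local days="$1"
  if [ "$days" -lt 7 ]; then
    echo "CRITICAL"
  elif [ "$days" -lt 14 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

cert_severity 10
```

An already expired certificate yields a negative day count and therefore also maps to CRITICAL.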
8.5 Memory Usage — bms-3 MongoDB Risk
# bms-3 is most at risk: MongoDB using 21.7 GB of 32 GB total
ssh ubuntu@51.68.155.224 "
echo '--- Memory ---'
free -h
echo '--- MongoDB process ---'
ps aux --sort=-%mem | grep mongod | head -3
echo '--- Docker containers ---'
docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}' | head -15
"
Threshold: Available memory < 4 GB on bms-3 → WARNING (MongoDB OOM risk).
8.6 Traefik Routing Health on bms-4
# Check Traefik dashboard API (not public)
ssh root@54.36.123.110 "
docker exec root-traefik-1 wget -qO- http://localhost:8080/api/rawdata 2>/dev/null \
| python3 -c \"import sys,json; d=json.load(sys.stdin); print('routers:', len(d.get('routers', {})))\"
docker inspect root-traefik-1 --format '{{.State.Status}}'
"
9. Tier 3 Quality Checks
These run nightly and generate items for the morning report. No immediate alerting — create GitHub issues for tracking.
9.1 Supabase Slow Query Report
# Top 10 slowest queries by mean execution time
psql "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF sslmode=require dbname=postgres" \
-c "SELECT
LEFT(query, 80) AS query_snippet,
calls,
ROUND(mean_exec_time::numeric, 2) AS mean_ms,
ROUND(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;"
9.2 Docker Image CVE Summary
Handled by trivy-scan.yml at 04:00 UTC. AI-Dev-BMS4 checks for open Trivy issues:
gh issue list --repo radieu/p24-infra --label security --state open \
--json number,title,createdAt | jq '.[] | select(.title | contains("Trivy"))'
9.3 Backup Freshness Verification
# Check Wasabi backup status JSON (written by backup-exporter)
curl -sf "http://217.154.82.162:9220/metrics" \
| grep "backup_last_successful_timestamp" \
| awk '{print $1, strftime("%Y-%m-%d %H:%M", $2)}'
# Check n8n backup freshness (last Sunday)
python3 - << 'EOF'
import boto3, certifi, datetime, os
s3 = boto3.client("s3",
endpoint_url="https://s3.eu-central-2.wasabisys.com",
region_name="eu-central-2",
verify=certifi.where()
)
objs = s3.list_objects_v2(Bucket="p24-infra", Prefix="n8n/")
if objs.get("Contents"):
latest = max(objs["Contents"], key=lambda x: x["LastModified"])
age = (datetime.datetime.now(datetime.timezone.utc) - latest["LastModified"]).days
print(f"n8n backup: {latest['Key']} ({age} days ago)")
EOF
9.4 GitHub Actions CI/CD Pipeline Health
# Check failed scheduled workflows in last 24h
gh run list --repo radieu/p24-infra \
--status failure \
--created ">=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
--json workflowName,createdAt,url \
| jq '.[] | {workflow: .workflowName, time: .createdAt, url}'
9.5 MongoDB Slow Query Log
# Check MongoDB profiler on bms-3 for queries > 100ms (last 24h)
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
db = db.getSiblingDB(\"admin\");
// Check profiler level (should be 1 = slow ops only, or 2 = all)
print(\"profiler:\", JSON.stringify(db.getProfilingStatus()));
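// Query system.profile for recent slow operations.
// system.profile exists per profiled database, so getSiblingDB must name the database being profiled.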
db.getSiblingDB(\"local\").system.profile.find(
{ millis: { \$gte: 100 }, ts: { \$gte: new Date(Date.now() - 86400000) } },
{ ns: 1, millis: 1, ts: 1, op: 1 }
).sort({ millis: -1 }).limit(10).forEach(printjson);
'"
10. Supabase Maintenance
Nightly VACUUM (02:30 UTC, Mon–Sat)
Handled by db-maintenance.yml. This workflow connects via the Supabase pooler (IPv4-compatible) on port 5432 (session mode — required for VACUUM).
# Command reference for manual execution:
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "VACUUM ANALYZE;"
Weekly REINDEX (Sunday 03:00 UTC)
# REINDEX CONCURRENTLY: does not block reads/writes
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "REINDEX DATABASE CONCURRENTLY postgres;"
Monthly Stats Reset
# Reset pg_stat_statements counters monthly to avoid stale data
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "SELECT pg_stat_statements_reset();"
Table Bloat Check
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" << 'SQL'
-- Size report used as a bloat proxy: the 20 largest tables.
-- index_size is total minus heap, so it also includes TOAST data.
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)
    - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
SQL
11. n8n Workflow Health
Key Workflows to Monitor
| Workflow | ID | Schedule | Criticality |
|---|---|---|---|
| Atrax GPS Sync | AJ1px9uHIfbsriof | Every 5 min | Tier 1 |
| WAHA incident routing | (varies) | Event-triggered | Tier 1 |
| Daily GPS report | (varies) | 06:00 UTC | Tier 2 |
| Supabase queue drain | (varies) | Continuous | Tier 2 |
n8n Health Check Script
N8N_BASE="https://n8n.bms-4.infra.zintegrowana.online"
# Container health
ssh root@54.36.123.110 "docker inspect --format '{{.State.Health.Status}}' root-n8n-1"
# API health endpoint
curl -sf "${N8N_BASE}/healthz" && echo "n8n healthy"
# Active workflows count
curl -sf -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/workflows?active=true" \
  | jq '.data | length' | xargs -I{} echo "Active workflows: {}"
# Recent failures (workflowData may be absent from the list response, so workflowId is printed too)
curl -sf -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/executions?status=error&limit=5" \
  | jq '.data[] | {workflowId, workflow: .workflowData.name, startedAt, status}'
Atrax Workflow Restart Procedure
If Atrax GPS sync is stale but n8n is healthy:
# Make sure the Atrax GPS sync workflow is active. This (re)registers its
# schedule trigger; it does not start an execution by itself.
curl -X POST \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  -H "Content-Type: application/json" \
  "${N8N_BASE}/api/v1/workflows/AJ1px9uHIfbsriof/activate" \
  -d '{}'
# If the workflow is already active but not executing, trigger a test run.
# NOTE: confirm that this endpoint exists on the deployed n8n version before
# relying on it; deactivating and re-activating the workflow is the fallback.
curl -X POST \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  -H "Content-Type: application/json" \
  "${N8N_BASE}/api/v1/workflows/AJ1px9uHIfbsriof/test" \
  -d '{"pinData":{}}'
12. Disk Usage Monitoring
Per-Server Thresholds and Cleanup Procedures
bms-1 (Pinbox24 production — CRITICAL, already at 100%)
ssh root@94.23.26.113 "
  echo '=== Disk usage ==='
  df -h /
  echo '=== Top space consumers ==='
  du -sh /var/lib/docker /var/log /tmp 2>/dev/null | sort -rh | head -10
  echo '=== Docker disk usage ==='
  docker system df
"
Auto-cleanup (safe on bms-1):
ssh root@94.23.26.113 "
  # Remove stopped containers (not running Pinbox24)
  docker container prune -f
  # Remove dangling images only (NOT all unused, to avoid removing p24 images)
  docker image prune -f
  # Truncate large log files to 10 MB (this keeps the oldest 10 MB of each file)
  find /var/log -name '*.log' -size +50M -exec truncate -s 10M {} \;
  # Clear /tmp entries older than 7 days
  find /tmp -mtime +7 -delete 2>/dev/null || true
"
Alert threshold: > 95% usage. The disk is already at 100%, so any failed write could take down production.
bms-3 (staging + MongoDB)
ssh ubuntu@51.68.155.224 "
  df -h /
  du -sh /var/lib/mongodb /var/lib/docker 2>/dev/null
"
Auto-cleanup:
ssh ubuntu@51.68.155.224 "
  docker image prune -f
  docker container prune -f
  sudo journalctl --vacuum-time=7d
"
Alert threshold: > 70% warning, > 80% critical.
bms-4 (this server)
df -h /
du -sh /var/lib/docker /var/lib/mongodb /home/claude-runner 2>/dev/null
docker system df
Alert threshold: > 60% warning (currently at 1%, 1.7 TB free).
13. Security Nightly Checks
13.1 fail2ban Status
# Check banned IPs on all servers
for host in root@94.23.26.113 ubuntu@51.68.155.224 root@54.36.123.110 root@72.60.32.61; do
  echo "=== $host ==="
  ssh -o ConnectTimeout=10 "$host" \
    "fail2ban-client status sshd 2>/dev/null | grep -E 'Total banned|Currently banned'" \
    2>/dev/null || echo "fail2ban not running or unreachable"
done
Alert threshold: > 50 new bans in 24h = potential brute-force attack in progress.
13.2 SSH Auth Log Review
# Failed SSH login attempts (last 24h) on each server
# %e pads the day with a space ("Jun  4"), matching the classic syslog timestamp in auth.log;
# %d would produce "Jun 04" and miss days 1 to 9
for host in ubuntu@51.68.155.224 root@54.36.123.110; do
  echo "=== $host failed SSH ==="
  ssh -o ConnectTimeout=10 "$host" \
    "grep 'Failed password\|Invalid user\|Authentication failure' /var/log/auth.log 2>/dev/null \
     | grep \"$(date -u --date='24 hours ago' '+%b %e')\\|$(date -u '+%b %e')\" \
     | wc -l" 2>/dev/null | xargs -I{} echo "Failed attempts: {}"
done
Alert threshold: > 200 failed attempts/24h on any single server.
13.3 SSL Certificate Expiry
Covered in Section 8.4. Run nightly — auto-creates GitHub issue if < 14 days.
13.4 Credential Rotation Check
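# Check credential-exporter metrics for rotation age (oldest first)
# The value is the last field of each sample line; "# HELP" / "# TYPE" lines are skipped
curl -sf "http://217.154.82.162:9210/metrics" \
  | grep '^credential_age_days' \
  | awk '{print $NF "\t" $0}' \
  | sort -rn \
  | head -10
Alert threshold: Any credential older than 80 days (rotation should be every 90 days).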
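The 80-day threshold can also be applied programmatically to the exporter output. The following is a minimal sketch, assuming the exporter emits standard Prometheus text-exposition lines for `credential_age_days`; the function name, the `name` label in the example, and the sample values are illustrative assumptions, not taken from the exporter.

```python
import re

# Threshold from the alert definition above: flag credentials older than 80 days.
MAX_AGE_DAYS = 80

# Matches sample lines such as: credential_age_days{name="example"} 85
# (the label set is optional; label names are an assumption)
LINE_RE = re.compile(r'^credential_age_days(\{[^}]*\})?\s+([0-9.eE+-]+)\s*$')


def stale_credentials(metrics_text, max_age_days=MAX_AGE_DAYS):
    """Return (labels, age_days) pairs older than max_age_days, oldest first."""
    stale = []
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skips "# HELP", "# TYPE" and unrelated metrics
        labels = match.group(1) or ""
        age = float(match.group(2))
        if age > max_age_days:
            stale.append((labels, age))
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = (
        "# TYPE credential_age_days gauge\n"
        'credential_age_days{name="a"} 85\n'
        'credential_age_days{name="b"} 12\n'
    )
    for labels, age in stale_credentials(sample):
        print(f"ROTATE: {labels} is {age:.0f} days old")
```

In practice the text would come from the `curl` call above; parsing it in one place keeps the threshold in a single constant.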
13.5 MongoDB Unauthorized Access Attempts
# Check today's MongoDB log entries for auth failures (bms-3, likely PRIMARY)
# Filter by date first, then keep the last 50 matches
ssh ubuntu@51.68.155.224 "
  grep \"$(date -u '+%Y-%m-%d')\" /var/log/mongodb/mongod.log \
    | grep -i 'authentication\|authorization.*failed\|Unauthorized' \
    | tail -50
"
Part 3: MongoDB rs0 Maintenance
14. Replica Set Health Dashboard
Key Prometheus Metrics (PromQL)
The following metrics will be scraped by the planned mongodb-exporter integration. Until that exporter is deployed, use the mongosh commands below.
Replication lag (secondary behind primary):
# Once mongodb-exporter is deployed (PRIMARY optime minus SECONDARY optime gives a positive lag):
mongodb_rs_member_optime_date{state="PRIMARY"} - on() mongodb_rs_member_optime_date{state="SECONDARY"}
# Threshold alert: lag > 60s
Oplog window (how long a secondary can be offline before it falls off the oplog):
mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp
# Healthy: > 24h (86400s)
# Warning: < 4h
Active connections:
mongodb_connections{state="current"}
# Alert if > 80% of maxConnections
RAM usage (node-level proxy metric):
# Until mongodb-exporter is available, use node_memory_MemAvailable_bytes on bms-3
node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3
# Alert: available < 4 GB
mongosh Health Commands
Run these on bms-3 (PRIMARY) via SSH. Note that the keyFile only authenticates replica set members to each other; with access control enabled, mongosh clients still have to authenticate, so pass credentials for a monitoring user (for example one holding the built-in clusterMonitor role) if a command fails with an authentication error:
// Run: ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '...'"
// === FULL REPLICA SET STATUS ===
rs.status()
// === CONCISE MEMBER STATUS ===
rs.status().members.forEach(m => {
  const optime = m.optimeDate ? m.optimeDate.toISOString() : 'N/A';
  print(`${m.name} | ${m.stateStr} | health:${m.health} | ${optime}`);
});
// === REPLICATION LAG ===
// Compare each SECONDARY optime with the PRIMARY optime (Date difference is in ms)
const status = rs.status();
const primary = status.members.find(m => m.stateStr === 'PRIMARY');
const secondaries = status.members.filter(m => m.stateStr === 'SECONDARY');
secondaries.forEach(s => {
  const lag = primary.optimeDate - s.optimeDate;
  print(`${s.name} lag: ${lag / 1000}s`);
});
// === OPLOG WINDOW ===
const oplog = db.getSiblingDB('local').oplog.rs;
const head = oplog.find().sort({$natural: -1}).limit(1).next();
const tail = oplog.find().sort({$natural: 1}).limit(1).next();
const windowHours = (head.ts.t - tail.ts.t) / 3600;
print(`Oplog window: ${windowHours.toFixed(1)} hours`);
print(`Oplog max size: ${(oplog.stats().maxSize / 1024 / 1024 / 1024).toFixed(2)} GB`);
// === CONNECTION STATS ===
const cs = db.serverStatus().connections;
print(`Connections: current=${cs.current} available=${cs.available} totalCreated=${cs.totalCreated}`);
// === SLOW QUERIES (profiler, 20 most recent per database) ===
// system.profile lives in each profiled database, not in "local".
// Enable first if needed: db.getSiblingDB(<name>).setProfilingLevel(1, { slowms: 100 })
db.adminCommand({listDatabases: 1}).databases
  .filter(d => !['admin', 'local', 'config'].includes(d.name))
  .forEach(d => {
    db.getSiblingDB(d.name).system.profile.find({ millis: { $gte: 100 } })
      .sort({ ts: -1 }).limit(20).forEach(p => {
        print(`${p.ts.toISOString()} | ${p.op} | ${p.ns} | ${p.millis}ms`);
      });
  });
Grafana Dashboard Panels (to create)
| Panel | Metric/Query | Visualization |
|---|---|---|
| rs0 member states | rs.status() via scraper | Status map (3 nodes) |
| Replication lag | mongodb_rs_member_optime_date diff | Time series |
| Oplog window | head - tail timestamp | Stat panel |
| bms-3 RAM available | node_memory_MemAvailable_bytes | Gauge |
| bms-3 disk usage | node_filesystem_avail_bytes | Gauge |
| MongoDB connections | mongodb_connections{state="current"} | Time series |
15. Regular Maintenance Schedule
Every 15 Minutes (continuous monitoring)
Handled by Prometheus + Alertmanager (automated). No agent action needed unless alert fires.
- rs.status() health check (via future mongodb-exporter scrape)
- Replication lag check
- Connection count check
Daily (02:30 UTC — nightly ops window)
AI-Dev-BMS4 executes:
- Quick rs.status() check via SSH
- Check replication lag < 60 seconds
- Verify all 3 members visible and healthy
- Check MongoDB log for errors in last 24h
- Verify arbiter (bms-4) is in ARBITER state (not DOWN)
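The checklist above can be expressed as a single evaluation over the output of rs.status(). This is a sketch only: it assumes the status has already been fetched and converted to a plain dictionary, and the `optime_epoch` field (member optime in epoch seconds) is a hypothetical name introduced here for illustration.

```python
def check_replica_set(status, expected_members=3, max_lag_seconds=60):
    """Return a list of problems found in an rs.status()-style dict.

    An empty list means every daily check passed. Each member dict is
    assumed to carry name, stateStr, health and (for data-bearing
    members) optime_epoch, a hypothetical field holding epoch seconds.
    """
    members = status.get("members", [])
    problems = []
    if len(members) != expected_members:
        problems.append(f"expected {expected_members} members, saw {len(members)}")
    for m in members:
        if m.get("health") != 1:
            problems.append(f"{m['name']} is unhealthy")
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        problems.append("no PRIMARY")
    if not any(m.get("stateStr") == "ARBITER" for m in members):
        problems.append("no member in ARBITER state")
    if primary is not None:
        for m in members:
            if m.get("stateStr") == "SECONDARY":
                lag = primary["optime_epoch"] - m["optime_epoch"]
                if lag >= max_lag_seconds:
                    problems.append(
                        f"{m['name']} lag {lag}s is not below {max_lag_seconds}s"
                    )
    return problems


if __name__ == "__main__":
    status = {"members": [
        {"name": "bms-3", "stateStr": "PRIMARY", "health": 1, "optime_epoch": 1000},
        {"name": "bms-2", "stateStr": "SECONDARY", "health": 1, "optime_epoch": 998},
        {"name": "bms-4", "stateStr": "ARBITER", "health": 1},
    ]}
    print(check_replica_set(status) or "rs0 OK")
```

Returning a list of problems, rather than exiting on the first one, lets the agent put every finding into a single issue or alert.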
# Daily MongoDB health check script
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const s = rs.status();
  const healthy = s.members.filter(m => m.health === 1).length;
  const primary = s.members.find(m => m.stateStr === \"PRIMARY\");
  const arbiter = s.members.find(m => m.stateStr === \"ARBITER\" && m.health === 1);
  if (!primary) { print(\"CRITICAL: No PRIMARY\"); quit(1); }
  if (!arbiter) { print(\"CRITICAL: Arbiter not in ARBITER state\"); quit(1); }
  if (healthy < 2) { print(\"CRITICAL: Only \" + healthy + \" healthy members\"); quit(1); }
  print(\"rs0 OK: PRIMARY=\" + primary.name + \" healthy=\" + healthy + \"/\" + s.members.length);
'"
Weekly (Sunday 03:00 UTC)
- Oplog size and window analysis
- Index statistics — identify unused indexes
- Collection statistics — size, count, fragmentation
- Profiler slow query report
- Review MongoDB error log summary
# Weekly MongoDB analysis, run on bms-3 (PRIMARY)
# \$natural is escaped so the local shell does not expand it inside the double quotes
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const oplog = db.getSiblingDB(\"local\").oplog.rs;
  const head = oplog.find().sort({\$natural: -1}).limit(1).next();
  const tail = oplog.find().sort({\$natural: 1}).limit(1).next();
  const windowHours = (head.ts.t - tail.ts.t) / 3600;
  print(\"=== OPLOG ===\");
  print(\"Window: \" + windowHours.toFixed(1) + \" hours\");
  print(\"Oplog max size: \" + (oplog.stats().maxSize / 1024/1024/1024).toFixed(2) + \" GB\");
  print(\"\");
  print(\"=== DATABASES ===\");
  db.adminCommand({listDatabases: 1}).databases.forEach(d => {
    print(d.name + \": \" + (d.sizeOnDisk / 1024/1024).toFixed(0) + \" MB\");
  });
'"
Monthly (1st Sunday of month, 03:00 UTC)
- Review replica set configuration — member priorities and votes
- Check keyFile md5 consistency across all members
- Rolling restart assessment (needed if mongod version update pending)
- Index optimization — drop unused indexes, rebuild fragmented ones
- Capacity forecast — disk and RAM trend analysis
- Review and rotate monitoring credentials
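A plain cron expression cannot state "first Sunday of the month" directly, so one option is to run the job every Sunday at 03:00 UTC and gate the monthly tasks on a date check. A minimal sketch of that gate (the function name is illustrative):

```python
import datetime


def is_first_sunday(day):
    """True if `day` is the first Sunday of its month.

    date.weekday() returns 6 for Sunday; the first Sunday always falls
    on one of the first seven days of the month.
    """
    return day.weekday() == 6 and day.day <= 7


if __name__ == "__main__":
    today = datetime.datetime.now(datetime.timezone.utc).date()
    if is_first_sunday(today):
        print("Run monthly MongoDB maintenance tasks")
    else:
        print("Skip monthly tasks")
```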
16. Maintenance Procedures
16.1 MongoDB Version Check and Update Assessment
# Check versions on all members
for host in "ubuntu@51.68.155.224" "ubuntu@145.239.133.104" "root@54.36.123.110"; do
  echo "=== $host ==="
  ssh -o ConnectTimeout=10 "$host" "mongod --version 2>/dev/null | head -1" 2>/dev/null
done
Current versions:
- bms-3: 7.0.26
- bms-2: 7.0.25
- bms-4: 7.0.37
Patch-level skew is acceptable within the 7.0.x series. Plan a rolling update when any member falls 2+ patch releases behind the current 7.0.x stable release. By that rule bms-3 and bms-2, which trail bms-4 by 11 and 12 patch releases, are already due for an update.
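The skew rule can be checked mechanically. The sketch below compares each member against the newest version present in the set, which is an assumption made for illustration (the rule in the text refers to the current upstream 7.0.x stable release, which this code does not look up); function names are hypothetical.

```python
def patch_gap(versions):
    """Map each host to the number of patch releases it trails the newest member."""
    parsed = {host: tuple(int(p) for p in v.split(".")) for host, v in versions.items()}
    series = {p[:2] for p in parsed.values()}
    if len(series) != 1:
        raise ValueError(f"members span multiple release series: {sorted(series)}")
    newest = max(p[2] for p in parsed.values())
    return {host: newest - p[2] for host, p in parsed.items()}


def needs_update(versions, max_gap=2):
    """Hosts that are max_gap or more patch releases behind, sorted by name."""
    return sorted(h for h, gap in patch_gap(versions).items() if gap >= max_gap)


if __name__ == "__main__":
    current = {"bms-3": "7.0.26", "bms-2": "7.0.25", "bms-4": "7.0.37"}
    print(patch_gap(current))
    print("Update candidates:", needs_update(current))
```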
16.2 Rolling Member Restart (for mongod updates)
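Never restart all members simultaneously. Always follow this order:
- Restart the SECONDARY member first (bms-2 observer, least impact)
- Wait for the SECONDARY to rejoin and catch up (lag = 0)
- Restart the ARBITER (bms-4): no data, fast restart
- Step down the PRIMARY (bms-3)
- Wait for the election result
- Restart the former PRIMARY (bms-3)
Because bms-2 is configured with priority 0 and 0 votes, bms-3 is the only electable member. rs.stepDown() therefore cannot hand the PRIMARY role to another member, and rs0 accepts no writes while bms-3 is stepped down or restarting. Schedule the last three steps inside a window where write downtime is acceptable, or temporarily raise the priority and votes of bms-2 first (see 18.2).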
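The ordering rule (secondaries first, then the arbiter, the primary last) can be derived from the current member states rather than hard-coded, which helps if roles have changed since the last run. A small sketch, with member names and states passed in as plain tuples (an illustrative input shape, not an existing interface):

```python
# Lower rank restarts earlier: data-bearing secondaries, then arbiters, primary last.
ROLE_ORDER = {"SECONDARY": 0, "ARBITER": 1, "PRIMARY": 2}


def restart_order(members):
    """Return member names in rolling-restart order.

    `members` is a list of (name, stateStr) tuples. Members in any other
    state (for example RECOVERING) abort the plan, since a rolling
    restart should start from a fully healthy set.
    """
    unexpected = [name for name, state in members if state not in ROLE_ORDER]
    if unexpected:
        raise ValueError(f"members not in a restartable state: {unexpected}")
    ordered = sorted(members, key=lambda m: ROLE_ORDER[m[1]])
    return [name for name, _state in ordered]


if __name__ == "__main__":
    rs0 = [("bms-3", "PRIMARY"), ("bms-4", "ARBITER"), ("bms-2", "SECONDARY")]
    print(" -> ".join(restart_order(rs0)))
```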
# Step 1: Restart bms-2 (SECONDARY observer)
ssh ubuntu@145.239.133.104 "sudo systemctl restart mongod"
# Wait for it to rejoin, then compare its optime with the PRIMARY optime
sleep 30
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const s = rs.status();
  const p = s.members.find(m => m.stateStr === \"PRIMARY\");
  const bms2 = s.members.find(m => m.name.includes(\"145.239\"));
  print(bms2.name, bms2.stateStr, \"lag:\", (p.optimeDate - bms2.optimeDate) / 1000 + \"s\");
'"
# Step 2: Restart bms-4 (ARBITER, only if joined)
ssh root@54.36.123.110 "systemctl restart mongod"
sleep 10
# Step 3: Step down bms-3 PRIMARY
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval 'rs.stepDown(60)'"
# Wait for election (typically < 12s)
sleep 15
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const p = rs.status().members.find(m => m.stateStr === \"PRIMARY\");
  if (p) print(\"New PRIMARY:\", p.name); else print(\"No PRIMARY yet, wait\");
'"
# Step 4: Restart former PRIMARY (bms-3)
ssh ubuntu@51.68.155.224 "sudo systemctl restart mongod"
sleep 30
# Step 5: Verify rs0 fully healthy
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  rs.status().members.forEach(m => print(m.name, m.stateStr, \"health:\", m.health));
'"
16.3 Index Optimization
// Identify unused indexes (run on PRIMARY during weekly maintenance)
// Run: mongosh --quiet --eval '...' on bms-3
// For each database, check index usage stats
db.adminCommand({listDatabases: 1}).databases
  .filter(d => !['admin', 'local', 'config'].includes(d.name))
  .forEach(d => {
    const db2 = db.getSiblingDB(d.name);
    db2.getCollectionNames().forEach(coll => {
      db2.runCommand({aggregate: coll, pipeline: [
        {$indexStats: {}},
        {$match: {'accesses.ops': {$lt: 5}}} // Used < 5 times
      ], cursor: {}}).cursor.firstBatch.forEach(idx => {
        print(`UNUSED: ${d.name}.${coll} index: ${JSON.stringify(idx.key)} ops:${idx.accesses.ops}`);
      });
    });
  });
16.4 Oplog Size Adjustment
By default, MongoDB 7.0 (WiredTiger) sizes the oplog at 5% of free disk space, with a lower bound of 990 MB and an upper bound of 50 GB. For bms-3 (410 GB disk, ~170 GB used, so ~240 GB free), this works out to roughly 10 to 12 GB.
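As a worked example of the default sizing rule, the sketch below applies the documented WiredTiger default (5% of free disk space, clamped between 990 MB and 50 GB) to a given amount of free disk. The function name is illustrative, and the actual oplog size on a host depends on the free space at the time the oplog was first created.

```python
def default_oplog_size_mb(free_disk_gb):
    """Default WiredTiger oplog size in MB for a given amount of free disk (GB).

    5% of free disk space, with a lower bound of 990 MB and an upper
    bound of 50 GB (51200 MB).
    """
    five_percent_mb = free_disk_gb * 1024 / 20  # 5% == 1/20
    return min(max(five_percent_mb, 990), 50 * 1024)


if __name__ == "__main__":
    # bms-3: 410 GB disk minus ~170 GB used leaves ~240 GB free
    size_mb = default_oplog_size_mb(240)
    print(f"Default oplog size: {size_mb:.0f} MB ({size_mb / 1024:.1f} GB)")
```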
// Check current oplog configuration
db.getSiblingDB('local').oplog.rs.stats().maxSize / 1024 / 1024 / 1024 + " GB"
// Resize the oplog online if needed (MongoDB 3.6+, WiredTiger); no restart required.
// Run on each data-bearing member, secondaries first. The size is given in MB:
db.adminCommand({ replSetResizeOplog: 1, size: Double(10240) })  // 10 GB
// Note: replication.oplogSizeMB in /etc/mongod.conf only applies when the oplog is
// first created; changing it afterwards does not resize an existing oplog.
Recommended oplog size: 10 GB on bms-3 (covers ~72h of operations at current write volume).
17. Backup Strategy
Backup Architecture
| Backup type | Source | Destination | Schedule | Retention |
|---|---|---|---|---|
| mongodump (full) | bms-3 (PRIMARY) | Wasabi p24-infra/mongodb/full/ | Weekly, Sunday 01:00 UTC | 4 weeks |
| mongodump (incremental/oplog) | bms-3 (PRIMARY) | Wasabi p24-infra/mongodb/oplog/ | Daily, 01:00 UTC | 7 days |
| bms-2 disk snapshot | bms-2 (observer) | OVH snapshot API | Monthly | 2 snapshots |
Why backup from PRIMARY (bms-3) not from SECONDARY? bms-2 is designated as observer and dev environment host — its MongoDB data is kept current but not treated as the backup source. Backups run from PRIMARY to ensure the most up-to-date data.
Why not backup from arbiter (bms-4)? Arbiters store no data.
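The retention column of the table implies a pruning step that the backup scripts below do not perform themselves. The following sketch selects expired objects by the date embedded in the key names used by those scripts (`...-YYYY-MM-DD.tar.gz`); how the keys are listed and deleted is left out, and the function name is an assumption for illustration.

```python
import datetime
import re

# Backup keys end in a date, e.g. mongodb/full/mongodb-full-2026-06-07.tar.gz
KEY_DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})\.tar\.gz$")


def expired_keys(keys, today, keep_days):
    """Return the keys whose embedded date is more than keep_days before today.

    Keys without a recognizable date are never selected, so unknown
    objects are left alone rather than deleted.
    """
    cutoff = today - datetime.timedelta(days=keep_days)
    expired = []
    for key in keys:
        match = KEY_DATE_RE.search(key)
        if match and datetime.date.fromisoformat(match.group(1)) < cutoff:
            expired.append(key)
    return expired


if __name__ == "__main__":
    today = datetime.date(2026, 6, 14)
    full = [
        "mongodb/full/mongodb-full-2026-05-10.tar.gz",
        "mongodb/full/mongodb-full-2026-06-07.tar.gz",
    ]
    # Full backups: 4 weeks retention; oplog backups would use keep_days=7
    print(expired_keys(full, today, keep_days=28))
```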
Daily Oplog Backup Script
#!/usr/bin/env bash
# /root/mongodb-oplog-backup.sh on bms-3
set -euo pipefail
DATE=$(date -u +%Y-%m-%d)
BACKUP_DIR="/tmp/mongodb-oplog-${DATE}"
S3_KEY="mongodb/oplog/oplog-${DATE}.tar.gz"
SINCE=$(date -u --date='25 hours ago' +%s)
# Dump oplog entries from the last 25h (1h overlap with the previous daily run).
# mongodump --query expects Extended JSON v2, hence the $timestamp form.
mongodump \
  --host 127.0.0.1:27017 \
  --authenticationDatabase admin \
  -u admin -p "$MONGODB_ADMIN_PASSWORD" \
  --db local \
  --collection oplog.rs \
  --query '{"ts": {"$gte": {"$timestamp": {"t": '"${SINCE}"', "i": 0}}}}' \
  --out "$BACKUP_DIR"
# Compress
tar czf "/tmp/${DATE}-oplog.tar.gz" -C "$BACKUP_DIR" .
rm -rf "$BACKUP_DIR"
# Upload to Wasabi (the heredoc is unquoted, so the shell expands ${DATE} and ${S3_KEY})
python3 - << PYEOF
import boto3
import certifi

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.eu-central-2.wasabisys.com",
    region_name="eu-central-2",
    verify=certifi.where(),
)
with open("/tmp/${DATE}-oplog.tar.gz", "rb") as f:
    s3.upload_fileobj(f, "p24-infra", "${S3_KEY}")
print("Uploaded: s3://p24-infra/${S3_KEY}")
PYEOF
rm -f "/tmp/${DATE}-oplog.tar.gz"
Weekly Full Backup Script
#!/usr/bin/env bash
# /root/mongodb-full-backup.sh on bms-3
set -euo pipefail
DATE=$(date -u +%Y-%m-%d)
BACKUP_DIR="/tmp/mongodb-full-${DATE}"
S3_KEY="mongodb/full/mongodb-full-${DATE}.tar.gz"
# Full dump of all databases
mongodump \
  --host 127.0.0.1:27017 \
  --authenticationDatabase admin \
  -u admin -p "$MONGODB_ADMIN_PASSWORD" \
  --oplog \
  --out "$BACKUP_DIR"
# Compress
tar czf "/tmp/${DATE}-full.tar.gz" -C "$BACKUP_DIR" .
BACKUP_SIZE=$(du -sh "/tmp/${DATE}-full.tar.gz" | cut -f1)
rm -rf "$BACKUP_DIR"
# Upload to Wasabi (the heredoc is unquoted, so the shell expands the ${...} variables)
python3 - << PYEOF
import boto3
import certifi

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.eu-central-2.wasabisys.com",
    region_name="eu-central-2",
    verify=certifi.where(),
)
with open("/tmp/${DATE}-full.tar.gz", "rb") as f:
    s3.upload_fileobj(f, "p24-infra", "${S3_KEY}")
print("Uploaded: s3://p24-infra/${S3_KEY} (${BACKUP_SIZE})")
PYEOF
rm -f "/tmp/${DATE}-full.tar.gz"
echo "Weekly backup complete: $S3_KEY ($BACKUP_SIZE)"
Backup Verification (Weekly)
# Verify that last Sunday's backup archive downloads and is readable.
# This checks the archive structure only. A mongodump archive is not a dbpath,
# so a full restore test would run mongorestore against a scratch mongod.
DATE=$(date -u --date='last Sunday' +%Y-%m-%d)
S3_KEY="mongodb/full/mongodb-full-${DATE}.tar.gz"
# Download and verify (bms-2 has spare disk and RAM)
ssh ubuntu@145.239.133.104 "
python3 -c \"
import boto3, certifi
s3 = boto3.client('s3', endpoint_url='https://s3.eu-central-2.wasabisys.com',
                  region_name='eu-central-2', verify=certifi.where())
s3.download_file('p24-infra', '${S3_KEY}', '/tmp/backup-verify.tar.gz')
print('Download OK:', s3.head_object(Bucket='p24-infra', Key='${S3_KEY}')['ContentLength'], 'bytes')
\"
# List the archive contents to verify its structure
tar tzf /tmp/backup-verify.tar.gz | head -20
rm -f /tmp/backup-verify.tar.gz
echo 'Backup structure verified'
"
18. Failover Runbook
18.1 Planned Failover (Maintenance Step-down)
Use this when taking bms-3 (PRIMARY) offline for maintenance.
Prerequisites:
- Confirm bms-2 (SECONDARY) is caught up (lag = 0)
- Confirm bms-4 (ARBITER) is healthy
- Notify team via Discord before starting
// 1. Check pre-conditions
rs.status().members.forEach(m => print(m.name, m.stateStr, "health:", m.health));
// 2. Step down bms-3 for 120s. Run on bms-3:
rs.stepDown(120);
// Note: bms-2 has priority 0 and votes 0, so it cannot be elected. With no
// electable secondary, rs.stepDown() returns an error unless forced, and rs0 has
// no PRIMARY (no writes) until bms-3 is re-elected with the arbiter's vote.
// To keep writes available during maintenance, first give bms-2 priority 1 and
// votes 1 via rs.reconfig() (see 18.2).
// 3. Verify the resulting state. Run on bms-2:
rs.status();
Important quorum consideration: rs0 has:
- bms-3: 1 vote (PRIMARY or SECONDARY)
- bms-2: 0 votes (observer, non-voting)
- bms-4: 1 vote (ARBITER)
Total votes: 2. Majority needed: 2. If the arbiter (bms-4) is DOWN, bms-3 holds only 1 of 2 votes, loses its majority, and steps down to SECONDARY. The arbiter's presence is essential for quorum.
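The quorum arithmetic above can be checked for any outage scenario. This is a simplified model for illustration only: it counts votes and looks for an electable member, and it ignores data freshness and the other election rules that mongod applies. The dictionary layout and function name are assumptions.

```python
# rs0 configuration as described above (votes and priority per member)
RS0 = {
    "bms-3": {"votes": 1, "priority": 1, "arbiter": False},
    "bms-2": {"votes": 0, "priority": 0, "arbiter": False},
    "bms-4": {"votes": 1, "priority": 0, "arbiter": True},
}


def can_have_primary(members, down=()):
    """True if the reachable members hold a vote majority and one is electable."""
    total_votes = sum(m["votes"] for m in members.values())
    majority = total_votes // 2 + 1
    up = [m for name, m in members.items() if name not in down]
    votes_up = sum(m["votes"] for m in up)
    electable = any(m["priority"] > 0 and not m["arbiter"] for m in up)
    return votes_up >= majority and electable


if __name__ == "__main__":
    for scenario in [(), ("bms-2",), ("bms-4",), ("bms-3",)]:
        label = ", ".join(scenario) or "none"
        print(f"down: {label:6} -> PRIMARY possible: {can_have_primary(RS0, scenario)}")
```

With the current configuration, losing either bms-3 or bms-4 leaves the set without a PRIMARY, while losing bms-2 does not affect writes.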
18.2 Emergency Failover (Primary Fails Unexpectedly)
# Step 1: Verify the situation
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval 'rs.status()'" 2>/dev/null \
  || echo "bms-3 unreachable"
# Step 2: Check from the bms-2 perspective
ssh ubuntu@145.239.133.104 "mongosh --quiet --eval '
  const s = rs.status();
  s.members.forEach(m => print(m.name, m.stateStr, m.health));
  print(\"myState:\", s.myState);
'"
# Step 3: If the arbiter (bms-4) is healthy and bms-3 is truly down:
# rs0 CANNOT elect bms-2 as PRIMARY because bms-2 has priority 0 and votes 0.
# The set stays without a PRIMARY: no writes, reads only from bms-2 as a SECONDARY
# (the arbiter holds the only remaining vote and cannot become PRIMARY).
#
# RECOVERY OPTIONS:
#   A) Restore bms-3 and let it rejoin as PRIMARY (preferred)
#   B) Change bms-2 to priority 1 / votes 1 so it can become PRIMARY (emergency only)
#      Requires admin credentials: this is a HUMAN ACTION
# Step 4 (Human Action): Temporarily promote bms-2
# Run on bms-2 with admin credentials. With no PRIMARY available,
# the reconfig has to be forced:
#   cfg = rs.conf();
#   cfg.members[0].priority = 1;  // adjust the index to the bms-2 member
#   cfg.members[0].votes = 1;
#   rs.reconfig(cfg, { force: true });
#   rs.status();  // should elect bms-2 as PRIMARY
# Step 5: Alert the human operator immediately
source ~/.claude-env
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content":"CRITICAL: MongoDB rs0 has no PRIMARY. bms-3 may be down. HUMAN ACTION REQUIRED immediately. See runbook section 18.2."}'
18.3 Arbiter Loss Recovery
If bms-4 (ARBITER) is down:
- Immediate impact: because bms-2 is non-voting, losing the arbiter leaves bms-3 with 1 of 2 votes (2 needed), so rs0 loses quorum and bms-3 steps down to SECONDARY
- Critical impact: all writes to MongoDB stop until the arbiter is restored
- Resolution: restart mongod on bms-4 first; it is the quickest recovery path
ssh root@54.36.123.110 "
  systemctl status mongod
  systemctl restart mongod
  sleep 10
  systemctl status mongod
"
# Verify arbiter rejoined
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const arb = rs.status().members.find(m => m.stateStr === \"ARBITER\");
  if (arb) print(\"Arbiter OK:\", arb.name); else print(\"Arbiter NOT found\");
'"
19. Capacity Planning Guide
Current Baseline (2026-06-14)
| Server | RAM Total | RAM Used | Disk Total | Disk Used | Growth Risk |
|---|---|---|---|---|---|
| bms-3 (PRIMARY) | 32 GB | 21.7 GB (MongoDB) + ~4 GB Docker | 410 GB | 170 GB (44%) | HIGH — OOM risk |
| bms-2 (observer) | 32 GB | ~2 GB (MongoDB replica) | 410 GB | 62 GB (16%) | LOW |
| bms-4 (arbiter) | 32 GB | ~75 MB (mongod arbiter) | 1.8 TB | 8.3 GB (1%) | VERY LOW |
RAM Capacity Thresholds — bms-3
| Available RAM | Action |
|---|---|
| > 8 GB | Normal |
| 4–8 GB | Warning: monitor hourly |
| 2–4 GB | Alert: plan immediate maintenance window |
| < 2 GB | Critical: initiate emergency mongod restart or workload migration |
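bms-3 RAM forecast: by default, MongoDB 7.0 sizes the WiredTiger cache at the larger of 50% of (RAM - 1 GB) or 256 MB, which is about 15.5 GB on this 32 GB host. The current data set appears to fill it. The cache itself is capped, but mongod also allocates memory outside the cache (connections, sorts, aggregations), which would explain total usage of 21.7 GB and can grow with load. Watch the wiredTigerCacheBytesInUse metric.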
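For reference, the default cache size follows directly from the documented WiredTiger rule: the larger of 50% of (RAM - 1 GB) and 256 MB. A short sketch of that arithmetic (the function name is illustrative, and an explicit cacheSizeGB setting in mongod.conf would override the default):

```python
def default_wt_cache_gb(ram_gb):
    """Default WiredTiger internal cache size in GB for a host with ram_gb of RAM.

    The larger of 50% of (RAM - 1 GB) and 256 MB (0.25 GB).
    """
    return max(0.5 * (ram_gb - 1), 0.25)


if __name__ == "__main__":
    cache = default_wt_cache_gb(32)
    print(f"32 GB host: default WiredTiger cache = {cache} GB")
    # Memory used by mongod beyond the cache, given the 21.7 GB observed on bms-3
    print(f"Non-cache mongod memory: ~{21.7 - cache:.1f} GB")
```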
// Check WiredTiger cache usage
const st = db.serverStatus().wiredTiger.cache;
print("Cache target:", st["maximum bytes configured"] / 1024/1024/1024 + " GB");
print("Cache used:", st["bytes currently in the cache"] / 1024/1024/1024 + " GB");
print("Evictions:", st["pages evicted by application threads"]);
Disk Capacity Thresholds - bms-3
MongoDB data directory + Docker images + logs on 410 GB:
- MongoDB: ~150 GB estimated (oplog + data)
- Docker: ~20 GB (staging images)
- Available headroom: ~200 GB at 44% used
Action triggers:
- 60% disk (246 GB used) → run docker image prune
- 70% disk (287 GB used) → evaluate moving old staging versions to archive
- 80% disk (328 GB used) → emergency cleanup or MongoDB oplog resize
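The action triggers above map onto a simple threshold lookup. The sketch below uses integer arithmetic so that the boundary values (246, 287 and 328 GB on a 410 GB disk) hit their thresholds exactly; treating a threshold as reached at or above the stated percentage is an assumption, and the names are illustrative.

```python
DISK_TOTAL_GB = 410

# (threshold percent, action), checked from the most severe downwards
TRIGGERS = [
    (80, "emergency cleanup or MongoDB oplog resize"),
    (70, "evaluate moving old staging versions to archive"),
    (60, "run docker image prune"),
]


def disk_action(used_gb, total_gb=DISK_TOTAL_GB):
    """Return the action for the highest threshold reached, or None below 60%."""
    for threshold, action in TRIGGERS:
        # used/total >= threshold/100, rearranged to avoid float rounding
        if used_gb * 100 >= threshold * total_gb:
            return action
    return None


if __name__ == "__main__":
    for used in (170, 246, 287, 328):
        print(f"{used} GB used -> {disk_action(used) or 'no action'}")
```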
Key Metrics to Track Weekly
- db.serverStatus().opcounters: query/insert/update/delete rate
- Oplog window (hours): must stay > 24h for safe secondary operations
- Replication lag trend: should stay near 0
- db.serverStatus().connections.current: connection count trend
- WiredTiger cache eviction rate: high eviction = memory pressure
Capacity Forecast Queries
// Database size growth (run weekly, track over time)
db.adminCommand({listDatabases: 1}).databases
  .filter(d => d.name !== 'local')
  .sort((a, b) => b.sizeOnDisk - a.sizeOnDisk)
  .forEach(d => {
    print(`${d.name}: ${(d.sizeOnDisk / 1024/1024/1024).toFixed(3)} GB`);
  });
20. Alert Definitions
Prometheus Alert Rules for MongoDB rs0
Add to monitoring/prometheus/rules/mongodb.yml:
groups:
  - name: mongodb_rs0
    interval: 60s
    rules:
      # ─── Connectivity ──────────────────────────────────────────────────────────
      - alert: MongoDBMemberDown
        expr: |
          # Port probe, until mongodb-exporter is available
          probe_success{job="blackbox_tcp", instance=~".*27017.*"} == 0
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "MongoDB member unreachable: {{ $labels.instance }}"
          description: "MongoDB port 27017 not responding on {{ $labels.instance }}. Check if mongod is running."
      # ─── RAM pressure on bms-3 ─────────────────────────────────────────────────
      - alert: BMS3MemoryCritical
        expr: |
          node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3 < 2
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "bms-3 available RAM < 2 GB"
          description: "MongoDB PRIMARY (bms-3) has < 2 GB available RAM. OOM risk is high. MongoDB is using ~21.7 GB. Immediate action required."
      - alert: BMS3MemoryWarning
        expr: |
          node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3 < 4
        for: 10m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "bms-3 available RAM < 4 GB"
          # $value is already in GiB (bytes / 1024^3), so print it directly
          description: "MongoDB PRIMARY (bms-3) memory is getting low. Available: {{ printf \"%.1f\" $value }} GiB. Plan maintenance window."
      # ─── Disk usage ────────────────────────────────────────────────────────────
      - alert: BMS3DiskWarning
        expr: |
          (node_filesystem_size_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"} -
           node_filesystem_avail_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"}) /
          node_filesystem_size_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"} > 0.70
        for: 15m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "bms-3 disk usage > 70%"
          description: "Disk on bms-3 is {{ $value | humanizePercentage }} full. Run docker image prune and check MongoDB oplog size."
      - alert: BMS1DiskCritical
        expr: |
          (node_filesystem_size_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"} -
           node_filesystem_avail_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"}) /
          node_filesystem_size_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"} > 0.95
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "bms-1 (Pinbox24 PRODUCTION) disk > 95%"
          description: "Pinbox24 production server disk at {{ $value | humanizePercentage }}. Writes may fail. EMERGENCY: immediate cleanup required."
      # ─── Replication (once mongodb-exporter is deployed) ──────────────────────
      # Until the exporter exists this rule only reports its absence. Replace it with a
      # MongoDBReplicationLagHigh rule once real replication metrics are available.
      - alert: MongoDBExporterAbsent
        expr: |
          # Placeholder for the future lag rule (PRIMARY minus SECONDARY optime):
          # mongodb_rs_member_optime_date{state="PRIMARY"} - on() mongodb_rs_member_optime_date{state="SECONDARY"} > 60
          absent(up{job="mongodb"}) == 1
        for: 5m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "MongoDB exporter not yet deployed"
          description: "Deploy mongodb-exporter to get replication lag metrics. Manual check: ssh ubuntu@51.68.155.224 mongosh --eval 'rs.status()'"
# ─── Blackbox TCP probes (to create in prometheus.yml) ───────────────────
# Add to monitoring/prometheus/prometheus.yml blackbox job:
#   - targets:
#       - 51.68.155.224:27017    # bms-3 MongoDB
#       - 145.239.133.104:27017  # bms-2 MongoDB
#       - 54.36.123.110:27017    # bms-4 MongoDB arbiter
#     labels: { job: blackbox_tcp, module: tcp_connect }
Alertmanager Routing for MongoDB
Add to monitoring/alertmanager/config.yml:
routes:
  - match:
      team: infra
      severity: critical
    receiver: discord-critical
    repeat_interval: 30m
    continue: true
  - match:
      team: infra
      severity: warning
    receiver: discord-warning
    repeat_interval: 4h
Manual Alert Test
# Send test alert via the Alertmanager API
# (v2 endpoint; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://217.154.82.162:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "MongoDBTestAlert",
      "severity": "warning",
      "team": "infra",
      "instance": "54.36.123.110:27017"
    },
    "annotations": {
      "summary": "Test alert from runbook",
      "description": "Manual test of MongoDB alert routing"
    },
    "endsAt": "'"$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)"'"
  }]'
Appendix: Quick Reference Card
rs0 Member Summary
| Server | IP | Port | Role | Votes | Priority |
|---|---|---|---|---|---|
| bms-3 (ns3129867) | 51.68.155.224 | 27017 | PRIMARY/SECONDARY | 1 | 1 |
| bms-2 (ns3087638) | 145.239.133.104 | 27017 | SECONDARY observer | 0 | 0 |
| bms-4 (ns3101999) | 54.36.123.110 | 27017 | ARBITER | 1 | 0 |
Quorum: 2 votes required. bms-3 (1) + bms-4 (1) = majority. If arbiter down, no quorum.
Emergency Contacts
| Issue | First action | Escalate if |
|---|---|---|
| No MongoDB PRIMARY | Check arbiter health first | Arbiter healthy but still no PRIMARY |
| bms-1 disk 100% | Run docker prune | Disk still 100% after cleanup |
| WAHA session down | POST /api/sessions/default/restart | Session fails to restart |
| Atrax GPS stale > 30 min | Check n8n container, trigger workflow | n8n healthy but workflow still fails |
AI-Dev-BMS4 Status Check
# From local workstation — check if nightly ran successfully
ssh root@54.36.123.110 "tail -20 /var/log/bms4-nightly.log"
# Check last cron execution time
ssh root@54.36.123.110 "ls -la /var/log/bms4-nightly.log && grep 'END' /var/log/bms4-nightly.log | tail -3"
# Check active agent sessions on bms-4
# (requires SUPABASE credentials)