03 — Nightly Operations & MongoDB rs0 Maintenance

Status: Design document — 2026-06-14
Scope: AI-Dev-BMS4 agent setup · nightly operations checklist · MongoDB rs0 maintenance plan
Servers covered: bms-4 (54.36.123.110) · bms-3 (51.68.155.224) · bms-2 (145.239.133.104)
Related docs: p4-ovh-bms-4-ns3101999-operations.md · p4-ovh-bms-3-ns3129867-operations.md · p4-ovh-bms-2-ns3087638-operations.md


Table of Contents

Part 1: AI-Dev-BMS4 Agent Design

  1. Agent Overview
  2. Installation Checklist
  3. Issue Pickup Logic
  4. Escalation Rules
  5. Agent Configuration

Part 2: Nightly Operations

  6. Nightly Operations Schedule
  7. Tier 1 Critical Service Checks
  8. Tier 2 Platform Checks
  9. Tier 3 Quality Checks
  10. Supabase Maintenance
  11. n8n Workflow Health
  12. Disk Usage Monitoring
  13. Security Nightly Checks

Part 3: MongoDB rs0 Maintenance

  14. Replica Set Health Dashboard
  15. Regular Maintenance Schedule
  16. Maintenance Procedures
  17. Backup Strategy
  18. Failover Runbook
  19. Capacity Planning Guide
  20. Alert Definitions


Part 1: AI-Dev-BMS4 Agent Design

1. Agent Overview

Role

AI-Dev-BMS4 is the autonomous Claude Code agent deployed on bms-4 (54.36.123.110). Its primary responsibility is nightly infrastructure issue processing — it picks up GitHub issues created by GitHub Actions health checks, attempts automated remediation, and escalates unresolvable problems to human operators.

Position in the Agent Fleet

| Agent | Host | Role | Max Parallel |
|---|---|---|---|
| AI-Dev-IO1 | vps-i1 (IONOS) | et-operational-platform issue processing | 2–3 |
| AI-Dev-HS1 | vps-h1 (Hostinger) | p24-infra issue pipeline + claude-proxy | 1–2 |
| AI-Dev-OV1 | bms-2 (OVH) | dev/test workloads | 4 |
| AI-Dev-BMS4 | bms-4 (OVH) | nightly p24-infra ops + MongoDB maintenance | 4 |

AI-Dev-BMS4 is specifically designed for the 02:00–06:00 UTC nightly window when GitHub Actions have generated issues from health checks and infrastructure scans. It runs on the server with the most free RAM (30+ GB free) and disk space (1.7 TB free).

Capabilities

  • Clone and operate on radieu/p24-infra repository (dedicated clone at /home/claude-runner/p24-infra)
  • Run diagnostic commands: curl, docker, mongosh, ssh (read-only diagnostics)
  • Create GitHub issues, add comments, apply labels, and open PRs to dev
  • Query Prometheus metrics API for service health
  • Send Discord notifications for immediate alerts
  • Restart failed Docker containers on bms-4 only (owns its own host)
  • SSH read-only access to vps-i1, vps-h1, bms-2, bms-3 for diagnostics

Constraints

  • Never write to production databases or modify MongoDB data
  • Never restart containers on remote servers (vps-h1, bms-3) — SSH is read-only for diagnostics
  • Never push directly to main — all changes via PR to dev
  • Never expose secret values in GitHub issue comments
  • Always create a recovery path before any action with data loss risk

2. Installation Checklist

bms-4 currently runs as root (OVH bare metal default). The following steps bring it into the same pattern as other agent VPSes.

Step 1 — Create claude-runner user

ssh root@54.36.123.110
 
# Create dedicated user (no password, no sudo by default)
useradd -m -s /bin/bash claude-runner
usermod -aG docker claude-runner   # allow docker commands on this host only
 
# Create SSH directory for agent access
mkdir -p /home/claude-runner/.ssh
chmod 700 /home/claude-runner/.ssh
chown -R claude-runner:claude-runner /home/claude-runner/.ssh
 
# Create .claude directory for credentials
mkdir -p /home/claude-runner/.claude
chown -R claude-runner:claude-runner /home/claude-runner/.claude

Step 2 — Install Claude Code

# Install Node.js 22.x (required by Claude Code)
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt-get install -y nodejs
 
# Install Claude Code globally
npm install -g @anthropic-ai/claude-code
 
# Verify
claude --version
which claude   # expect /usr/bin/claude or /usr/local/bin/claude

Step 3 — Copy OAuth credentials

From the local workstation, copy a valid .credentials.json that contains both accessToken and refreshToken:

# From local Windows workstation
scp C:\Users\konar\.claude\.credentials.json root@54.36.123.110:/home/claude-runner/.claude/.credentials.json
ssh root@54.36.123.110 "chown claude-runner:claude-runner /home/claude-runner/.claude/.credentials.json && chmod 600 /home/claude-runner/.claude/.credentials.json"

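Verify that Claude Code authenticates. Note that claude --version only prints the version and does not exercise the OAuth token, so send a one-off prompt instead:

su - claude-runner -c 'claude -p "Reply with OK"'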

Step 4 — Create SSH key for remote diagnostics

The agent needs read-only SSH access to other servers for diagnostic commands.

# As root on bms-4 — generate key for claude-runner
su - claude-runner -c "ssh-keygen -t ed25519 -f /home/claude-runner/.ssh/id_bms4_agent -C 'ai-dev-bms4@bms-4' -N ''"
 
# Display public key — copy this to authorized_keys on other servers
cat /home/claude-runner/.ssh/id_bms4_agent.pub

Then on each target server, add the public key to the read-only diagnostic user:

# On vps-i1 (IONOS) — add to claude-admin for diagnostic SSH
ssh root@217.154.82.162 "echo '<bms4_agent_pubkey>' >> /home/claude-admin/.ssh/authorized_keys"
 
# On vps-h1 (Hostinger) — same
ssh root@72.60.32.61 "echo '<bms4_agent_pubkey>' >> /home/claude-admin/.ssh/authorized_keys"
 
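# On bms-3: same, for the ubuntu user
ssh ubuntu@51.68.155.224 "echo '<bms4_agent_pubkey>' >> /home/ubuntu/.ssh/authorized_keys"
 
# On bms-2: same (the SSH client config also targets this host as ubuntu)
ssh ubuntu@145.239.133.104 "echo '<bms4_agent_pubkey>' >> /home/ubuntu/.ssh/authorized_keys"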

Configure SSH client to use the correct key per host:

cat > /home/claude-runner/.ssh/config << 'EOF'
Host vps-i1 217.154.82.162
    HostName 217.154.82.162
    User claude-admin
    IdentityFile ~/.ssh/id_bms4_agent
    StrictHostKeyChecking no
    ConnectTimeout 10
 
Host vps-h1 72.60.32.61
    HostName 72.60.32.61
    User root
    IdentityFile ~/.ssh/id_bms4_agent
    StrictHostKeyChecking no
    ConnectTimeout 10
 
Host bms-3 51.68.155.224
    HostName 51.68.155.224
    User ubuntu
    IdentityFile ~/.ssh/id_bms4_agent
    StrictHostKeyChecking no
    ConnectTimeout 10
 
Host bms-2 145.239.133.104
    HostName 145.239.133.104
    User ubuntu
    IdentityFile ~/.ssh/id_bms4_agent
    StrictHostKeyChecking no
    ConnectTimeout 10
EOF
chown claude-runner:claude-runner /home/claude-runner/.ssh/config
chmod 600 /home/claude-runner/.ssh/config

Step 5 — Set up GitHub credentials

# Install gh CLI
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg \
  | dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" \
  | tee /etc/apt/sources.list.d/github-cli.list
apt-get update && apt-get install -y gh
 
# Create env file with GitHub token — populated from .env.local
cat > /home/claude-runner/.claude-env << 'EOF'
export GITHUB_TOKEN=<value from .env.local P24_INFRA_GH_TOKEN>
export GH_TOKEN=$GITHUB_TOKEN
export DISCORD_WEBHOOK_URL=<value from .env.local P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL>
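export PROMETHEUS_URL=http://217.154.82.162:9090
export SUPABASE_URL=<value from .env.local SUPABASE_URL>
export SUPABASE_SERVICE_KEY=<value from .env.local SUPABASE_SERVICE_KEY>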
EOF
chmod 600 /home/claude-runner/.claude-env
chown claude-runner:claude-runner /home/claude-runner/.claude-env

Never store the actual token values in this document. Populate .claude-env from .env.local on the local workstation.
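A missing or still-templated variable in .claude-env would otherwise only surface when the nightly run fails. The following is a minimal pre-flight sketch, not part of the existing scripts: missing_env_vars is a hypothetical helper that prints the names of variables that are unset, empty, or still hold a <value from ...> placeholder.

```shell
# Hypothetical helper (an assumption, not an existing script): print the name
# of every listed variable that is unset, empty, or still a "<...>" placeholder.
missing_env_vars() {
  local name value
  for name in "$@"; do
    eval "value=\${$name:-}"        # read the variable whose name is in $name
    case "$value" in
      ''|'<'*'>') printf '%s\n' "$name" ;;
    esac
  done
}

# Usage (after `source ~/.claude-env`):
#   missing=$(missing_env_vars GITHUB_TOKEN GH_TOKEN DISCORD_WEBHOOK_URL PROMETHEUS_URL)
#   [ -z "$missing" ] || { echo "Unset or placeholder: $missing" >&2; exit 1; }
```

Running the check at the top of the wrapper script would stop the run before any GitHub or Discord call is attempted with an empty token.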

Step 6 — Clone p24-infra repository

# As claude-runner — dedicated clone (NOT /opt/p24-infra which is the deployment copy)
su - claude-runner
 
source ~/.claude-env
git clone https://${GITHUB_TOKEN}@github.com/radieu/p24-infra.git ~/p24-infra
cd ~/p24-infra
git checkout dev
git remote set-url origin https://github.com/radieu/p24-infra.git  # remove token from remote URL

The wrapper script injects the token at runtime.

Step 7 — Create wrapper script

cat > /root/bms4-nightly.sh << 'SCRIPT'
#!/usr/bin/env bash
set -euo pipefail
LOG="/var/log/bms4-nightly.log"
echo "=== bms4-nightly START $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$LOG"
 
# Run as claude-runner
runuser -u claude-runner -- bash -c '
  source ~/.claude-env
  cd ~/p24-infra
  git remote set-url origin "https://${GITHUB_TOKEN}@github.com/radieu/p24-infra.git"
  git fetch origin dev
  git reset --hard origin/dev
  git remote set-url origin "https://github.com/radieu/p24-infra.git"
  claude --dangerously-skip-permissions -p "/process-issues" \
    --allowedTools "Bash,Read,Edit,Write,Glob,Grep"
' >> "$LOG" 2>&1
 
echo "=== bms4-nightly END $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$LOG"
SCRIPT
chmod +x /root/bms4-nightly.sh

Step 8 — Create cron job

# Nightly at 02:05 UTC (5 min after GitHub Actions health-check creates issues)
cat > /etc/cron.d/bms4-nightly << 'EOF'
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
 
5 2 * * * root /root/bms4-nightly.sh >> /var/log/bms4-nightly.log 2>&1
EOF
chmod 644 /etc/cron.d/bms4-nightly

Step 9 — Create CLAUDE.md for the agent

See Section 5 — Agent Configuration for the full CLAUDE.md content.

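# Place the CLAUDE.md in the clone (create .claude/ if missing, keep claude-runner ownership)
install -d -o claude-runner -g claude-runner /home/claude-runner/p24-infra/.claude
install -o claude-runner -g claude-runner -m 644 \
   /opt/p24-infra/docs/evaluation/bms4-agent-CLAUDE.md \
   /home/claude-runner/p24-infra/.claude/CLAUDE.md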

Step 10 — Register in GitHub as AI-Dev-BMS4

Following the AI runner provisioning procedure from CLAUDE.md:

  1. Create email routing rule in Cloudflare for ai-dev-bms4@zintegrowana.online
  2. Sign up GitHub account with that email
  3. Add as collaborator to radieu/p24-infra with write permission
  4. Create human-action issue: “Confirm GitHub invitation for AI-Dev-BMS4”

Step 11 — Verify

# Test Claude Code auth (a real prompt; --version alone does not exercise the OAuth token)
su - claude-runner -c 'claude -p "Reply with OK" && echo AUTH_OK'
 
# Test GitHub auth
su - claude-runner -c "source ~/.claude-env && gh auth status"
 
# Test repo access
su - claude-runner -c "source ~/.claude-env && cd ~/p24-infra && gh issue list --repo radieu/p24-infra --limit 5"
 
# Test SSH diagnostics
su - claude-runner -c "ssh vps-i1 'docker ps --format table 2>/dev/null | head -5'"
 
# Test Discord notification
su - claude-runner -c "source ~/.claude-env && curl -s -X POST \"\$DISCORD_WEBHOOK_URL\" \
  -H 'Content-Type: application/json' \
  -d '{\"content\":\"AI-Dev-BMS4 setup verified — nightly agent ready\"}'"

3. Issue Pickup Logic

Triage Decision Tree

New GitHub issue in radieu/p24-infra
  │
  ├── Labels include "atrax-stale"
  │     → check n8n workflow status on bms-4
  │     → if n8n healthy, attempt workflow restart via API
  │     → if n8n down, escalate immediately (Tier 1)
  │
  ├── Labels include "server-down" / "infra-check-fail"
  │     → identify failed component from issue title
  │     → run targeted diagnostic (curl, ssh, docker ps)
  │     → if container restart would fix: restart it
  │     → if requires human SSH/physical: escalate
  │
  ├── Labels include "failed-gh-actions"
  │     → check workflow logs via gh API
  │     → if transient (timeout, network): re-trigger workflow
  │     → if code/config bug: open fix PR
  │     → if secret expired / runner down: escalate
  │
  ├── Labels include "security"
  │     → never auto-remediate; always escalate to human
  │     → add "human-action" label immediately
  │
  ├── Labels include "triage" only (no routing label)
  │     → apply /process-issues skill for standard triage
  │
  └── All other issues
        → apply /process-issues skill (Design → In Progress → PR)

Issue Resolution Flow

1. CLAIM  — gh issue edit #N --add-label "in-progress"
           — add comment: "AI-Dev-BMS4 picking up this issue [timestamp]"

2. DIAGNOSE — run relevant check commands (see Part 2)
            — record findings in a comment

3. ACT
   ├── Resolvable automatically:
   │     — implement fix (code change, config update, container restart)
   │     — open PR to dev
   │     — comment: "Fix in PR #NN"
   │
   └── Not resolvable:
         — add comment: "Diagnostics complete. Root cause: <description>"
         — add label "human-action"
         — send Discord alert (see Section 4)

4. CLOSE — close the issue once the fix PR is merged; otherwise leave it open with the human-action label and move on

Capacity Limit

AI-Dev-BMS4 runs up to 4 parallel Claude Code processes during the nightly window (02:00–06:00 UTC). Use Supabase agent_sessions table with worker_env = 'bms4' to prevent over-claiming:

# Check active sessions before spawning
ACTIVE=$(curl -sf \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  "$SUPABASE_URL/rest/v1/agent_sessions?status=eq.active&worker_env=eq.bms4&select=count" \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(d[0]['count'])")
if [ "$ACTIVE" -ge 4 ]; then
  echo "At capacity ($ACTIVE/4 agents). Backing off."
  exit 0
fi
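If the Supabase request fails, ACTIVE is empty and the numeric comparison above errors out. One way to harden this, sketched under the assumption that backing off is the safer default, is a small helper (free_slots is a hypothetical name) that treats any non-numeric count as "no capacity".

```shell
# Hypothetical helper: print how many more agents may start. An empty or
# non-numeric count means the lookup failed, so report 0 and back off.
free_slots() {
  local active="$1" max="${2:-4}"
  case "$active" in
    ''|*[!0-9]*) echo 0; return 0 ;;
  esac
  if [ "$active" -ge "$max" ]; then
    echo 0
  else
    echo $(( max - active ))
  fi
}

# Usage:
#   SLOTS=$(free_slots "$ACTIVE" 4)
#   [ "$SLOTS" -gt 0 ] || { echo "No capacity. Backing off."; exit 0; }
```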

4. Escalation Rules

Escalation Triggers

| Condition | Tier | Action |
|---|---|---|
| Tier 1 service down (WAHA, Atrax GPS, Pinbox24 API, Supabase) | Immediate | Discord + GitHub issue comment + human-action label |
| MongoDB rs0 has no PRIMARY | Immediate | Discord + GitHub issue |
| bms-1 disk > 95% | Immediate | Discord + GitHub issue |
| Security issue (CVE, auth anomaly, credential expiry) | Immediate | Discord + GitHub issue + human-action |
| Issue cannot be diagnosed (SSH unreachable, auth failure) | 15 min | Discord warning + GitHub comment |
| Auto-fix attempted but failed twice | 30 min | Discord + human-action label |
| Agent OAuth expired | N/A | System cannot self-notify; the GitHub Actions health-check backstop handles this |

Discord Notification Format

# Discord alert function: call this from within the agent or nightly script
send_discord_alert() {
  local SEVERITY="$1"   # "CRITICAL" | "WARNING" | "INFO"
  local TITLE="$2"
  local DESCRIPTION="$3"
  local ISSUE_URL="${4:-}"
  local COLOR PAYLOAD
 
  case "$SEVERITY" in
    CRITICAL) COLOR=15158332 ;;  # red
    WARNING)  COLOR=16776960 ;;  # yellow
    INFO)     COLOR=3066993  ;;  # green
    *)        COLOR=9807270  ;;  # grey
  esac
 
  # Attach the url field only when an issue URL was supplied
  PAYLOAD=$(jq -nc \
    --arg title "[$SEVERITY] $TITLE" \
    --arg desc "$DESCRIPTION" \
    --arg url "$ISSUE_URL" \
    --argjson color "$COLOR" \
    '{embeds: [{title: $title, description: $desc, color: $color}
               + (if $url != "" then {url: $url} else {} end)]}')
 
  curl -s -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" || true
}
 
# Usage examples ($'...' quoting turns \n into a real newline; inside "..." it stays literal):
send_discord_alert "CRITICAL" \
  "AI-Dev-BMS4: WAHA session FAILED" \
  $'WhatsApp gateway session not WORKING. GPS incidents cannot be received. Manual restart required.\nHost: waha2.vps-h1.infra.zintegrowana.online' \
  "https://github.com/radieu/p24-infra/issues/123"
 
send_discord_alert "WARNING" \
  "AI-Dev-BMS4: bms-3 RAM at 87%" \
  $'MongoDB on bms-3 consuming 21.7 GB. Available RAM below safe threshold.\nConsider planned mongod restart during low-traffic window.'

GitHub Issue Labeling on Escalation

# When escalating an issue to human:
gh issue edit "$ISSUE_NUMBER" \
  --repo radieu/p24-infra \
  --add-label "human-action"
 
# The heredoc delimiter is unquoted so that $(date ...) expands in the comment body
gh issue comment "$ISSUE_NUMBER" \
  --repo radieu/p24-infra \
  --body "$(cat << EOF
## AI-Dev-BMS4 Escalation Report
 
**Timestamp:** $(date -u +%Y-%m-%dT%H:%M:%SZ)
**Agent:** AI-Dev-BMS4 (bms-4, 54.36.123.110)
**Reason for escalation:** <specific reason>
 
### Diagnostics performed
<list of commands run and their outputs>
 
### Root cause assessment
<what the agent determined>
 
### Recommended human action
<specific steps for the human operator>
 
### Cannot proceed because
<specific blocker — e.g., "Requires MongoDB admin password", "Requires physical server access">
EOF
)"

5. Agent Configuration

CLAUDE.md for bms-4 agent

Place at /home/claude-runner/p24-infra/.claude/CLAUDE.md (overrides repo-level for this agent instance):

# CLAUDE.md — AI-Dev-BMS4 Nightly Agent
 
## Agent Identity
- **Label:** AI-Dev-BMS4
- **Host:** bms-4 (54.36.123.110, Ubuntu 22.04)
- **Role:** Nightly p24-infra issue processing + MongoDB maintenance
- **Max parallel agents:** 4
- **Active window:** 02:00–06:00 UTC
 
## Primary Tasks
1. Pick up GitHub issues in `radieu/p24-infra` with labels: `triage`, `server-down`,
   `infra-check-fail`, `atrax-stale`, `failed-gh-actions`
2. Run /process-issues skill for triage/design/implementation
3. MongoDB rs0 health check commands (read-only mongosh)
4. Disk usage checks on all servers via SSH
5. Escalate unsolvable issues via Discord + human-action label
 
## Permissions
- Docker commands: ALLOWED on bms-4 only (this server)
- SSH diagnostics: ALLOWED to vps-i1, vps-h1, bms-2, bms-3 (read-only)
- GitHub PRs: ALWAYS target `dev` branch
- MongoDB: READ-ONLY — `rs.status()`, `db.serverStatus()`, profiler queries only
- NEVER: write to MongoDB, restart containers on remote servers, push to main
 
## Environment
Source `~/.claude-env` before any command requiring GITHUB_TOKEN or DISCORD_WEBHOOK_URL.
 
## Error Reporting
All errors → Discord via DISCORD_WEBHOOK_URL + GitHub issue comment

.env.local secrets required on bms-4

The agent’s .claude-env (at /home/claude-runner/.claude-env) must contain:

| Variable | Source in .env.local | Purpose |
|---|---|---|
| GITHUB_TOKEN | P24_INFRA_GH_TOKEN | gh CLI authentication |
| GH_TOKEN | same | alias for gh |
| DISCORD_WEBHOOK_URL | P24_DISCORD_INFRA_SCRIPTS_ERRORS_WEBHOOK_URL | alerts |
| SUPABASE_URL | SUPABASE_URL | agent_sessions coordination |
| SUPABASE_SERVICE_KEY | SUPABASE_SERVICE_KEY | insert/read agent_sessions |
| PROMETHEUS_URL | http://217.154.82.162:9090 | metrics queries |

The MongoDB admin password is NOT stored on bms-4 — the agent runs only read-only mongosh commands using the keyFile (internal cluster auth), not admin credentials.

systemd service file (alternative to cron)

For more reliable execution than cron:

# /etc/systemd/system/bms4-nightly.service
[Unit]
Description=AI-Dev-BMS4 Nightly p24-infra Issue Processing
After=network.target
 
[Service]
Type=oneshot
User=root
ExecStart=/root/bms4-nightly.sh
StandardOutput=append:/var/log/bms4-nightly.log
StandardError=append:/var/log/bms4-nightly.log
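TimeoutStartSec=3600

# /etc/systemd/system/bms4-nightly.timer
# No Requires= on the service: a timer activates the same-named service by default,
# and Requires= would also start the job whenever the timer unit itself is started.
[Unit]
Description=AI-Dev-BMS4 Nightly Timer
 
[Timer]
OnCalendar=*-*-* 02:05:00 UTC
Persistent=true
 
[Install]
WantedBy=timers.target

# Enable the timer (remove /etc/cron.d/bms4-nightly first so the job does not run twice)
systemctl daemon-reload
systemctl enable bms4-nightly.timer
systemctl start bms4-nightly.timer
systemctl list-timers bms4-nightly.timer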

Part 2: Nightly Operations

6. Nightly Operations Schedule

All times UTC. AI-Dev-BMS4 orchestrates the sequence starting at 02:05 UTC after the initial GitHub Actions health checks at 02:00.

| Time (UTC) | Task | Responsible | Method |
|---|---|---|---|
| 02:00 | Health check: all Tier 1 services | GitHub Actions health-check.yml (every 2h) | GH Actions on ionos runner |
| 02:00 | Atrax GPS freshness check | GitHub Actions atrax-data-freshness.yml (every 10 min) | GH Actions on ionos runner |
| 02:05 | AI-Dev-BMS4 wakes up and picks up any open issues | AI-Dev-BMS4 cron | bms-4 |
| 02:10 | Disk usage check: all 5 servers | AI-Dev-BMS4 | SSH + df -h |
| 02:30 | MongoDB rs0 health check + replication metrics | AI-Dev-BMS4 | mongosh on bms-3 via SSH |
| 02:30 | DB maintenance VACUUM (Mon–Sat) | GitHub Actions db-maintenance.yml | GH Actions on ionos runner |
| 03:00 | Docker container audit: all hosts | AI-Dev-BMS4 | SSH + docker ps |
| 03:00 | DB weekly maintenance (Sunday only, REINDEX) | GitHub Actions db-maintenance.yml | GH Actions on ionos runner |
| 03:30 | SSL certificate expiry check | AI-Dev-BMS4 | curl + openssl |
| 04:00 | Trivy CVE scan | GitHub Actions trivy-scan.yml | GH Actions on ionos runner |
| 04:00 | n8n SQLite maintenance (Sunday only) | GitHub Actions n8n-maintenance.yml | GH Actions on ionos runner |
| 04:30 | Security log review (fail2ban, auth attempts) | AI-Dev-BMS4 | SSH + log analysis |
| 05:00 | Morning summary report to Discord | AI-Dev-BMS4 | Discord webhook |
| 05:00 | p24-infra issue pipeline (triage + design) | IONOS cron p24-infra-nightly.sh | vps-i1 |
| 05:30 | Close resolved issues, tag remaining | AI-Dev-BMS4 | gh CLI |

Time Budget (bms-4 resources, 32 GB RAM)

| Phase | Duration | Claude agents | RAM usage |
|---|---|---|---|
| Issue pickup + triage | 02:05–03:30 | 1–4 | 2.8–5.6 GB |
| MongoDB + disk checks | 02:10–02:40 | 0 (scripts) | <100 MB |
| Security + SSL | 03:30–04:30 | 1–2 | 1.4–2.8 GB |
| Report generation | 04:30–05:30 | 1 | 700 MB |

Total RAM budget: 30 GB free → well within limits even with 4 agents.
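The headroom claim can be checked with simple arithmetic. The sketch below assumes roughly 1.4 GB per Claude Code process, a figure inferred from the table row that pairs 1–2 agents with 1.4–2.8 GB rather than a measured value; the function names are illustrative.

```shell
PER_AGENT_MB=1400   # assumption inferred from the table (2 agents = 2.8 GB)
FREE_MB=30000       # "30 GB free" from the text, expressed in MB

# RAM needed for N parallel agents, in MB
ram_needed_mb()   { echo $(( $1 * PER_AGENT_MB )); }

# RAM left over after N agents are running, in MB
ram_headroom_mb() { echo $(( FREE_MB - $1 * PER_AGENT_MB )); }

# ram_needed_mb 4    prints 5600  (the 5.6 GB upper bound in the table)
# ram_headroom_mb 4  prints 24400
```

Under that assumption, four agents leave about 24 GB unused, which is consistent with the "well within limits" statement.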


7. Tier 1 Critical Service Checks

These checks run at 02:00 UTC via health-check.yml and every 10 minutes for Atrax. AI-Dev-BMS4 supplements with deeper diagnostics when alerts are raised.

7.1 Atrax GPS Sync (n8n workflow)

What it is: n8n workflow ID AJ1px9uHIfbsriof syncs GPS tracking data from Atrax API to Supabase p24_gps_current_state table every 5 minutes.

Check command (run by GitHub Actions every 10 min):

# Check freshness: stale if last sync > 10 minutes ago
RESPONSE=$(curl -sf \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  "$SUPABASE_URL/rest/v1/p24_gps_current_state?select=n8n_synced_at&order=n8n_synced_at.desc&limit=1")
LAST=$(echo "$RESPONSE" | jq -r '.[0].n8n_synced_at // empty')
# Guard: with an empty timestamp, date -d "" silently resolves to today 00:00
if [ -z "$LAST" ]; then
  echo "STALE: no n8n_synced_at returned"
else
  AGE=$(( $(date -u +%s) - $(date -u -d "$LAST" +%s) ))
  [ "$AGE" -lt 600 ] && echo "OK: ${AGE}s" || echo "STALE: ${AGE}s"
fi

Success criterion: n8n_synced_at within last 600 seconds (10 min)

Failure action (AI-Dev-BMS4):

# 1. Check n8n container health on bms-4 (post-migration) or vps-h1 (pre-migration)
ssh root@54.36.123.110 "docker inspect root-n8n-1 --format '{{.State.Status}}'"
 
# 2. Check last executions via the n8n public API (executions are filtered by workflowId)
curl -s -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?workflowId=AJ1px9uHIfbsriof&limit=3" \
  | jq '.data[] | {finished, status, startedAt}'
 
# 3. If container down: alert immediately (cannot auto-restart — requires n8n workflow logic)
# 4. If workflow stuck: post GitHub issue with label "atrax-stale"

SLA: Data must never be stale for more than 30 minutes. Page immediately once staleness reaches 30 minutes.
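The two thresholds (600 seconds for the freshness check, 30 minutes for paging) can be combined into one classifier. This is a sketch; classify_staleness and its OK/STALE/PAGE labels are hypothetical names, not part of the existing workflow.

```shell
# Hypothetical classifier: map the age of the last sync (seconds) to a state.
#   under 600 s        OK     (success criterion met)
#   600 to 1799 s      STALE  (raise or update the atrax-stale issue)
#   1800 s and above   PAGE   (30-minute SLA reached)
classify_staleness() {
  local age="$1"
  if   [ "$age" -lt 600 ];  then echo "OK"
  elif [ "$age" -lt 1800 ]; then echo "STALE"
  else                           echo "PAGE"
  fi
}

# Usage: classify_staleness "$AGE"
```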


7.2 Docker Daemon Health — bms-1 and bms-3

What it is: Pinbox24 production (bms-1) and staging (bms-3) run on Docker. Daemon down = all containers down.

Check commands:

# bms-1 (Pinbox24 production) — SSH as root
ssh -i ~/.ssh/id_bms4_agent -o ConnectTimeout=10 root@94.23.26.113 \
  "systemctl is-active docker && docker ps --format '{{.Names}}' | wc -l"
 
# bms-3 (Pinbox24 staging + MongoDB primary): count containers whose status starts with "Up"
ssh ubuntu@51.68.155.224 \
  "systemctl is-active docker && docker ps --format '{{.Status}}' | grep -c '^Up'"

Success criterion: docker service = active, container count matches expected (bms-1: ~24, bms-3: ~11)

Failure action: Immediate Discord CRITICAL alert. Cannot auto-remediate — escalate with human-action label.


7.3 WAHA WhatsApp Gateway

What it is: WAHA container on vps-h1 receives WhatsApp messages for incident management. Session must be in WORKING state (not just container running).

Check command:

# Server liveness + session state (both required)
SERVER=$(curl -s -o /dev/null -w "%{http_code}" --max-time 15 \
  -H "X-Api-Key: $WAHA_API_KEY" \
  "https://waha2.vps-h1.infra.zintegrowana.online/api/server/status")
 
SESS_STATUS=$(curl -s --max-time 15 \
  -H "X-Api-Key: $WAHA_API_KEY" \
  "https://waha2.vps-h1.infra.zintegrowana.online/api/sessions/default" \
  | python3 -c 'import sys,json; print(json.load(sys.stdin).get("status","?"))' 2>/dev/null)
 
echo "server=$SERVER session=$SESS_STATUS"
[ "$SERVER" = "200" ] && [ "$SESS_STATUS" = "WORKING" ] && echo "OK" || echo "FAIL"

Success criterion: HTTP 200 AND session status = WORKING

Important: Server returning HTTP 200 does NOT mean session is healthy (proven during 2026-05-23 blackout). Always check session status separately.

Failure action:

  • If container down: check ssh root@72.60.32.61 "docker ps | grep waha"
  • If session FAILED/STOPPED: attempt session restart via WAHA API: POST /api/sessions/default/restart
  • If restart fails: trigger waha-session-restart.yml via gh workflow run
  • If persistent: escalate with human-action — may require re-scan of QR code
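The failure actions above form an escalation ladder, which can be sketched as a pure function of the two check results plus a counter of restart attempts already made. The function name, the attempt counter, and the action labels are assumptions for illustration.

```shell
# Hypothetical ladder: choose the next action from the HTTP code, the session
# status, and the number of restart attempts already made for this incident.
waha_next_action() {
  local server_code="$1" session="$2" attempts="${3:-0}"
  if [ "$server_code" = "200" ] && [ "$session" = "WORKING" ]; then
    echo "none"
  elif [ "$server_code" != "200" ]; then
    echo "check-container"     # ssh to vps-h1: docker ps | grep waha
  elif [ "$attempts" -eq 0 ]; then
    echo "api-restart"         # POST /api/sessions/default/restart
  elif [ "$attempts" -eq 1 ]; then
    echo "workflow-restart"    # gh workflow run waha-session-restart.yml
  else
    echo "escalate-human"      # may require a QR code re-scan
  fi
}

# Usage: waha_next_action "$SERVER" "$SESS_STATUS" "$RESTART_ATTEMPTS"
```

Keeping the decision separate from the commands that carry it out makes the ladder easy to review without touching the live gateway.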

7.4 MongoDB rs0 Replica Set Health

What it is: Three-member replica set (bms-3 PRIMARY, bms-2 SECONDARY observer, bms-4 ARBITER). Requires at least 2 of 3 members for election quorum.

Check command:

# Run via SSH to bms-3 (which has cluster auth access)
# Uses keyFile internal auth — no admin password needed for rs.status()
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const s = rs.status();
  const members = s.members.map(m => ({name: m.name, state: m.stateStr, health: m.health}));
  printjson({set: s.set, ok: s.ok, members: members});
'"

Success criteria:

  • Exactly 1 member in PRIMARY state
  • health: 1 for all members
  • Replication lag on bms-2 < 60 seconds
  • No member in RECOVERING, DOWN, or UNKNOWN state
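Three of the four criteria can be evaluated mechanically from the member list. The sketch below assumes one line per member in the form "name stateStr health" and does not cover replication lag, which the check command above does not output; rs_health_verdict is a hypothetical name.

```shell
# Hypothetical evaluator: read "name stateStr health" lines on stdin and print
# OK when there is exactly one PRIMARY, no member is RECOVERING, DOWN or
# UNKNOWN, and every member reports health 1. Lag is not checked here.
rs_health_verdict() {
  awk '
    $2 == "PRIMARY"                                        { primaries++ }
    $2 == "RECOVERING" || $2 == "DOWN" || $2 == "UNKNOWN"  { bad++ }
    $NF != 1                                               { unhealthy++ }
    END {
      if (primaries == 1 && bad == 0 && unhealthy == 0) print "OK"
      else printf "FAIL primaries=%d bad_state=%d unhealthy=%d\n", primaries, bad, unhealthy
    }'
}

# Usage (assumes mongosh prints one member per line in that field order):
#   ssh bms-3 "mongosh --quiet --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr, m.health))'" \
#     | rs_health_verdict
```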

Failure action: See Section 18 — Failover Runbook


7.5 Pinbox24 Production API

What it is: The production fleet management API endpoint.

Check command:

STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  --connect-timeout 10 --max-time 20 \
  "https://api.w4.pinbox24.com/api/")
[ "$STATUS" = "200" ] || [ "$STATUS" = "401" ] \
  && echo "API reachable (HTTP $STATUS)" \
  || echo "API FAIL (HTTP $STATUS)"

Success criterion: HTTP 200 or 401 (401 = API running but unauthenticated, which is correct)

Failure action: Immediate Discord CRITICAL. Check bms-1 Docker daemon. Cannot auto-remediate.


7.6 Supabase Connectivity and Queue Depths

What it is: Supabase is the primary database and coordination hub. Queue depth metrics reveal processing backlogs.

Check command:

# Connectivity
HTTP=$(curl -s -o /dev/null -w "%{http_code}" \
  "$SUPABASE_URL/rest/v1/agent_sessions?select=count&limit=0" \
  -H "apikey: $SUPABASE_SERVICE_KEY" --max-time 10)
echo "Supabase HTTP: $HTTP"
 
# Queue depths — check for backlogs
curl -sf \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  "$SUPABASE_URL/rest/v1/pending_transcriptions?select=count" \
  | jq '.[0].count // 0' | xargs -I{} echo "pending_transcriptions: {}"
 
curl -sf \
  -H "apikey: $SUPABASE_SERVICE_KEY" \
  "$SUPABASE_URL/rest/v1/pending_pdf_processing?select=count" \
  | jq '.[0].count // 0' | xargs -I{} echo "pending_pdf_processing: {}"

Thresholds: pending_transcriptions > 100 or pending_pdf_processing > 50 = warning alert.


7.7 Disk Usage — Critical Servers

What it is: bms-1 is already at 100% disk capacity. All servers need monitoring.

Check commands:

# bms-1 (Pinbox24 production) — CRITICAL (already at 100%)
ssh root@94.23.26.113 "df -h / | tail -1"
 
# bms-3 (staging + MongoDB primary)
ssh ubuntu@51.68.155.224 "df -h / | tail -1"
 
# vps-i1 (monitoring)
ssh claude-admin@217.154.82.162 "df -h / | tail -1"
 
# vps-h1 (n8n + WAHA)
ssh root@72.60.32.61 "df -h / | tail -1"
 
# bms-4 (self — this server)
df -h / | tail -1

Thresholds:

| Server | Warning | Critical | Action |
|---|---|---|---|
| bms-1 | 90% | 95% | Immediate escalation (disk already full) |
| bms-3 | 70% | 80% | Docker prune + old log cleanup |
| vps-i1 | 80% | 90% | Docker image prune |
| vps-h1 | 75% | 85% | Docker prune |
| bms-4 | 60% | 75% | n8n data + Docker prune |
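The per-server thresholds can be encoded in a small lookup, sketched below. disk_status is a hypothetical helper, and treating a value equal to the threshold as a breach is an assumption, since the table does not say whether the limits are inclusive.

```shell
# Hypothetical helper: classify a disk usage percentage for a given server,
# using the warning/critical thresholds from the table. Accepts "87" or "87%".
disk_status() {
  local server="$1" pct="${2%\%}" warn crit
  case "$server" in
    bms-1)  warn=90; crit=95 ;;
    bms-3)  warn=70; crit=80 ;;
    vps-i1) warn=80; crit=90 ;;
    vps-h1) warn=75; crit=85 ;;
    bms-4)  warn=60; crit=75 ;;
    *) echo "UNKNOWN"; return 1 ;;
  esac
  if   [ "$pct" -ge "$crit" ]; then echo "CRITICAL"
  elif [ "$pct" -ge "$warn" ]; then echo "WARNING"
  else                              echo "OK"
  fi
}

# Usage:
#   pct=$(ssh bms-3 "df --output=pcent / | tail -1" | tr -dc '0-9')
#   disk_status bms-3 "$pct"
```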

Cleanup commands for bms-3 (auto-applicable):

ssh ubuntu@51.68.155.224 "
  # Remove unused Docker images (keeps running containers)
  docker image prune -f
  # Remove stopped containers and unused networks
  docker container prune -f
  # Cap oversized logs at 100 MB (truncate keeps the first 100 MB, i.e. the oldest entries, not the last)
  find /var/log -name '*.log' -size +100M -exec truncate -s 100M {} \;
"

8. Tier 2 Platform Checks

These checks are run nightly by AI-Dev-BMS4 at 03:00–04:00 UTC. A failure triggers a 30-minute response window.

8.1 et-operational-platform Vercel Health

# Check production deployment (curl's -w already prints 000 when the request fails)
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  --max-time 15 "${VERCEL_ETOP_PROD_URL}/api/health" 2>/dev/null \
  || true)
echo "Vercel production: HTTP $STATUS"
 
# Check for recent failed deployments via Vercel API (createdAt is in milliseconds, todate expects seconds)
curl -sf "https://api.vercel.com/v6/deployments?limit=5&target=production" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  | jq '.deployments[] | {url, state, created: (.createdAt / 1000 | floor | todate)}'

Thresholds: Any ERROR state deployment in last 24h → warning issue.


8.2 Grafana + Prometheus on vps-i1

# Grafana
curl -sf --max-time 10 \
  "https://grafana.vps-i1.infra.zintegrowana.online/api/health" \
  | jq .database
 
# Prometheus
curl -sf --max-time 10 \
  "http://217.154.82.162:9090/-/healthy" && echo "Prometheus healthy"
 
# Check Prometheus targets (any DOWN?)
curl -sf "http://217.154.82.162:9090/api/v1/targets" \
  | jq '[.data.activeTargets[] | select(.health != "up")] | length' \
  | xargs -I{} echo "Unhealthy targets: {}"

8.3 n8n Workflow Execution Failures

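# Count failed executions started in the last 24h (capped at the 20 fetched)
SINCE=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S)
curl -sf \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?status=error&limit=20" \
  | jq --arg since "$SINCE" '[.data[] | select(.startedAt >= $since)] | length' \
  | xargs -I{} echo "Failed executions (24h): {}"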
 
# Get workflow names of failures
curl -sf \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "https://n8n.bms-4.infra.zintegrowana.online/api/v1/executions?status=error&limit=10" \
  | jq '.data[] | {workflow: .workflowData.name, startedAt}'

Threshold: > 5 failed executions in 24h → GitHub issue with label n8n-errors.


8.4 SSL Certificate Expiry

# Check all public-facing domains — warn if < 14 days
DOMAINS=(
  "grafana.vps-i1.infra.zintegrowana.online"
  "n8n.bms-4.infra.zintegrowana.online"
  "waha2.vps-h1.infra.zintegrowana.online"
  "traccar.vps-i1.infra.zintegrowana.online"
)
 
for domain in "${DOMAINS[@]}"; do
  EXPIRY=$(echo | openssl s_client -servername "$domain" -connect "${domain}:443" 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null \
    | cut -d= -f2)
  DAYS=$(( ( $(date -d "$EXPIRY" +%s) - $(date +%s) ) / 86400 ))
  if [ "$DAYS" -lt 14 ]; then
    echo "WARNING: $domain cert expires in $DAYS days"
  else
    echo "OK: $domain expires in $DAYS days"
  fi
done

Threshold: < 14 days → warning issue. < 7 days → CRITICAL, immediate Discord alert.
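The loop above only distinguishes the 14-day warning. A sketch of the full two-level threshold follows; cert_severity is a hypothetical helper that could replace the if/else inside the loop.

```shell
# Hypothetical helper: map days until certificate expiry to a severity.
#   fewer than 7 days (including already expired)  CRITICAL
#   7 to 13 days                                   WARNING
#   14 days or more                                OK
cert_severity() {
  local days="$1"
  if   [ "$days" -lt 7 ];  then echo "CRITICAL"
  elif [ "$days" -lt 14 ]; then echo "WARNING"
  else                          echo "OK"
  fi
}

# Usage inside the loop:
#   echo "$(cert_severity "$DAYS"): $domain expires in $DAYS days"
```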


8.5 Memory Usage — bms-3 MongoDB Risk

# bms-3 is most at risk: MongoDB using 21.7 GB of 32 GB total
ssh ubuntu@51.68.155.224 "
  echo '--- Memory ---'
  free -h
  echo '--- MongoDB process ---'
  ps aux --sort=-%mem | grep mongod | head -3
  echo '--- Docker containers ---'
  docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}' | head -15
"

Threshold: Available memory < 4 GB on bms-3 → WARNING (MongoDB OOM risk).
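To turn the threshold into a check, the "available" column can be read from free -m. The sketch assumes the procps-ng column layout, where available is the seventh field of the Mem: line, and takes 4 GB as 4096 MiB; both function names are hypothetical.

```shell
# Hypothetical helpers. available_mb reads `free -m` output on stdin and
# prints the "available" column of the Mem: line (7th field in procps-ng).
available_mb() { awk '$1 == "Mem:" { print $7 }'; }

# mem_status applies the 4 GB threshold (taken here as 4096 MiB).
mem_status() {
  local avail_mb="$1"
  if [ "$avail_mb" -lt 4096 ]; then echo "WARNING"; else echo "OK"; fi
}

# Usage:
#   avail=$(ssh bms-3 "free -m" | available_mb)
#   mem_status "$avail"
```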


8.6 Traefik Routing Health on bms-4

# Check Traefik dashboard API (not public)
ssh root@54.36.123.110 "
  docker exec root-traefik-1 wget -qO- http://localhost:8080/api/rawdata 2>/dev/null \
  | python3 -c \"import sys,json; d=json.load(sys.stdin); print('routers:', len(d.get('routers', {})))\"
  docker inspect root-traefik-1 --format '{{.State.Status}}'
"

9. Tier 3 Quality Checks

These run nightly and generate items for the morning report. No immediate alerting — create GitHub issues for tracking.

9.1 Supabase Slow Query Report

# Top 10 slowest queries by mean execution time
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "SELECT
    LEFT(query, 80) AS query_snippet,
    calls,
    ROUND(mean_exec_time::numeric, 2) AS mean_ms,
    ROUND(total_exec_time::numeric, 2) AS total_ms
  FROM pg_stat_statements
  WHERE mean_exec_time > 100
  ORDER BY mean_exec_time DESC
  LIMIT 10;"

9.2 Docker Image CVE Summary

Handled by trivy-scan.yml at 04:00 UTC. AI-Dev-BMS4 checks for open Trivy issues:

gh issue list --repo radieu/p24-infra --label security --state open \
  --json number,title,createdAt | jq '.[] | select(.title | contains("Trivy"))'

9.3 Backup Freshness Verification

# Check Wasabi backup status JSON (written by backup-exporter)
curl -sf "http://217.154.82.162:9220/metrics" \
  | grep "^backup_last_successful_timestamp" \
  | awk '{print $1, strftime("%Y-%m-%d %H:%M", $2)}'
 
# Check n8n backup freshness (last Sunday)
python3 - << 'EOF'
import boto3, certifi, datetime, os
s3 = boto3.client("s3",
    endpoint_url="https://s3.eu-central-2.wasabisys.com",
    region_name="eu-central-2",
    verify=certifi.where()
)
objs = s3.list_objects_v2(Bucket="p24-infra", Prefix="n8n/")
if objs.get("Contents"):
    latest = max(objs["Contents"], key=lambda x: x["LastModified"])
    age = (datetime.datetime.now(datetime.timezone.utc) - latest["LastModified"]).days
    print(f"n8n backup: {latest['Key']} ({age} days ago)")
EOF

9.4 GitHub Actions CI/CD Pipeline Health

# Check failed scheduled workflows in last 24h
gh run list --repo radieu/p24-infra \
  --status failure \
  --created ">=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --json workflowName,createdAt,url \
  | jq '.[] | {workflow: .workflowName, time: .createdAt, url}'

9.5 MongoDB Slow Query Log

# Check MongoDB profiler on bms-3 for queries > 100ms (last 24h)
# system.profile is per-database: replace <app_db> with the profiled database name
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  db = db.getSiblingDB(\"<app_db>\");
  // Check profiler level (should be 1 = slow ops only, or 2 = all)
  print(\"profiler:\", JSON.stringify(db.getProfilingStatus()));
  
  // Query this database's system.profile for recent slow operations
  db.system.profile.find(
    { millis: { \$gte: 100 }, ts: { \$gte: new Date(Date.now() - 86400000) } },
    { ns: 1, millis: 1, ts: 1, op: 1 }
  ).sort({ millis: -1 }).limit(10).forEach(printjson);
'"

10. Supabase Maintenance

Nightly VACUUM (02:30 UTC, Mon–Sat)

Handled by db-maintenance.yml. This workflow connects via the Supabase pooler (IPv4-compatible) on port 5432 (session mode — required for VACUUM).

# Command reference for manual execution:
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "VACUUM ANALYZE;"

Weekly REINDEX (Sunday 03:00 UTC)

# REINDEX CONCURRENTLY — does not block reads/writes
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "REINDEX DATABASE CONCURRENTLY postgres;"

Monthly Stats Reset

# Reset pg_stat_statements counters monthly to avoid stale data
PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" \
  -c "SELECT pg_stat_statements_reset();"

Table Bloat Check

PGPASSWORD="$SUPABASE_DB_PASSWORD" psql \
  "host=$SUPABASE_DB_HOST user=postgres.$SUPABASE_PROJECT_REF port=5432 sslmode=require dbname=postgres" << 'SQL'
-- 20 largest user tables, with dead-tuple share as a rough bloat indicator
SELECT
  schemaname,
  relname AS tablename,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  pg_size_pretty(pg_relation_size(relid)) AS table_size,
  pg_size_pretty(pg_indexes_size(relid)) AS index_size,
  n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
SQL

11. n8n Workflow Health

Key Workflows to Monitor

| Workflow | ID | Schedule | Criticality |
|---|---|---|---|
| Atrax GPS Sync | AJ1px9uHIfbsriof | Every 5 min | Tier 1 |
| WAHA incident routing | (varies) | Event-triggered | Tier 1 |
| Daily GPS report | (varies) | 06:00 UTC | Tier 2 |
| Supabase queue drain | (varies) | Continuous | Tier 2 |

n8n Health Check Script

N8N_BASE="https://n8n.bms-4.infra.zintegrowana.online"
 
# Container health
ssh root@54.36.123.110 "docker inspect --format '{{.State.Health.Status}}' root-n8n-1"
 
# API health endpoint
curl -sf "${N8N_BASE}/healthz" && echo "n8n healthy"
 
# Active workflows count
curl -sf -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/workflows?active=true" \
  | jq '.data | length' | xargs -I{} echo "Active workflows: {}"
 
# Recent failures
curl -sf -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "${N8N_BASE}/api/v1/executions?status=error&limit=5" \
  | jq '.data[] | {workflowId, startedAt, status}'

Atrax Workflow Restart Procedure

If Atrax GPS sync is stale but n8n is healthy:

# Activate the Atrax GPS sync workflow via the n8n API (re-registers its schedule trigger)
curl -X POST \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  -H "Content-Type: application/json" \
  "${N8N_BASE}/api/v1/workflows/AJ1px9uHIfbsriof/activate" \
  -d '{}'
 
# If the workflow is already active but not executing, trigger a test run.
# NOTE: confirm that this endpoint exists on the deployed n8n version before relying on it.
curl -X POST \
  -H "X-N8N-API-KEY: $N8N_API_KEY" \
  -H "Content-Type: application/json" \
  "${N8N_BASE}/api/v1/workflows/AJ1px9uHIfbsriof/test" \
  -d '{"pinData":{}}'

12. Disk Usage Monitoring

Per-Server Thresholds and Cleanup Procedures

bms-1 (Pinbox24 production — CRITICAL, already at 100%)

ssh root@94.23.26.113 "
  echo '=== Disk usage ==='
  df -h /
  echo '=== Top space consumers ==='
  du -sh /var/lib/docker /var/log /tmp 2>/dev/null | sort -rh | head -10
  echo '=== Docker disk usage ==='
  docker system df
"

Auto-cleanup (safe on bms-1):

ssh root@94.23.26.113 "
  # Remove stopped containers (not running Pinbox24)
  docker container prune -f
  # Remove dangling images only (NOT all unused — avoid removing p24 images)
  docker image prune -f
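  # Truncate log files larger than 50 MB down to 10 MB (keeps the oldest 10 MB of each file)
  find /var/log -name '*.log' -size +50M -exec truncate -s 10M {} \;
  # Delete files in /tmp that have not been modified for more than 7 days
  find /tmp -mindepth 1 -type f -mtime +7 -delete 2>/dev/null || true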
"

Alert threshold: bms-1 is already at 100%, so any failed write could take down production. Alert on any usage > 95%.

bms-3 (staging + MongoDB)

ssh ubuntu@51.68.155.224 "
  df -h /
  du -sh /var/lib/mongodb /var/lib/docker 2>/dev/null
"

Auto-cleanup:

ssh ubuntu@51.68.155.224 "
  docker image prune -f
  docker container prune -f
  sudo journalctl --vacuum-time=7d
"

Alert threshold: > 70% warning, > 80% critical.

bms-4 (this server)

df -h /
du -sh /var/lib/docker /var/lib/mongodb /home/claude-runner 2>/dev/null
docker system df

Alert threshold: > 60% warning (currently at 1% — 1.7 TB free).
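
The per-server thresholds above can be collected into one lookup so the nightly job classifies every server the same way. This is a minimal sketch: the function names and the THRESHOLDS structure are hypothetical, and only the percentages come from this section. Note that `shutil.disk_usage` can differ slightly from the Use% column of `df`, which excludes reserved blocks.

```python
import shutil

# Thresholds from this section, as fractions of disk used.
# None means the section defines no threshold at that level.
THRESHOLDS = {
    "bms-1": {"warning": None, "critical": 0.95},
    "bms-3": {"warning": 0.70, "critical": 0.80},
    "bms-4": {"warning": 0.60, "critical": None},
}


def classify_disk(server: str, used_fraction: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a server's disk usage."""
    limits = THRESHOLDS[server]
    if limits["critical"] is not None and used_fraction > limits["critical"]:
        return "critical"
    if limits["warning"] is not None and used_fraction > limits["warning"]:
        return "warning"
    return "ok"


def local_used_fraction(path: str = "/") -> float:
    """Used fraction of the filesystem holding `path` (local host only)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

On bms-4 itself, `classify_disk("bms-4", local_used_fraction("/"))` gives the local status; for remote servers the used fraction would have to come from the SSH output or from node_exporter.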


13. Security Nightly Checks

13.1 fail2ban Status

# Check banned IPs on all servers
for host in root@94.23.26.113 ubuntu@51.68.155.224 root@54.36.123.110 root@72.60.32.61; do
  echo "=== $host ==="
  ssh -o ConnectTimeout=10 "$host" \
    "fail2ban-client status sshd 2>/dev/null | grep -E 'Total banned|Currently banned'" \
    2>/dev/null || echo "fail2ban not running or unreachable"
done

Alert threshold: > 50 new bans in 24h = potential brute-force attack in progress.

13.2 SSH Auth Log Review

# Failed SSH login attempts (last 24h) on each server
for host in ubuntu@51.68.155.224 root@54.36.123.110; do
  echo "=== $host failed SSH ==="
  ssh -o ConnectTimeout=10 "$host" \
    "grep 'Failed password\|Invalid user\|Authentication failure' /var/log/auth.log 2>/dev/null \
     | grep \"$(date -u --date='24 hours ago' '+%b %e')\\|$(date -u '+%b %e')\" \
     | wc -l" 2>/dev/null | xargs -I{} echo "Failed attempts: {}"
done

Alert threshold: > 200 failed attempts/24h on any single server.

13.3 SSL Certificate Expiry

Covered in Section 8.4. The check runs nightly and auto-creates a GitHub issue when a certificate expires in fewer than 14 days.

13.4 Credential Rotation Check

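# Check credential-exporter metrics for rotation age (oldest first)
# The ^ anchor skips "# HELP" / "# TYPE" lines; awk puts the sample value first so sort can order by it
curl -sf "http://217.154.82.162:9210/metrics" \
  | grep "^credential_age_days" \
  | awk '{print $NF, $0}' \
  | sort -rn \
  | head -10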

Alert threshold: Any credential older than 80 days (rotation should be every 90 days).
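
To turn the 80-day threshold into a pass/fail check, the metrics text can be parsed directly. The sketch below assumes the exporter emits standard Prometheus text lines of the form `credential_age_days{...} <value>`; the label set is not specified in this document, so labels are passed through unparsed. The function name is hypothetical.

```python
import re

# One sample line of the Prometheus text format: name, optional {labels}, value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?P<labels>\{.*\})?\s+(?P<value>\S+)'
)


def stale_credentials(metrics_text: str, max_age_days: float = 80.0):
    """Return (labels, age_days) pairs older than max_age_days, oldest first."""
    stale = []
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = LINE_RE.match(line)
        if not match or match.group("name") != "credential_age_days":
            continue
        age = float(match.group("value"))
        if age > max_age_days:
            stale.append((match.group("labels") or "", age))
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Feeding it the body returned by the curl call above yields the credentials that should be raised as rotation issues; an empty list means nothing is past the threshold.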

13.5 MongoDB Unauthorized Access Attempts

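# Check MongoDB logs for today's auth failures (bms-3, likely PRIMARY)
# Filter by date first, then by failure messages, so successful logins are not counted
ssh ubuntu@51.68.155.224 "
  grep \"$(date -u '+%Y-%m-%d')\" /var/log/mongodb/mongod.log \
    | grep -i 'authentication failed\|failed to authenticate\|authorization.*failed\|Unauthorized' \
    | tail -50
"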

Part 3: MongoDB rs0 Maintenance

14. Replica Set Health Dashboard

Key Prometheus Metrics (PromQL)

The following metrics are scraped via the future mongodb-exporter integration. Until that exporter is deployed, use the mongosh commands below.

Replication lag (secondary behind primary):

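# Once mongodb-exporter is deployed (PRIMARY optime minus SECONDARY optime, so lag is positive):
scalar(mongodb_rs_member_optime_date{state="PRIMARY"}) - mongodb_rs_member_optime_date{state="SECONDARY"}
 
# Threshold alert: lag > 60s (assumes the exporter reports optime in seconds)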

Oplog window (how long before secondary falls off the oplog):

mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp
# Healthy: > 24h (86400s)
# Warning: < 4h

Active connections:

mongodb_connections{state="current"}
# Alert if > 80% of maxConnections

RAM usage (node-level proxy metric):

# Until mongodb-exporter available, use node_memory_MemAvailable_bytes on bms-3
node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3
# Alert: available < 4 GB

mongosh Health Commands

Run these on bms-3 (PRIMARY) via SSH. Uses internal cluster auth (keyFile) — no admin password needed for status queries:

// Connect to bms-3 as cluster member (keyFile auth handles internal auth automatically)
// Run: ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '...'"
 
// === FULL REPLICA SET STATUS ===
rs.status()
 
// === CONCISE MEMBER STATUS ===
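rs.status().members.forEach(m => {
  const optime = m.optime ? m.optime.ts : 'N/A';
  // lastHeartbeatMessage is only set when a member reports a problem
  const note = m.lastHeartbeatMessage ? ` | ${m.lastHeartbeatMessage}` : '';
  print(`${m.name} | ${m.stateStr} | health:${m.health} | ${optime}${note}`);
});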
 
// === REPLICATION LAG ===
// Get PRIMARY and SECONDARY optimes, compute lag in seconds
const status = rs.status();
const primary = status.members.find(m => m.stateStr === 'PRIMARY');
const secondaries = status.members.filter(m => m.stateStr === 'SECONDARY');
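if (!primary) {
  print('No PRIMARY: replication lag cannot be computed');
} else {
  secondaries.forEach(s => {
    const lag = primary.optimeDate - s.optimeDate;  // milliseconds
    print(`${s.name} lag: ${lag / 1000}s`);
  });
}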
 
// === OPLOG WINDOW ===
use local;
const oplog = db.oplog.rs;
const head = oplog.find().sort({$natural: -1}).limit(1).next();
const tail = oplog.find().sort({$natural: 1}).limit(1).next();
const windowHours = (head.ts.t - tail.ts.t) / 3600;
print(`Oplog window: ${windowHours.toFixed(1)} hours`);
print(`Oplog max size: ${(oplog.stats().maxSize / 1024 / 1024 / 1024).toFixed(2)} GB`);
 
// === CONNECTION STATS ===
use admin;
const cs = db.serverStatus().connections;
print(`Connections: current=${cs.current} available=${cs.available} totalCreated=${cs.totalCreated}`);
 
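// === SLOW QUERIES (profiler, 20 most recent per database) ===
// The profiler is per-database: enable it with db.setProfilingLevel(1, { slowms: 100 })
// on each database to be profiled. system.profile lives in that same database.
db.adminCommand({ listDatabases: 1 }).databases
  .filter(d => !['admin', 'local', 'config'].includes(d.name))
  .forEach(d => {
    db.getSiblingDB(d.name).system.profile.find(
      { millis: { $gte: 100 } }
    ).sort({ ts: -1 }).limit(20).forEach(p => {
      print(`${p.ts.toISOString()} | ${p.op} | ${p.ns} | ${p.millis}ms`);
    });
  });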

Grafana Dashboard Panels (to create)

| Panel | Metric/Query | Visualization |
|---|---|---|
| rs0 member states | rs.status() via scraper | Status map (3 nodes) |
| Replication lag | mongodb_rs_member_optime_date diff | Time series |
| Oplog window | head - tail timestamp | Stat panel |
| bms-3 RAM available | node_memory_MemAvailable_bytes | Gauge |
| bms-3 disk usage | node_filesystem_avail_bytes | Gauge |
| MongoDB connections | mongodb_connections{state="current"} | Time series |

15. Regular Maintenance Schedule

Every 15 Minutes (continuous monitoring)

Handled by Prometheus + Alertmanager (automated). No agent action needed unless alert fires.

  • rs.status() health check (via future mongodb-exporter scrape)
  • Replication lag check
  • Connection count check

Daily (02:30 UTC — nightly ops window)

AI-Dev-BMS4 executes:

  1. Quick rs.status() check via SSH
  2. Check replication lag < 60 seconds
  3. Verify all 3 members visible and healthy
  4. Check MongoDB log for errors in last 24h
  5. Verify arbiter (bms-4) is in ARBITER state (not DOWN)
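
# Daily MongoDB health check script
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const s = rs.status();
  const healthy = s.members.filter(m => m.health === 1).length;
  const primary = s.members.find(m => m.stateStr === \"PRIMARY\");
  if (!primary) { print(\"CRITICAL: No PRIMARY\"); quit(1); }
  if (healthy < 2) { print(\"CRITICAL: Only \" + healthy + \" healthy members\"); quit(1); }
  // Replication lag must stay below 60 seconds on every SECONDARY
  s.members.filter(m => m.stateStr === \"SECONDARY\").forEach(m => {
    const lagSec = (primary.optimeDate - m.optimeDate) / 1000;
    if (lagSec >= 60) { print(\"WARNING: \" + m.name + \" lag \" + lagSec + \"s\"); }
  });
  // The arbiter (bms-4) must report ARBITER state and be healthy
  const arbiterOk = s.members.some(m => m.stateStr === \"ARBITER\" && m.health === 1);
  if (arbiterOk === false) { print(\"CRITICAL: Arbiter not healthy\"); quit(1); }
  print(\"rs0 OK: PRIMARY=\" + primary.name + \" healthy=\" + healthy + \"/\" + s.members.length);
'"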
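
The five daily checks can also be expressed as one pure function over the member list, which makes the pass/fail logic testable without a live replica set. This is a sketch under assumptions: it expects each member as a dict with the `name`, `stateStr`, `health`, and `optimeDate` fields that `rs.status()` reports, with `optimeDate` already converted to a Python `datetime`. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_LAG_SECONDS = 60  # daily checklist: replication lag < 60 seconds


def evaluate_rs_status(members):
    """Return a list of problem strings; an empty list means rs0 looks healthy."""
    problems = []
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        problems.append("CRITICAL: no PRIMARY")
    for m in members:
        if m["health"] != 1:
            problems.append(f"CRITICAL: {m['name']} is unhealthy")
    if not any(m["stateStr"] == "ARBITER" and m["health"] == 1 for m in members):
        problems.append("CRITICAL: arbiter is not in ARBITER state")
    if primary is not None:
        for m in members:
            if m["stateStr"] != "SECONDARY":
                continue
            lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
            if lag >= MAX_LAG_SECONDS:
                problems.append(f"WARNING: {m['name']} lag {lag:.0f}s")
    return problems


if __name__ == "__main__":
    now = datetime(2026, 6, 14, 2, 30, tzinfo=timezone.utc)
    example = [
        {"name": "51.68.155.224:27017", "stateStr": "PRIMARY", "health": 1, "optimeDate": now},
        {"name": "145.239.133.104:27017", "stateStr": "SECONDARY", "health": 1,
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "54.36.123.110:27017", "stateStr": "ARBITER", "health": 1},
    ]
    print(evaluate_rs_status(example))
```

Returning a list rather than exiting on the first failure lets the agent put every finding into a single GitHub issue.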

Weekly (Sunday 03:00 UTC)

  1. Oplog size and window analysis
  2. Index statistics — identify unused indexes
  3. Collection statistics — size, count, fragmentation
  4. Profiler slow query report
  5. Review MongoDB error log summary

# Weekly MongoDB analysis: run on bms-3 as PRIMARY
# \$natural is escaped so the local shell does not expand it inside the double-quoted string
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const oplog = db.getSiblingDB(\"local\").getCollection(\"oplog.rs\");
  const head = oplog.find().sort({\$natural: -1}).limit(1).next();
  const tail = oplog.find().sort({\$natural: 1}).limit(1).next();
  const windowHours = (head.ts.t - tail.ts.t) / 3600;
  print(\"=== OPLOG ===\");
  print(\"Window: \" + windowHours.toFixed(1) + \" hours\");
  print(\"Oplog size: \" + (oplog.stats().storageSize / 1024/1024/1024).toFixed(2) + \" GB\");
  print(\"\");
  
  print(\"=== DATABASES ===\");
  db.adminCommand({listDatabases: 1}).databases.forEach(d => {
    print(d.name + \": \" + (d.sizeOnDisk / 1024/1024).toFixed(0) + \" MB\");
  });
'"

Monthly (1st Sunday of month, 03:00 UTC)

  1. Review replica set configuration — member priorities and votes
  2. Check keyFile md5 consistency across all members
  3. Rolling restart assessment (needed if mongod version update pending)
  4. Index optimization — drop unused indexes, rebuild fragmented ones
  5. Capacity forecast — disk and RAM trend analysis
  6. Review and rotate monitoring credentials

16. Maintenance Procedures

16.1 MongoDB Version Check and Update Assessment

# Check versions on all members
for host in "ubuntu@51.68.155.224" "ubuntu@145.239.133.104" "root@54.36.123.110"; do
  echo "=== $host ==="
  ssh -o ConnectTimeout=10 "$host" "mongod --version 2>/dev/null | head -1" 2>/dev/null
done

Current versions:

  • bms-3: 7.0.26
  • bms-2: 7.0.25
  • bms-4: 7.0.37

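Patch-version skew is acceptable within the 7.0.x series. Plan a rolling update when any member falls 2+ patch releases behind the current 7.0.x stable. By that rule, bms-3 (7.0.26) and bms-2 (7.0.25) are already due, since both trail bms-4 (7.0.37) by more than two patch releases.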
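
The skew rule can be checked mechanically from the version strings collected by the loop above. A minimal sketch, assuming the newest version found in the set stands in for "current 7.0.x stable" (an assumption; the real current release has to come from the MongoDB release notes). Function names are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Split 'major.minor.patch' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def members_needing_update(versions: dict, max_patch_gap: int = 2) -> list:
    """Members whose patch level trails the newest member by max_patch_gap or more."""
    parsed = {host: parse_version(v) for host, v in versions.items()}
    series = {p[:2] for p in parsed.values()}
    if len(series) != 1:
        # Mixed major/minor series is a different problem than patch skew
        raise ValueError(f"members span several release series: {sorted(series)}")
    newest_patch = max(p[2] for p in parsed.values())
    return sorted(h for h, p in parsed.items() if newest_patch - p[2] >= max_patch_gap)


if __name__ == "__main__":
    # Versions as listed in this section
    print(members_needing_update({"bms-3": "7.0.26", "bms-2": "7.0.25", "bms-4": "7.0.37"}))
```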

16.2 Rolling Member Restart (for mongod updates)

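Never restart all members simultaneously. Always follow this order to limit downtime:

  1. Restart the SECONDARY first (bms-2 observer, least impact)
  2. Wait for the SECONDARY to rejoin and catch up (lag = 0)
  3. Restart the ARBITER (bms-4): no data, fast restart. rs0 has only two votes, so bms-3 drops to SECONDARY and writes pause until the arbiter is back (see 18.3)
  4. Step down the PRIMARY (bms-3). bms-2 has priority 0, so no other member is electable: rs.stepDown() fails unless bms-2 is first made electable (see 18.2, option B)
  5. If bms-2 was made electable, wait for the new PRIMARY election
  6. Restart the former PRIMARY (bms-3). Without an electable bms-2, writes are unavailable until bms-3 is back
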
# Steps 1-2: Restart bms-2 (SECONDARY observer), then wait for it to rejoin
ssh ubuntu@145.239.133.104 "sudo systemctl restart mongod"
sleep 30
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const s = rs.status();
  const primary = s.members.find(m => m.stateStr === \"PRIMARY\");
  const bms2 = s.members.find(m => m.name.includes(\"145.239\"));
  // Lag is measured against the PRIMARY optime, not the wall clock
  print(bms2.name, bms2.stateStr, \"lag:\", (primary.optimeDate - bms2.optimeDate) / 1000 + \"s\");
'"
 
# Step 3: Restart bms-4 (ARBITER, only if joined)
ssh root@54.36.123.110 "systemctl restart mongod"
sleep 10
 
# Steps 4-5: Step down bms-3 PRIMARY.
# rs.stepDown() throws if no electable secondary is caught up, which is the case
# while bms-2 has priority 0; bms-3 then stays PRIMARY until the restart below.
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval 'rs.stepDown(60)'"
# Wait for election (typically < 12s)
sleep 15
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const p = rs.status().members.find(m => m.stateStr === \"PRIMARY\");
  if (p) print(\"New PRIMARY:\", p.name); else print(\"No PRIMARY yet, wait\");
'"
 
# Step 6: Restart former PRIMARY (bms-3)
ssh ubuntu@51.68.155.224 "sudo systemctl restart mongod"
sleep 30
 
# Final check: verify rs0 fully healthy
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  rs.status().members.forEach(m => print(m.name, m.stateStr, \"health:\", m.health));
'"

16.3 Index Optimization

// Identify unused indexes (run on PRIMARY during weekly maintenance)
// Run: mongosh --quiet --eval '...' on bms-3
 
// For each database, check index usage stats
db.adminCommand({listDatabases: 1}).databases
  .filter(d => !['admin', 'local', 'config'].includes(d.name))
  .forEach(d => {
    const db2 = db.getSiblingDB(d.name);
    db2.getCollectionNames().forEach(coll => {
      db2.runCommand({aggregate: coll, pipeline: [
        {$indexStats: {}},
        {$match: {name: {$ne: '_id_'}, 'accesses.ops': {$lt: 5}}}  // Used < 5 times since the last mongod restart; _id index excluded
      ], cursor: {}}).cursor.firstBatch.forEach(idx => {
        print(`UNUSED: ${d.name}.${coll} index: ${JSON.stringify(idx.key)} ops:${idx.accesses.ops}`);
      });
    });
  });

16.4 Oplog Size Adjustment

By default, MongoDB 7.0 (WiredTiger) sizes the oplog at 5% of free disk space, with a lower bound of 990 MB and an upper bound of 50 GB. For bms-3 (410 GB disk, ~170 GB used), that works out to roughly 10 to 12 GB.

// Check current oplog configuration
use local;
db.oplog.rs.stats().maxSize / 1024 / 1024 / 1024 + " GB"
 
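// Resize the oplog if needed (MongoDB 3.6+, WiredTiger). No restart is required.
// Run on each data-bearing member (bms-3 and bms-2); size is given in megabytes:
db.adminCommand({ replSetResizeOplog: 1, size: 10240 });  // 10 GB
// Changing replication.oplogSizeMB in /etc/mongod.conf has no effect once the
// oplog exists; that setting only applies when the oplog is first created.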

Recommended oplog size: 10 GB on bms-3 (covers ~72h of operations at current write volume).
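
The sizing arithmetic behind these numbers can be sketched as follows. The default-size formula (5% of free disk, clamped to 990 MB and 50 GB) is MongoDB's documented WiredTiger default; the write rate is not measured anywhere in this document, so the rate implied by "10 GB covers ~72h" is used only as an illustration. Function names are hypothetical.

```python
def default_oplog_mb(free_disk_mb: float) -> float:
    """WiredTiger default: 5% of free disk, at least 990 MB, at most 50 GB."""
    return min(max(free_disk_mb / 20, 990), 50 * 1024)


def oplog_window_hours(oplog_mb: float, write_mb_per_hour: float) -> float:
    """Hours of operations the oplog can hold at a steady write rate."""
    return oplog_mb / write_mb_per_hour


def required_oplog_mb(target_hours: float, write_mb_per_hour: float) -> float:
    """Oplog size needed to retain target_hours of operations."""
    return target_hours * write_mb_per_hour


if __name__ == "__main__":
    free_mb = (410 - 170) * 1024              # bms-3: ~240 GB free
    print(default_oplog_mb(free_mb))          # default size in MB
    implied_rate = 10240 / 72                 # MB/hour implied by "10 GB covers ~72h"
    print(required_oplog_mb(24, implied_rate))  # size for the 24h minimum window
```

The real write rate should be taken from the weekly oplog window measurement before resizing.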


17. Backup Strategy

Backup Architecture

| Backup type | Source | Destination | Schedule | Retention |
|---|---|---|---|---|
| mongodump (full) | bms-3 (PRIMARY) | Wasabi p24-infra/mongodb/ | Weekly, Sunday 01:00 UTC | 4 weeks |
| mongodump (incremental/oplog) | bms-3 (PRIMARY) | Wasabi p24-infra/mongodb/oplog/ | Daily, 01:00 UTC | 7 days |
| bms-2 disk snapshot | bms-2 (observer) | OVH snapshot API | Monthly | 2 snapshots |

Why backup from PRIMARY (bms-3) not from SECONDARY? bms-2 is designated as observer and dev environment host — its MongoDB data is kept current but not treated as the backup source. Backups run from PRIMARY to ensure the most up-to-date data.

Why not backup from arbiter (bms-4)? Arbiters store no data.
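
The retention periods in the backup architecture table (4 weeks for full dumps, 7 days for oplog dumps) imply a pruning step that the scripts in this section do not include. The sketch below shows one way to select expired objects, assuming the key layout used by the backup scripts (`mongodb/full/mongodb-full-YYYY-MM-DD.tar.gz` and `mongodb/oplog/oplog-YYYY-MM-DD.tar.gz`). It only computes which keys are past retention; the function name is hypothetical and no deletion is performed.

```python
import re
from datetime import date, timedelta

# Backup date embedded at the end of each object key
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})\.tar\.gz$")


def expired_keys(keys, today,
                 full_retention=timedelta(weeks=4),
                 oplog_retention=timedelta(days=7)):
    """Return the keys whose embedded date is older than the retention period."""
    expired = []
    for key in keys:
        match = DATE_RE.search(key)
        if not match:
            continue  # never select objects whose date cannot be parsed
        taken = date(*map(int, match.groups()))
        if key.startswith("mongodb/full/"):
            limit = full_retention
        elif key.startswith("mongodb/oplog/"):
            limit = oplog_retention
        else:
            continue  # unrelated prefixes are left alone
        if today - taken > limit:
            expired.append(key)
    return expired
```

The key list would come from `list_objects_v2` on the p24-infra bucket; reviewing the returned list before passing it to any delete call keeps the step auditable.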

Daily Oplog Backup Script

#!/usr/bin/env bash
# /root/mongodb-oplog-backup.sh on bms-3
set -euo pipefail
DATE=$(date -u +%Y-%m-%d)
BACKUP_DIR="/tmp/mongodb-oplog-${DATE}"
S3_KEY="mongodb/oplog/oplog-${DATE}.tar.gz"
 
# Dump oplog only (last 24h of operations)
mongodump \
  --host 127.0.0.1:27017 \
  --authenticationDatabase admin \
  -u admin -p "$MONGODB_ADMIN_PASSWORD" \
  --db local \
  --collection oplog.rs \
  --query '{"ts": {"$gte": {"$timestamp": {"t": '"$(date -u --date='25 hours ago' +%s)"', "i": 0}}}}' \
  --out "$BACKUP_DIR"
 
# Compress
tar czf "/tmp/${DATE}-oplog.tar.gz" -C "$BACKUP_DIR" .
rm -rf "$BACKUP_DIR"
 
# Upload to Wasabi
python3 - << PYEOF
import boto3, certifi, os
s3 = boto3.client("s3",
    endpoint_url="https://s3.eu-central-2.wasabisys.com",
    region_name="eu-central-2",
    verify=certifi.where()
)
with open(f"/tmp/${DATE}-oplog.tar.gz", "rb") as f:
    s3.upload_fileobj(f, "p24-infra", "${S3_KEY}")
print(f"Uploaded: s3://p24-infra/${S3_KEY}")
PYEOF
 
rm -f "/tmp/${DATE}-oplog.tar.gz"

Weekly Full Backup Script

#!/usr/bin/env bash
# /root/mongodb-full-backup.sh on bms-3
set -euo pipefail
DATE=$(date -u +%Y-%m-%d)
BACKUP_DIR="/tmp/mongodb-full-${DATE}"
S3_KEY="mongodb/full/mongodb-full-${DATE}.tar.gz"
 
# Full dump of all databases
mongodump \
  --host 127.0.0.1:27017 \
  --authenticationDatabase admin \
  -u admin -p "$MONGODB_ADMIN_PASSWORD" \
  --oplog \
  --out "$BACKUP_DIR"
 
# Compress
tar czf "/tmp/${DATE}-full.tar.gz" -C "$BACKUP_DIR" .
BACKUP_SIZE=$(du -sh "/tmp/${DATE}-full.tar.gz" | cut -f1)
rm -rf "$BACKUP_DIR"
 
# Upload to Wasabi
python3 - << PYEOF
import boto3, certifi, os, datetime
s3 = boto3.client("s3",
    endpoint_url="https://s3.eu-central-2.wasabisys.com",
    region_name="eu-central-2",
    verify=certifi.where()
)
with open(f"/tmp/${DATE}-full.tar.gz", "rb") as f:
    s3.upload_fileobj(f, "p24-infra", "${S3_KEY}")
print(f"Uploaded: s3://p24-infra/${S3_KEY} (${BACKUP_SIZE})")
PYEOF
 
rm -f "/tmp/${DATE}-full.tar.gz"
echo "Weekly backup complete: $S3_KEY ($BACKUP_SIZE)"

Backup Verification (Weekly)

# Verify that the last full backup downloads and its archive lists cleanly.
# This is a structural check only; proving restorability would need mongorestore into a scratch mongod.
DATE=$(date -u --date='last Sunday' +%Y-%m-%d)
S3_KEY="mongodb/full/mongodb-full-${DATE}.tar.gz"
 
# Download and verify (bms-2 has spare disk and RAM)
ssh ubuntu@145.239.133.104 "
  python3 -c \"
import boto3, certifi
s3 = boto3.client('s3', endpoint_url='https://s3.eu-central-2.wasabisys.com',
    region_name='eu-central-2', verify=certifi.where())
s3.download_file('p24-infra', '${S3_KEY}', '/tmp/backup-verify.tar.gz')
print('Download OK:', s3.head_object(Bucket='p24-infra', Key='${S3_KEY}')['ContentLength'], 'bytes')
\"
  # Extract and verify structure
  tar tzf /tmp/backup-verify.tar.gz | head -20
  rm -f /tmp/backup-verify.tar.gz
  echo 'Backup structure verified'
"

18. Failover Runbook

18.1 Planned Failover (Maintenance Step-down)

Use this when taking bms-3 (PRIMARY) offline for maintenance.

Prerequisites:

  • Confirm bms-2 (SECONDARY) is caught up (lag = 0)
  • Confirm bms-4 (ARBITER) is healthy
  • Notify team via Discord before starting

// 1. Check pre-conditions
rs.status().members.forEach(m => print(m.name, m.stateStr, "health:", m.health));
 
// 2. Step down bms-3 for 120s. Run on bms-3.
// rs.stepDown() only succeeds when an electable secondary is caught up.
// bms-2 has priority 0 and votes 0, so it must first be made electable
// (see the reconfig in Section 18.2, Step 4); otherwise this call throws
// and bms-3 stays PRIMARY.
rs.stepDown(120);
 
// 3. Verify which member is PRIMARY. Run on bms-2.
// bms-4 (ARBITER) supplies the second vote needed for a majority, so it
// must stay up for the whole procedure.
rs.status();

Important quorum consideration: rs0 has:

  • bms-3: 1 vote (PRIMARY or SECONDARY)
  • bms-2: 0 votes (observer — non-voting)
  • bms-4: 1 vote (ARBITER)

Total votes: 2. Majority needed: 2. If arbiter (bms-4) is DOWN, bms-3 cannot hold PRIMARY (loses majority = only 1/2 votes). The arbiter’s presence is essential for quorum.
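
The quorum arithmetic can be checked with a few lines of code. A replica set needs a strict majority of its voting members, floor(votes / 2) + 1, and at least one reachable data-bearing member with priority above 0. This is a sketch using the member settings listed above; the data structure and function names are hypothetical.

```python
# Member settings as listed in this section
RS0 = {
    "bms-3": {"votes": 1, "priority": 1, "arbiter": False},
    "bms-2": {"votes": 0, "priority": 0, "arbiter": False},
    "bms-4": {"votes": 1, "priority": 0, "arbiter": True},
}


def majority(total_votes: int) -> int:
    """Votes needed to elect or keep a PRIMARY."""
    return total_votes // 2 + 1


def can_have_primary(members: dict, reachable: set) -> bool:
    """True if the reachable members hold a majority and include an electable node."""
    total = sum(m["votes"] for m in members.values())
    up_votes = sum(members[name]["votes"] for name in reachable)
    electable = any(
        members[name]["priority"] > 0 and not members[name]["arbiter"]
        for name in reachable
    )
    return up_votes >= majority(total) and electable


if __name__ == "__main__":
    print(majority(2))                                     # votes needed in rs0
    print(can_have_primary(RS0, {"bms-3", "bms-2"}))       # arbiter down
    print(can_have_primary(RS0, {"bms-2", "bms-4"}))       # bms-3 down
```

Both outage cases return False with the current configuration, which is why the loss of either bms-3 or bms-4 stops writes.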

18.2 Emergency Failover (Primary Fails Unexpectedly)

# Step 1: Verify the situation
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval 'rs.status()'" 2>/dev/null \
  || echo "bms-3 unreachable"
 
# Step 2: Check from bms-2 perspective
ssh ubuntu@145.239.133.104 "mongosh --quiet --eval '
  const s = rs.status();
  s.members.forEach(m => print(m.name, m.stateStr, m.health));
  print(\"myState:\", s.myState);
'"
 
# Step 3: If arbiter (bms-4) is healthy and bms-3 is truly down:
# The replica set CANNOT elect bms-2 as PRIMARY because bms-2 has priority:0
# The set will be in READ-ONLY state (only 1 vote: arbiter, but arbiter cannot become primary)
# 
# RECOVERY OPTIONS:
# A) Restore bms-3 and let it rejoin as PRIMARY (preferred)
# B) Change bms-2 priority to 1 to allow it to become PRIMARY (emergency only)
#    Requires admin credentials — this is a HUMAN ACTION
 
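# Step 4 (Human Action): Temporarily promote bms-2
# Run on bms-2 with admin credentials. With no PRIMARY available, rs.reconfig()
# is rejected unless it is forced:
# cfg = rs.conf();
# const i = cfg.members.findIndex(m => m.host.startsWith("145.239.133.104"));  // bms-2
# cfg.members[i].priority = 1;
# cfg.members[i].votes = 1;
# rs.reconfig(cfg, { force: true });
# rs.status();  // should elect bms-2 as PRIMARY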
 
# Step 5: Alert human operator immediately
source ~/.claude-env
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content":"CRITICAL: MongoDB rs0 has no PRIMARY. bms-3 may be down. HUMAN ACTION REQUIRED immediately. See runbook section 18.2."}'

18.3 Arbiter Loss Recovery

If bms-4 (ARBITER) is down:

  • Immediate impact: bms-2 is non-voting, so without the arbiter bms-3 holds only 1 of the 2 votes. The majority is 2, so rs0 loses quorum and bms-3 steps down to SECONDARY
  • Critical impact: All writes to MongoDB stop until the arbiter is restored
  • Resolution: Restart mongod on bms-4 first, since it is the quickest recovery path

ssh root@54.36.123.110 "
  systemctl status mongod
  systemctl restart mongod
  sleep 10
  systemctl status mongod
"
 
# Verify arbiter rejoined
ssh ubuntu@51.68.155.224 "mongosh --quiet --eval '
  const arb = rs.status().members.find(m => m.stateStr === \"ARBITER\");
  if (arb) print(\"Arbiter OK:\", arb.name); else print(\"Arbiter NOT found\");
'"

19. Capacity Planning Guide

Current Baseline (2026-06-14)

| Server | RAM Total | RAM Used | Disk Total | Disk Used | Growth Risk |
|---|---|---|---|---|---|
| bms-3 (PRIMARY) | 32 GB | 21.7 GB (MongoDB) + ~4 GB Docker | 410 GB | 170 GB (44%) | HIGH (OOM risk) |
| bms-2 (observer) | 32 GB | ~2 GB (MongoDB replica) | 410 GB | 62 GB (16%) | LOW |
| bms-4 (arbiter) | 32 GB | ~75 MB (mongod arbiter) | 1.8 TB | 8.3 GB (1%) | VERY LOW |

RAM Capacity Thresholds — bms-3

| Available RAM | Action |
|---|---|
| > 8 GB | Normal |
| 4–8 GB | Warning: monitor hourly |
| 2–4 GB | Alert: plan immediate maintenance window |
| < 2 GB | Critical: initiate emergency mongod restart or workload migration |
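
The RAM thresholds above can be applied to the MemAvailable value that Linux reports in /proc/meminfo, which is the same quantity node_exporter exposes as node_memory_MemAvailable_bytes. A minimal sketch with hypothetical function names; the table does not say which row a value of exactly 4 or 8 GB belongs to, so the boundary handling here is an assumption.

```python
def parse_mem_available_gb(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo text and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def ram_action(available_gb: float) -> str:
    """Map available RAM on bms-3 to the action from the threshold table."""
    if available_gb > 8:
        return "normal"
    if available_gb >= 4:
        return "warning: monitor hourly"
    if available_gb >= 2:
        return "alert: plan immediate maintenance window"
    return "critical: emergency mongod restart or workload migration"
```

On bms-3 the input text would come from `cat /proc/meminfo` over SSH.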

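bms-3 RAM forecast: By default, MongoDB 7.0 sizes the WiredTiger cache at the larger of 50% of (RAM - 1 GB) or 256 MB, which is about 15.5 GB on a 32 GB host. The current data set appears to fill this cache. mongod also uses memory outside the cache (connections, query execution), so total usage can exceed the cache size and may rise further as data grows. Watch the WiredTiger "bytes currently in the cache" value from serverStatus.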

// Check WiredTiger cache usage
use admin;
const st = db.serverStatus().wiredTiger.cache;
print("Cache target:", st["maximum bytes configured"] / 1024/1024/1024 + " GB");
print("Cache used:", st["bytes currently in the cache"] / 1024/1024/1024 + " GB");
print("Evictions:", st["pages evicted by application threads"]);
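
For reference, the default cache size can be computed from total RAM. MongoDB documents the WiredTiger default as the larger of 50% of (RAM - 1 GB) or 256 MB; the function name below is hypothetical, and the result only applies when storage.wiredTiger.engineConfig.cacheSizeGB is not set explicitly.

```python
def default_wiredtiger_cache_gb(total_ram_gb: float) -> float:
    """Default WiredTiger cache: max(50% of (RAM - 1 GB), 256 MB), in GB."""
    return max(0.5 * (total_ram_gb - 1), 0.25)


if __name__ == "__main__":
    # bms-3, bms-2, and bms-4 each have 32 GB of RAM
    print(default_wiredtiger_cache_gb(32))
```

Comparing this figure with "maximum bytes configured" from the query above shows whether the default or an explicit setting is in effect.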

Disk Capacity Thresholds — bms-3

MongoDB data directory + Docker images + logs on 410 GB:

  • MongoDB: ~150 GB estimated (oplog + data)
  • Docker: ~20 GB (staging images)
  • Available headroom: ~200 GB at 44% used

Action triggers:

  • 60% disk (246 GB used) → run docker image prune
  • 70% disk (287 GB used) → evaluate moving old staging versions to archive
  • 80% disk (328 GB used) → emergency cleanup or MongoDB oplog resize

Key Metrics to Track Weekly

  1. db.serverStatus().opcounters — query/insert/update/delete rate
  2. Oplog window (hours) — must stay > 24h for safe secondary operations
  3. Replication lag trend — should stay near 0
  4. db.serverStatus().connections.current — connection count trend
  5. WiredTiger cache eviction rate — high eviction = memory pressure

Capacity Forecast Queries

// Database size growth (run weekly, track over time)
use admin;
db.adminCommand({listDatabases: 1}).databases
  .filter(d => d.name !== 'local')
  .sort((a, b) => b.sizeOnDisk - a.sizeOnDisk)
  .forEach(d => {
    print(`${d.name}: ${(d.sizeOnDisk / 1024/1024/1024).toFixed(3)} GB`);
  });
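
Once weekly size samples exist, a simple linear extrapolation gives a rough estimate of when a disk action trigger will be reached. This is a sketch, not a forecast model: it fits a straight line through the first and last sample and assumes growth stays constant. The function name and the sample format are hypothetical.

```python
from datetime import date


def days_until_threshold(samples, threshold_gb):
    """Estimate days until used space reaches threshold_gb.

    samples: list of (date, used_gb) tuples sorted by date, at least two entries.
    Returns 0.0 if already at or above the threshold, None if usage is flat
    or shrinking, otherwise the number of days from the last sample.
    """
    (first_day, first_gb), (last_day, last_gb) = samples[0], samples[-1]
    span_days = (last_day - first_day).days
    if span_days <= 0:
        raise ValueError("samples must cover at least one day")
    if last_gb >= threshold_gb:
        return 0.0
    rate = (last_gb - first_gb) / span_days  # GB per day
    if rate <= 0:
        return None
    return (threshold_gb - last_gb) / rate
```

For bms-3 the thresholds would be the action triggers listed earlier (246, 287, and 328 GB), with samples taken from the weekly `df` or listDatabases output.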

20. Alert Definitions

Prometheus Alert Rules for MongoDB rs0

Add to monitoring/prometheus/rules/mongodb.yml:

groups:
  - name: mongodb_rs0
    interval: 60s
    rules:
 
      # ─── Connectivity ──────────────────────────────────────────────────────────
 
      - alert: MongoDBMemberDown
        expr: |
          # Port probe — until mongodb-exporter available
          probe_success{job="blackbox_tcp", instance=~".*27017.*"} == 0
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "MongoDB member unreachable: {{ $labels.instance }}"
          description: "MongoDB port 27017 not responding on {{ $labels.instance }}. Check if mongod is running."
 
      # ─── RAM pressure on bms-3 ─────────────────────────────────────────────────
 
      - alert: BMS3MemoryCritical
        expr: |
          node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3 < 2
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "bms-3 available RAM < 2 GB"
          description: "MongoDB PRIMARY (bms-3) has < 2 GB available RAM. OOM risk is high. MongoDB is using ~21.7 GB. Immediate action required."
 
      - alert: BMS3MemoryWarning
        expr: |
          node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} / 1024^3 < 4
        for: 10m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "bms-3 available RAM < 4 GB"
          description: "MongoDB PRIMARY (bms-3) memory is getting low. Available: {{ $value | humanize }} GiB. Plan maintenance window."
 
      # ─── Disk usage ────────────────────────────────────────────────────────────
 
      - alert: BMS3DiskWarning
        expr: |
          (node_filesystem_size_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"} -
           node_filesystem_avail_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"}) /
           node_filesystem_size_bytes{server="p4-ovh-bms-3-ns3129867", mountpoint="/"} > 0.70
        for: 15m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "bms-3 disk usage > 70%"
          description: "Disk on bms-3 is {{ $value | humanizePercentage }} full. Run docker image prune and check MongoDB oplog size."
 
      - alert: BMS1DiskCritical
        expr: |
          (node_filesystem_size_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"} -
           node_filesystem_avail_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"}) /
           node_filesystem_size_bytes{server="p4-ovh-bms-1-ns367522", mountpoint="/"} > 0.95
        for: 5m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "bms-1 (Pinbox24 PRODUCTION) disk > 95%"
          description: "Pinbox24 production server disk at {{ $value | humanizePercentage }}. Writes may fail. EMERGENCY: immediate cleanup required."
 
      # ─── Replication (once mongodb-exporter is deployed) ──────────────────────
 
      - alert: MongoDBReplicationLagHigh
        expr: |
          # Placeholder — replace with actual mongodb-exporter metric
          # scalar(mongodb_rs_member_optime_date{state="PRIMARY"}) - mongodb_rs_member_optime_date{state="SECONDARY"} > 60
          absent(up{job="mongodb"}) == 1
        for: 5m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "MongoDB exporter not yet deployed"
          description: "Deploy mongodb-exporter to get replication lag metrics. Manual check: ssh ubuntu@51.68.155.224 mongosh --eval 'rs.status()'"
 
      # ─── Blackbox TCP probes (to create in prometheus.yml) ───────────────────
 
      # Add to monitoring/prometheus/prometheus.yml blackbox job:
      # - targets:
      #   - 51.68.155.224:27017   # bms-3 MongoDB
      #   - 145.239.133.104:27017  # bms-2 MongoDB
      #   - 54.36.123.110:27017   # bms-4 MongoDB arbiter
      #   labels: { job: blackbox_tcp, module: tcp_connect }

Alertmanager Routing for MongoDB

Add under the top-level route: block in monitoring/alertmanager/config.yml:

routes:
  - match:
      team: infra
      severity: critical
    receiver: discord-critical
    repeat_interval: 30m
    continue: true
  - match:
      team: infra
      severity: warning
    receiver: discord-warning
    repeat_interval: 4h

Manual Alert Test

# Send test alert via Alertmanager API (v2; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://217.154.82.162:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "MongoDBTestAlert",
      "severity": "warning",
      "team": "infra",
      "instance": "54.36.123.110:27017"
    },
    "annotations": {
      "summary": "Test alert from runbook",
      "description": "Manual test of MongoDB alert routing"
    },
    "endsAt": "'"$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)"'"
  }]'

Appendix: Quick Reference Card

rs0 Member Summary

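| Server | IP | Port | Role | Votes | Priority |
|---|---|---|---|---|---|
| bms-3 (ns3129867) | 51.68.155.224 | 27017 | PRIMARY/SECONDARY | 1 | 1 |
| bms-2 (ns3087638) | 145.239.133.104 | 27017 | SECONDARY observer | 0 | 0 |
| bms-4 (ns3101999) | 54.36.123.110 | 27017 | ARBITER | 1 | 0 |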

Quorum: 2 votes required. bms-3 (1) + bms-4 (1) = majority. If arbiter down, no quorum.

Emergency Contacts

| Issue | First action | Escalate if |
|---|---|---|
| No MongoDB PRIMARY | Check arbiter health first | Arbiter healthy but still no PRIMARY |
| bms-1 disk 100% | Run docker prune | Disk still 100% after cleanup |
| WAHA session down | POST /api/sessions/default/restart | Session fails to restart |
| Atrax GPS stale > 30 min | Check n8n container, trigger workflow | n8n healthy but workflow still fails |

AI-Dev-BMS4 Status Check

# From local workstation — check if nightly ran successfully
ssh root@54.36.123.110 "tail -20 /var/log/bms4-nightly.log"
 
# Check last cron execution time
ssh root@54.36.123.110 "ls -la /var/log/bms4-nightly.log && grep 'END' /var/log/bms4-nightly.log | tail -3"
 
# Check active agent sessions on bms-4
# (requires SUPABASE credentials)