BMS Modernization Plan

Date: 2026-06-14
Scope: All OVH Kimsufi bare metal servers — bms-1 through bms-4
Author: Claude Code (p24-infra admin role)
Status: Active planning document — update as phases complete


1. Executive Summary

Five critical findings require immediate action before any new work proceeds:

  1. bms-1: Ubuntu 20.04 has been EOL since April 2025 and is running live production. It has received no security patches for 14+ months, so every vulnerability fixed upstream since April 2025 remains unpatched on the server hosting Pinbox24 production traffic.

  2. bms-1: No off-server backup exists. 24 production Docker containers, a PostgreSQL database, a Redis instance, and a PM2 service have no off-site backup. A single disk failure, ransomware event, or OVH hardware incident would cause unrecoverable data loss.

  3. bms-1: Three containers (v32-prod-socket, v32-prod-reso, v41-prod) run from locally-stored untagged images with no registry source. If these containers stop for any reason — host reboot, OOM kill, Docker daemon restart — they cannot be restarted. They are live production services.

  4. bms-3: MongoDB using 21.7 GB RAM on a 32 GB server shared with 11 Docker containers. Free RAM is approximately 10 GB. Any memory spike — Docker build, staging load test, container restart — risks OOM-killing the MongoDB primary, triggering an unplanned rs0 election.

  5. Monitoring gap: bms-2 and bms-3 have no node_exporter and are not in Prometheus. Issues on these servers — disk fill, RAM exhaustion, high CPU — are invisible until they cause an outage.

Immediate actions (before any phase work):

  • Freeze bms-1: no restarts, no new deploys until Phase 1 is complete
  • Export and safely store the three untagged Docker images from bms-1
  • Install node_exporter on bms-2 and bms-3 and connect to Prometheus

2. Current State Assessment

Security Scorecard

Category | bms-1 | bms-2 | bms-3 | bms-4
OS currency | CRITICAL (Ubuntu 20.04 EOL Apr 2025) | Good (Ubuntu 24.04 LTS) | Fair (Ubuntu 22.04, EOL Apr 2027) | Fair (Ubuntu 22.04, EOL Apr 2027)
SSH security | Poor (root login, no fail2ban, no ufw) | Fair (ubuntu user, no fail2ban) | Fair (ubuntu user, no fail2ban) | Poor (root-only login, no fail2ban)
Firewall | Poor (ufw inactive, iptables partial) | Unknown | Unknown | Unknown
Docker security | Poor (Docker 24, old; Portainer v1 legacy) | N/A (no Docker) | Fair (Docker present, Portainer legacy) | Good (Docker CE 29.5.3 current)
MongoDB security | N/A | Good (auth + keyFile, port 0.0.0.0 concern) | Good (auth + keyFile, port 0.0.0.0 concern) | Good (auth + keyFile, arbiter only)
Off-server backup | CRITICAL (none) | Good (no stateful data) | Fair (MongoDB not explicitly backed up) | Good (no production data yet)
Monitoring coverage | Partial (node_exporter installed 2026-06-14) | None | None | Good (node_exporter + cadvisor)
Secrets management | Unknown (env vars in containers, not inventoried) | Fair (.env.local reference) | Unknown | Fair (.env on server)
Unattended upgrades | Off (EOL OS, moot) | Unknown | Unknown | Unknown
Disk encryption | None (OVH bare metal, no LUKS) | None | None | None
Log aggregation | None (Netdata local only) | None | None | None
Image tagging | CRITICAL (3 untagged images) | N/A | Fair (all from ECR) | Good (all from registries)

Summary Score (1 = critical, 5 = good)

Server | Score | Primary concern
bms-1 | 1/5 | EOL OS + no backups + untagged images
bms-2 | 3/5 | Monitoring gap + pending claude-admin setup
bms-3 | 2/5 | MongoDB OOM risk + no monitoring + Ubuntu 22.04 lifecycle
bms-4 | 4/5 | Newly provisioned; pending n8n migration + hardening

3. Priority Issues Register

All issues across all servers sorted by priority.

P1 — Critical: Risk of data loss or unrecoverable service failure

ID | Server | Issue | Impact
P1-01 | bms-1 | Ubuntu 20.04 EOL (April 2025); no security patches for 14 months | Active CVE exposure on live production
P1-02 | bms-1 | No off-server backup of any service | Total data loss on hardware failure
P1-03 | bms-1 | Three untagged Docker images in use (v32-prod-socket, v32-prod-reso, v41-prod) | Unrecoverable service failure on any container stop
P1-04 | bms-1 | Disk at 85% used; ~170 GB unreviewed backups in /root | Will reach 100% again; production containers stop when disk is full
P1-05 | bms-3 | MongoDB PRIMARY using 21.7 GB of 32 GB RAM shared with 11 containers | OOM kill of MongoDB PRIMARY → unplanned rs0 election

P2 — High: Risk of unplanned downtime or significant security exposure

ID | Server | Issue | Impact
P2-01 | bms-1 | Portainer v1 (5 years old) exposed on port 49154 | Unpatched management UI; known CVEs in Portainer v1
P2-02 | bms-1 | s3-v42-prod-02-25-old deprecated container still running | Resource waste; attack surface
P2-03 | bms-1 | PM2 NodeChat v1.0.0 at :3001, 4 years running, unknown traffic | If serving live traffic, no recovery plan if it crashes
P2-04 | bms-1 | Port :8081, unknown Node.js process | Unknown service; potential security exposure
P2-05 | bms-1 | PostgreSQL and Redis running host-native, not backed up | Silent data loss path
P2-06 | bms-1 | ufw inactive, no firewall policy | All ports reachable from internet by default
P2-07 | bms-2 | No node_exporter / Prometheus monitoring | Blind to disk fill, RAM, CPU on MongoDB observer
P2-08 | bms-3 | No node_exporter / Prometheus monitoring | Blind to impending OOM on MongoDB PRIMARY
P2-09 | bms-3 | MongoDB port 27017 bound to 0.0.0.0; firewall state unknown | Potential direct internet exposure of database port
P2-10 | bms-2 | MongoDB port 27017 bound to 0.0.0.0; firewall state unknown | Potential direct internet exposure of database port
P2-11 | bms-4 | n8n migration from vps-h1 pending; workflows running on unrelated host | Split risk: n8n failure on vps-h1 affects bms-4 plans
P2-12 | bms-4 | rs.addArb (MongoDB arbiter join) pending; rs0 has no quorum node | rs0 is running without its intended 3-node quorum

P3 — Medium: Security hardening not yet implemented

ID | Server | Issue | Impact
P3-01 | ALL | No fail2ban on any server | Brute-force SSH attempts unchecked
P3-02 | ALL | No unattended-upgrades configured | Security patches require manual application
P3-03 | bms-1 | Root SSH login with direct key access | Weaker access model; violates principle of least privilege
P3-04 | bms-4 | Root-only SSH, no claude-admin user | Inconsistent with bms-2/bms-3 access model
P3-05 | bms-1 | GitLab runner active; unclear if used alongside git-deploy container | Parallel deploy paths; audit and consolidation needed
P3-06 | bms-3 | Disk at 44%; no automated alert below 70% | Risk of undetected disk fill as staging workloads grow
P3-07 | ALL | No centralized log aggregation (Loki/Promtail) | Post-incident forensics impossible; no log-based alerts
P3-08 | bms-1 | v3.x containers using 4–5-year-old images from private registry | Potentially unreachable registry; severe unpatched vulnerabilities
P3-09 | bms-3 | Staging containers with ECR images up to 6 months since pull | Silent vulnerability drift
P3-10 | bms-1 | /var/log at 16 GB; no logrotate configured | Will contribute to disk fill
P3-11 | bms-1 | Netdata running locally but not integrated with Prometheus | Wasted monitoring resource; double agent overhead
P3-12 | ALL | No Docker image CVE scanning (Trivy/Grype) | Vulnerabilities in deployed containers undiscovered
P3-13 | bms-2 | claude-admin user not yet created | Claude agent access not configured
P3-14 | bms-3 | claude-admin user not yet created | Claude agent access not configured

P4 — Low: Modernization and lifecycle management

ID | Server | Issue | Impact
P4-01 | bms-3 | Ubuntu 22.04 LTS; EOL April 2027 (10 months away) | Plan migration before EOL window; MongoDB downtime risk if rushed
P4-02 | bms-4 | Ubuntu 22.04 LTS; EOL April 2027 | Same as P4-01
P4-03 | bms-1 | Pinbox24 v3.x (legacy stack); clients still on v3.x | Technical debt; v3.x images 4–5 years old, EOL stack
P4-04 | bms-1 | Registry consolidation: v3.x from private-registry.dev.pinbox24.com (possibly unreachable) vs v4.x from ECR | Risk of being unable to pull images for v3.x recovery
P4-05 | bms-2 | Claude Code not yet installed (AI-Dev-OV1 pending) | bms-2 not operational as AI dev environment
P4-06 | bms-1 | Docker 24 (old); bms-4 already runs Docker CE 29.5.3 | Old Docker misses security fixes and compose improvements
P4-07 | bms-3 | Portainer v1 (legacy) | Same as bms-1
P4-08 | bms-1 | OpenBao secrets migration; env vars in containers not managed | No rotation capability; credential sprawl
P4-09 | ALL | Disk encryption (LUKS at rest) not implemented | Physical server access bypasses all OS security

4. Phase 1: Critical Security

Goal: Eliminate all P1 issues across all servers before proceeding.
Target completion: Within 2 weeks of plan approval.
Prerequisite: Maintenance window agreement with Pinbox24 operators for bms-1 work.


Task 1.1 — bms-1: Export and re-tag untagged Docker images

Priority: P1-03
Effort: 2 hours
Risk: Low (read-only operation, no service changes)
Rollback: N/A (no changes to running services)

Three containers run from locally-stored image IDs with no registry source. They cannot be recovered after a container stop or host reboot.

ssh root@94.23.26.113
 
# Identify image IDs for the three untagged containers
docker inspect v32-prod-socket v32-prod-reso v41-prod \
  --format '{{.Name}}: Image={{.Config.Image}} ID={{.Image}}'
 
# Tag each image locally, then export it by tag so the archive keeps a name
# ("bms1-export" is a placeholder repository name, not an existing registry)
stamp=$(date +%Y%m%d)
for ctr in v32-prod-socket v32-prod-reso v41-prod; do
  docker tag "$(docker inspect "$ctr" --format '{{.Image}}')" "bms1-export/${ctr}:${stamp}"
  docker save "bms1-export/${ctr}:${stamp}" -o "/tmp/${ctr}-image-${stamp}.tar"
done
 
# Upload to Wasabi for safekeeping
# (requires AWS CLI configured with p24-infra Wasabi credentials)
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
  /tmp/v32-prod-socket-image-*.tar s3://p24-infra/docker-image-exports/bms-1/
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
  /tmp/v32-prod-reso-image-*.tar s3://p24-infra/docker-image-exports/bms-1/
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
  /tmp/v41-prod-image-*.tar s3://p24-infra/docker-image-exports/bms-1/

Acceptance criteria:

  • All three tar files confirmed in Wasabi under docker-image-exports/bms-1/
  • Verify images can be restored: docker load -i /tmp/v41-prod-image-*.tar
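
The "confirmed in Wasabi" criterion can be made stronger by recording checksums before upload and re-checking them after a download. The following is a minimal sketch, assuming GNU coreutils (sha256sum) is available; write_manifest and verify_manifest are hypothetical helper names, not existing scripts.

```shell
#!/usr/bin/env bash
# Sketch: record and verify SHA-256 checksums for exported image archives.
# write_manifest / verify_manifest are illustrative names (assumptions).

# Write a SHA256SUMS file covering every *.tar file in the given directory.
write_manifest() {
  local dir="$1"
  ( cd "$dir" && sha256sum -- *.tar > SHA256SUMS )
}

# Re-check every archive listed in SHA256SUMS; exits non-zero on any mismatch.
verify_manifest() {
  local dir="$1"
  ( cd "$dir" && sha256sum --check --quiet SHA256SUMS )
}
```

One possible use: move the three archives into a dedicated directory, run write_manifest on it, upload SHA256SUMS alongside the archives, and run verify_manifest on any machine that later downloads them before calling docker load.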

Task 1.2 — bms-1: Install automated off-server backup

Priority: P1-02
Effort: 4 hours
Risk: Medium (requires PostgreSQL and Redis audit to understand what data needs backing up)
Rollback: Stop the backup cron job; no service changes

Before any OS migration, an off-server backup must exist. This is a prerequisite for all other bms-1 Phase 1 work.

ssh root@94.23.26.113
 
# Audit what data exists
# PostgreSQL
sudo -u postgres psql -c '\l'    # list databases
sudo -u postgres psql -c '\du'   # list users
 
# Redis
redis-cli info keyspace          # which databases have keys
 
# PM2 NodeChat - what data does it use?
pm2 list
cat /temp/p24-v-3.2/ecosystem.config.js 2>/dev/null || ls /temp/p24-v-3.2/
 
# MongoDB dumps in /root - verify they are current and complete
ls -lh /root/w3-2026-02-05/ /root/w4-2026-02-23/ /root/w4-2026-02-24/

Backup script (/opt/p24-infra/scripts/backup-bms1.sh) — to be created:

#!/bin/bash
set -euo pipefail
BACKUP_DATE=$(date +%Y-%m-%d)
BACKUP_DIR="/tmp/bms1-backup-${BACKUP_DATE}"
S3_PREFIX="s3://p24-infra/bms-1-backups/${BACKUP_DATE}"
 
mkdir -p "$BACKUP_DIR"
 
# 1. PostgreSQL dump
sudo -u postgres pg_dumpall | gzip > "${BACKUP_DIR}/postgres-all.sql.gz"
 
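# 2. Redis dump (BGSAVE, wait for it to finish, then copy RDB)
redis-cli BGSAVE
# Wait until the background save has finished instead of assuming 5 seconds is enough
until redis-cli INFO persistence | grep -q '^rdb_bgsave_in_progress:0'; do sleep 1; done
cp /var/lib/redis/dump.rdb "${BACKUP_DIR}/redis-dump.rdb"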
 
# 3. Docker volumes for nginx-proxy certs (Let's Encrypt)
docker cp nginx-proxy:/etc/nginx/certs "${BACKUP_DIR}/nginx-certs" 2>/dev/null || true
 
# 4. Container environment files (redacted - just keys, no values)
for ctr in v42-prod s3-v42-prod mailgun-v42-prod; do
  docker inspect "$ctr" --format '{{range .Config.Env}}{{println .}}{{end}}' \
    | grep -oP '^[A-Z0-9_]+=' > "${BACKUP_DIR}/${ctr}-env-keys.txt" 2>/dev/null || true
done
 
# 5. Upload to Wasabi
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 sync \
  "${BACKUP_DIR}/" "${S3_PREFIX}/" --sse AES256
 
# 6. Cleanup
rm -rf "${BACKUP_DIR}"
 
echo "Backup complete: ${S3_PREFIX}"

Install the cron job (daily at 01:00 UTC); run once as root:

( crontab -l 2>/dev/null; echo "0 1 * * * /opt/p24-infra/scripts/backup-bms1.sh >> /var/log/bms1-backup.log 2>&1" ) | crontab -

Acceptance criteria:

  • First manual backup run completes without error
  • Files visible in Wasabi p24-infra/bms-1-backups/
  • PostgreSQL can be restored from dump in a test environment

Task 1.3 — bms-1: Emergency disk cleanup

Priority: P1-04
Effort: 2 hours
Risk: Medium — must not delete backups that are still the only copy of live data
Rollback: Cleanup is irreversible; verify each item before deletion

Disk is at 85% (354 GB of 440 GB). At 100%, running containers will crash on any write attempt.

Step 1 — Identify candidates (no deletions yet):

ssh root@94.23.26.113
 
# Largest directories
du -sh /root/*/  | sort -rh | head -20
du -sh /var/lib/docker/  | head -5
docker system df          # Docker image/volume space usage
 
# Log files
du -sh /var/log
ls -lh /var/log/
 
# Old Docker images not used by any container
docker images --filter "dangling=true" --format "{{.ID}} {{.Size}}"

Step 2 — Safe cleanup (after Task 1.2 backup is confirmed):

# Remove deprecated container
docker stop s3-v42-prod-02-25-old
docker rm s3-v42-prod-02-25-old
 
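# Remove dangling images (complete the Task 1.1 export first: the untagged
# production images are protected from pruning only while their containers exist)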
docker image prune -f
 
# Remove unused volumes
docker volume prune -f
 
# Archive the /root MongoDB dumps to Wasabi, then delete the local copies.
# Each dump goes to its own prefix so files with the same name do not overwrite each other.
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
  /root/w3-2026-02-05/ s3://p24-infra/bms-1-mongo-dumps/w3-2026-02-05/ --recursive
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
  /root/w4-2026-02-23/ s3://p24-infra/bms-1-mongo-dumps/w4-2026-02-23/ --recursive
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
  /root/w4-2026-02-24/ s3://p24-infra/bms-1-mongo-dumps/w4-2026-02-24/ --recursive
# After confirming the upload (for example with aws s3 ls --recursive on each prefix):
rm -rf /root/w3-2026-02-05 /root/w4-2026-02-23 /root/w4-2026-02-24
 
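# Configure logrotate for /var/log
# (if an existing /etc/logrotate.d entry already covers a matching file, logrotate
# reports a duplicate log entry; narrow the glob below in that case)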
cat > /etc/logrotate.d/bms1-custom << 'EOF'
/var/log/*.log {
  daily
  rotate 7
  compress
  delaycompress
  missingok
  notifempty
}
EOF
logrotate -f /etc/logrotate.d/bms1-custom

Target state: Disk below 60% (264 GB used). This leaves 176 GB of headroom.

Acceptance criteria:

  • df -h /dev/md127 shows < 70% used
  • All production containers still running after cleanup: docker ps | grep -v Exited
  • MongoDB dumps confirmed in Wasabi before local deletion
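
The disk criterion can be checked mechanically rather than by reading df output by eye. This is a minimal sketch that parses POSIX-format df output; df_used_pct and disk_below are hypothetical helper names, and the 70 in the usage note is the threshold from the acceptance criteria.

```shell
#!/usr/bin/env bash
# Sketch: extract the used percentage from `df -P` output and compare it to a limit.
# df_used_pct / disk_below are illustrative names (assumptions).

# Read `df -P` output on stdin; print the Capacity column of the first filesystem
# as a bare integer (the trailing % is stripped).
df_used_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only if the used percentage is strictly below the limit.
disk_below() {
  local pct="$1" limit="$2"
  [ "$pct" -lt "$limit" ]
}
```

Usage on bms-1 might look like: disk_below "$(df -P /dev/md127 | df_used_pct)" 70 && echo "disk OK". The -P flag keeps each filesystem on a single line, which is what the awk column lookup relies on.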

Task 1.4 — bms-1: OS Migration Planning (Ubuntu 20.04 → 24.04)

Priority: P1-01
Effort: Planning 4 hours; execution 8+ hours with maintenance window
Risk: CRITICAL — this is live production. In-place upgrade is NOT recommended.

Ubuntu 20.04 reached end-of-life in April 2025. An in-place do-release-upgrade on a server running 24 production containers with complex dependencies carries high risk of failure. The recommended approach is a parallel migration:

Recommended strategy: New server migration (not in-place upgrade)

  1. Provision a new OVH Kimsufi server with Ubuntu 24.04
  2. Migrate services one-by-one using blue-green DNS switching
  3. Keep bms-1 running as fallback for 2 weeks after migration
  4. Decommission bms-1 once all traffic confirmed on new server

Task 1.4 delivers: A detailed migration plan (tracked as a separate document). The actual migration executes in Phase 4 (Task 4.1).

Migration prerequisites checklist:

  • Task 1.1 complete (untagged images exported and safe)
  • Task 1.2 complete (backup running)
  • Task 1.3 complete (disk < 70%)
  • Port :8081 identified (Task 2.3)
  • PM2 NodeChat traffic audit complete (Task 2.3)
  • PostgreSQL and Redis data fully understood and backed up
  • GitLab runner → git-deploy pipeline documented
  • AWS ECR credentials confirmed available on new server
  • Private registry (private-registry.dev.pinbox24.com) accessibility verified for v3.x images

Key decisions required from Pinbox24 operators before migration:

  1. Is v3.x (w3.pinbox24.com) still serving active clients? Can it be migrated together with v4.x or after?
  2. Is NodeChat on :3001 live traffic? If yes, who owns it?
  3. What is the acceptable maintenance window duration for the migration?

Task 1.5 — bms-3: Add MongoDB memory alert

Priority: P1-05
Effort: 1 hour
Risk: None (monitoring addition only)
Rollback: Remove alert rule

Immediate mitigation while a longer-term RAM strategy is evaluated.

# On vps-i1, add alert to prometheus rules
ssh root@217.154.82.162
cat >> /opt/p24-infra/monitoring/prometheus/rules/infrastructure.yml << 'EOF'
 
  - alert: BMS3MongoDBHighMemory
    expr: |
      node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"} 
      / node_memory_MemTotal_bytes{server="p4-ovh-bms-3-ns3129867"} < 0.15
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "bms-3 free RAM below 15% — MongoDB OOM risk"
      description: "bms-3 has less than 15% free RAM. MongoDB PRIMARY at risk of OOM kill."
EOF

Note: This alert becomes active once bms-3 node_exporter is connected (Task 2.1). For now, manually monitor:

ssh ubuntu@51.68.155.224 "free -h && ps aux --sort=-%mem | head -5"
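
Note that the alert is based on MemAvailable, which is not the same as the "free" column printed by free -h. For a manual check that mirrors the alert expression, the ratio can be computed from /proc/meminfo. The sketch below is illustrative only; mem_available_pct is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: compute MemAvailable as a percentage of MemTotal, mirroring the
# alert expression MemAvailable / MemTotal < 0.15.
# mem_available_pct is an illustrative name (assumption).

# Read /proc/meminfo-format text on stdin; print an integer percentage (truncated).
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (avail * 100) / total }'
}
```

On bms-3 this could be run as mem_available_pct < /proc/meminfo; a result below 15 corresponds to the condition under which the alert would fire.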

Phase 1 Rollback Summary

Task | Rollback procedure
1.1 Image export | None needed (read-only). If Wasabi upload fails, re-run upload.
1.2 Backup script | Run crontab -e and remove the backup line. The script leaves services untouched.
1.3 Disk cleanup | MongoDB dumps: retrieve from Wasabi (aws s3 cp). Docker containers: already removed intentionally.
1.4 OS migration | Migration is on a new server; bms-1 is unchanged until DNS cut-over. DNS rollback: repoint DNS records back to the bms-1 IP within 60 seconds.
1.5 Alert | Remove the BMS3MongoDBHighMemory rule from infrastructure.yml and reload Prometheus.

Phase 1 Completion Criteria

  • bms-1 untagged images exported and confirmed in Wasabi
  • bms-1 daily backup running; first successful backup confirmed in Wasabi
  • bms-1 disk below 70%
  • bms-3 MongoDB memory alert defined (fires when < 15% free RAM)
  • bms-1 OS migration plan written and approved

5. Phase 2: Stability and High Risk

Goal: Eliminate all P2 issues. All servers stable with no single points of failure.
Target completion: Within 6 weeks of Phase 1 completion.


Task 2.1 — bms-2 and bms-3: Install node_exporter and connect to Prometheus

Priority: P2-07, P2-08
Effort: 1 hour
Risk: Low
Rollback: Stop and disable node_exporter service

# bms-2
ssh ubuntu@145.239.133.104
sudo apt-get update && sudo apt-get install -y prometheus-node-exporter
sudo systemctl enable --now prometheus-node-exporter
# Verify
curl http://localhost:9100/metrics | head -3
 
# bms-3
ssh ubuntu@51.68.155.224
sudo apt-get update && sudo apt-get install -y prometheus-node-exporter
sudo systemctl enable --now prometheus-node-exporter
curl http://localhost:9100/metrics | head -3

Add to monitoring/prometheus/prometheus.yml under the node job:

- targets: ['145.239.133.104:9100']    # p4-ovh-bms-2-ns3087638 — MongoDB rs0 observer + AI-Dev-OV1
  labels: { env: production, server_type: baremetal, server: p4-ovh-bms-2-ns3087638, location: ovh-fr }
- targets: ['51.68.155.224:9100']      # p4-ovh-bms-3-ns3129867 — MongoDB rs0 PRIMARY + staging
  labels: { env: production, server_type: baremetal, server: p4-ovh-bms-3-ns3129867, location: ovh-fr }

Reload Prometheus:

ssh root@217.154.82.162 "curl -sX POST http://localhost:9090/-/reload && echo OK"

Acceptance criteria:

  • up{job="node", server=~"p4-ovh-bms-[23]-.*"} shows value 1 in Prometheus
  • Grafana “Servers Overview” dashboard shows bms-2 and bms-3 metrics
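
Since the Task 1.5 alert depends on node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes, it may be worth confirming that the installed exporter actually exposes them. A minimal sketch, assuming the standard Prometheus text exposition format; has_metric is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: check that Prometheus exposition text contains a sample for a metric.
# has_metric is an illustrative name (assumption).

# Read metrics text on stdin; succeed if a sample line for the named metric exists.
# A sample line starts with the metric name followed by "{" (labels) or a space.
has_metric() {
  local name="$1"
  grep -Eq "^${name}(\{| )"
}
```

Usage on bms-2 or bms-3 could be: curl -s http://localhost:9100/metrics | has_metric node_memory_MemAvailable_bytes && echo "metric present". Comment lines (# HELP, # TYPE) are not counted as samples because they start with #.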

Task 2.2 — MongoDB port firewall hardening (bms-2 and bms-3)

Priority: P2-09, P2-10
Effort: 2 hours
Risk: Medium — incorrect firewall rules can break rs0 replication
Rollback: sudo ufw disable to immediately restore open access

MongoDB is bound to 0.0.0.0:27017 on both bms-2 and bms-3. Port 27017 should only accept connections from the three replica set members and the local machine.

rs0 member IPs that need MongoDB access:

Server | IP
bms-2 | 145.239.133.104
bms-3 | 51.68.155.224
bms-4 | 54.36.123.110

# bms-3: run as ubuntu with sudo
ssh ubuntu@51.68.155.224
 
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp comment 'SSH'
sudo ufw allow 80/tcp comment 'HTTP'
sudo ufw allow 443/tcp comment 'HTTPS'
# MongoDB: only from replica set members
sudo ufw allow from 145.239.133.104 to any port 27017 comment 'MongoDB bms-2'
sudo ufw allow from 54.36.123.110 to any port 27017 comment 'MongoDB bms-4'
sudo ufw allow from 127.0.0.1 to any port 27017 comment 'MongoDB localhost'
# node_exporter: only from Prometheus (vps-i1)
sudo ufw allow from 217.154.82.162 to any port 9100 comment 'Prometheus node_exporter'
# Enable (KEEP EXISTING SSH SESSION OPEN)
sudo ufw --force enable
sudo ufw status verbose

Apply the same rules on bms-2, but replace the rule that allows 145.239.133.104 (bms-2 itself) with one that allows 51.68.155.224 (bms-3).
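
To avoid hand-editing IPs per host, the MongoDB allow rules can be generated from the rs0 member list. The sketch below only prints the commands so they can be reviewed before being run; RS0_MEMBERS and mongo_ufw_rules are hypothetical names, and the IPs are the ones from the member list above.

```shell
#!/usr/bin/env bash
# Sketch: print the MongoDB ufw allow rules for one rs0 host.
# RS0_MEMBERS / mongo_ufw_rules are illustrative names (assumptions).

# name=ip pairs for the rs0 members
RS0_MEMBERS="bms-2=145.239.133.104 bms-3=51.68.155.224 bms-4=54.36.123.110"

# Print one allow rule per rs0 member, skipping the host the rules are for.
mongo_ufw_rules() {
  local self="$1" entry name ip
  for entry in $RS0_MEMBERS; do
    name="${entry%%=*}"   # text before the first "="
    ip="${entry#*=}"      # text after the first "="
    [ "$name" = "$self" ] && continue
    echo "sudo ufw allow from ${ip} to any port 27017 comment 'MongoDB ${name}'"
  done
}
```

For example, mongo_ufw_rules bms-2 prints the rules for bms-3 and bms-4; the output can be inspected and then pasted into the SSH session on bms-2.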

Critical: Keep an active SSH session open before running ufw enable. Open a second session immediately after to verify SSH still works.

Acceptance criteria:

  • sudo ufw status shows active with correct rules on both servers
  • rs0 replication continues: mongosh --eval 'rs.status()' on bms-3 shows all members healthy
  • Port 27017 is NOT reachable from an external IP: nmap -p 27017 51.68.155.224 from a machine not in the allowlist

Task 2.3 — bms-1: Investigate and document PM2 NodeChat and port 8081

Priority: P2-03, P2-04
Effort: 2 hours
Risk: Low (investigation only)

ssh root@94.23.26.113
 
# PM2 investigation
pm2 list
pm2 show 0    # details for NodeChat
cat /temp/p24-v-3.2/package.json
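# Confirm the listener on :3001, then look for established client connections
netstat -tlnp | grep 3001
netstat -tnp | grep ':3001' | grep ESTABLISHED || echo "No active connections"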
# Check nginx-proxy config for :3001
docker exec nginx-proxy cat /etc/nginx/conf.d/default.conf | grep -i 3001 || echo "Not in nginx"
 
# Port 8081 investigation
netstat -tlnp | grep 8081
lsof -i :8081
# Identify the listening process (restrict lsof to the listener so client PIDs are excluded)
ls -la /proc/$(lsof -t -iTCP:8081 -sTCP:LISTEN | head -1)/exe 2>/dev/null

Expected outcome: A documented entry in the bms-1 operations workbook for both services with:

  • What the service does
  • Whether it serves live traffic (via nginx-proxy or direct)
  • Whether it can be safely stopped
  • Migration plan if it is live traffic

Task 2.4 — bms-1: Set up Portainer v1 replacement or upgrade

Priority: P2-01
Effort: 2 hours
Risk: Low — Portainer is a management tool, not a production service
Rollback: docker start portainer-pinbox24

Portainer v1 is over 5 years old and has known CVEs. Port 49154 is exposed to the internet.

ssh root@94.23.26.113
 
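# Stop Portainer v1 but keep the stopped container until the replacement is verified
# (the rollback, docker start portainer-pinbox24, needs it)
docker stop portainer-pinbox24
# docker rm portainer-pinbox24   # run only after the replacement is confirmed working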
 
# Option A: Install Portainer CE (v2+, free)
docker volume create portainer_data
docker run -d -p 49154:9000 --name portainer \
  --restart=unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v portainer_data:/data \
  portainer/portainer-ce:latest
 
# Option B: Add nginx-proxy auth and restrict to localhost only (more secure)
# Then access via SSH tunnel: ssh -L 49154:localhost:49154 root@94.23.26.113

Recommended: Option B (localhost only with SSH tunnel) provides better security while preserving access. Option A upgrades functionality but keeps the port open.
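
Option B is only described in comments above. One way it could look is the same Portainer CE container published on the loopback interface only, so the UI is reachable through the SSH tunnel but not from the internet. This is a sketch under assumptions: run_portainer_local and the DRY_RUN switch are hypothetical, and the nginx-proxy auth part of Option B is not covered.

```shell
#!/usr/bin/env bash
# Sketch: start Portainer CE bound to 127.0.0.1 only (Option B, without nginx auth).
# run_portainer_local and DRY_RUN are illustrative names (assumptions).

# With DRY_RUN=1 (the default) the command is printed for review;
# with DRY_RUN=0 it is executed.
run_portainer_local() {
  local runner="echo"
  [ "${DRY_RUN:-1}" = "0" ] && runner=""
  $runner docker run -d -p 127.0.0.1:49154:9000 --name portainer \
    --restart=unless-stopped \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v portainer_data:/data \
    portainer/portainer-ce:latest
}
```

After reviewing the printed command, DRY_RUN=0 run_portainer_local would start the container; access then goes through the tunnel already mentioned above, ssh -L 49154:localhost:49154 root@94.23.26.113.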


Task 2.5 — bms-4: Complete n8n migration from vps-h1

Priority: P2-11
Effort: 4 hours
Risk: Medium — n8n workflows are live automation; must verify all workflows intact
Rollback: Repoint WAHA webhook back to n8n.vps-h1.infra.zintegrowana.online

Follow the migration procedure documented in docs/servers/p4-ovh-bms-4-ns3101999-operations.md under “n8n Migration from vps-h1”.

Pre-migration checklist:

  • bms-4 docker-compose.yml deployed to server
  • .env from vps-h1 copied to bms-4
  • bms-4 volumes created and data restored
  • n8n accessible at https://n8n.bms-4.infra.zintegrowana.online
  • All workflows visible and enabled in n8n UI
  • Test execution of at least one workflow manually

Post-migration:

  • WAHA webhook URL updated
  • n8n on vps-h1 stopped (NOT removed for 7 days as fallback)
  • Monitoring: n8n health check added to Prometheus/Grafana

Task 2.6 — bms-4: Complete MongoDB arbiter join (HUMAN ACTION REQUIRED)

Priority: P2-12
Effort: 30 minutes
Risk: Low — adding an arbiter does not affect data, only quorum
Rollback: rs.remove("54.36.123.110:27017") on bms-3

# Connect to bms-3 (holds MongoDB admin credentials)
ssh ubuntu@51.68.155.224
mongosh -u admin -p "$MONGODB_ADMIN_PASSWORD" --authenticationDatabase admin
 
# In mongosh:
rs.addArb("54.36.123.110:27017")
rs.remove("51.83.132.99:27017")   # remove dead arbiter
rs.status()
# Expected: 3 members — 1 PRIMARY (bms-3), 1 SECONDARY (bms-2, non-voting), 1 ARBITER (bms-4)
exit

Note: MongoDB admin password must be obtained from Pinbox24 application secrets or AWS Secrets Manager. This task requires human action (Claude does not have the MongoDB admin password).


Task 2.7 — bms-3: Evaluate MongoDB RAM usage and capacity planning

Priority: P1-05 mitigation
Effort: 2 hours analysis
Risk: None (analysis only)

With MongoDB using 21.7 GB of 32 GB on bms-3, the remaining 10.3 GB is shared between the OS, Docker daemon, and 11 containers.

ssh ubuntu@51.68.155.224
free -h
docker stats --no-stream --format "table {{.Container}}\t{{.MemUsage}}"

Options to evaluate:

Option | Pros | Cons
Set WiredTiger cache limit | Immediate RAM reduction | May impact MongoDB query performance
Move staging containers to bms-4 | Separates workloads cleanly | Requires staging URLs reconfiguration
Upgrade bms-3 to higher-RAM OVH server | No application changes | OVH migration cost and downtime
Promote bms-2 to PRIMARY, move workload | Distributes load | Complex rs0 reconfiguration

Recommended: Set WiredTiger cache limit on bms-3 as immediate relief, then migrate staging containers to bms-4 (which has 1.8 TB disk and 32 GB RAM with only ~75 MB used by MongoDB arbiter).

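# Immediate: limit WiredTiger cache to 8 GB
# (default is the larger of 50% of (RAM - 1 GB) or 256 MB, about 15.5 GB on a 32 GB host)
# Edit /etc/mongod.conf on bms-3. If cacheSizeGB is already set, update it in place:
sudo sed -i '/wiredTiger:/,/^[^ ]/{s/cacheSizeGB:.*/cacheSizeGB: 8/}' /etc/mongod.conf
# If it is not present, add the keys below under the EXISTING storage: section by hand.
# Do not append a second top-level storage: key: duplicate keys are invalid YAML and
# may be rejected or override the existing storage settings.
#   storage:
#     wiredTiger:
#       engineConfig:
#         cacheSizeGB: 8
# Restarting the PRIMARY triggers an rs0 election; do this in a planned window
sudo systemctl restart mongod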

Acceptance criteria:

  • db.serverStatus().wiredTiger.cache["maximum bytes configured"] in mongosh shows ~8 GB
  • Free RAM on bms-3 increases to > 15 GB after restart
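
The serverStatus field reports the cache size in bytes, so a small conversion helps when comparing it with the 8 GB target. A minimal sketch, assuming cacheSizeGB is interpreted in binary gigabytes (1024^3 bytes); bytes_to_gib is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: convert a byte count to GiB with one decimal place.
# bytes_to_gib is an illustrative name (assumption).

bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}
```

For example, the value printed by db.serverStatus().wiredTiger.cache["maximum bytes configured"] in an authenticated mongosh session can be passed to bytes_to_gib to check that it is close to 8.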

Phase 2 Rollback Summary

Task | Rollback procedure
2.1 node_exporter | systemctl stop prometheus-node-exporter
2.2 UFW hardening | ufw disable restores open access immediately
2.3 PM2 / port 8081 | Investigation only, no rollback needed
2.4 Portainer upgrade | docker stop portainer && docker run ... portainer/portainer:latest (v1 image)
2.5 n8n migration | Update WAHA webhook back to vps-h1; restart n8n on vps-h1
2.6 MongoDB arbiter | rs.remove("54.36.123.110:27017") from bms-3 mongosh
2.7 WiredTiger limit | Remove cacheSizeGB from mongod.conf and restart

Phase 2 Completion Criteria

  • bms-2 and bms-3 visible in Prometheus and Grafana
  • Port 27017 not reachable from internet on bms-2 and bms-3
  • PM2 NodeChat service fully documented and decision made (keep/decommission)
  • Port 8081 identified and documented
  • Portainer v1 replaced on bms-1
  • n8n migrated to bms-4 and stable for 7 days
  • rs0 has 3 functioning members including bms-4 arbiter
  • bms-3 MongoDB cache limited to 8 GB; free RAM > 15 GB

6. Phase 3: Hardening

Goal: Implement security best practices across all servers. No blocking P3 issues.
Target completion: Within 12 weeks of Phase 1 completion.


Task 3.1 — ALL: Deploy fail2ban and SSH hardening via Ansible

Priority: P3-01, P3-03, P3-04
Effort: 1 day (follows spec docs/improvements/09-ssh-hardening.md)

The SSH hardening spec is already written. For BMS servers, adapt the VPS procedure:

For bms-3 and bms-2 (ubuntu user access):

# Install fail2ban
ssh ubuntu@51.68.155.224
sudo apt-get update && sudo apt-get install -y fail2ban
# sudo does not apply to a shell redirection, so write the file through sudo tee
sudo tee /etc/fail2ban/jail.local > /dev/null << 'EOF'
[sshd]
enabled = true
port = 22
filter = sshd
logpath = /var/log/auth.log
maxretry = 5
bantime = 3600
findtime = 600
EOF
sudo systemctl enable --now fail2ban
sudo fail2ban-client status sshd

For bms-1 (root-only access) — apply after Phase 2 (migration planned):

Given bms-1 is being migrated, defer SSH hardening to the new server setup.

For bms-4 (root-only access):

ssh root@54.36.123.110
apt-get update && apt-get install -y fail2ban
# Create claude-admin user for non-root access
useradd -m -s /bin/bash claude-admin
mkdir -p /home/claude-admin/.ssh
# Install VPS_SSH_PRIVATE_KEY public key
echo "<VPS_SSH_PRIVATE_KEY public part>" > /home/claude-admin/.ssh/authorized_keys
chmod 700 /home/claude-admin/.ssh && chmod 600 /home/claude-admin/.ssh/authorized_keys
chown -R claude-admin:claude-admin /home/claude-admin/.ssh
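echo "claude-admin ALL=(ALL) NOPASSWD: /usr/bin/docker, /bin/systemctl, /bin/mkdir, /bin/chown, /bin/cp, /usr/bin/tee" \
  | tee /etc/sudoers.d/claude-admin
chmod 440 /etc/sudoers.d/claude-admin
visudo -cf /etc/sudoers.d/claude-admin   # validate syntax before relying on it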
# Test claude-admin access BEFORE disabling root

Task 3.2 — ALL: Configure unattended-upgrades

Priority: P3-02
Effort: 30 minutes per server
Risk: Low — unattended-upgrades is configured to only apply security updates, not dist-upgrades

# bms-2, bms-3, bms-4 (Ubuntu)
sudo apt-get install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades   # interactive: select Yes
# Verify
sudo unattended-upgrades --dry-run --debug 2>&1 | head -20
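
The dpkg-reconfigure step is interactive. For scripted rollout, the same result is usually achieved by writing the APT periodic settings file directly. The sketch below assumes the conventional Debian/Ubuntu path /etc/apt/apt.conf.d/20auto-upgrades; write_auto_upgrades is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: enable daily package list refresh and unattended upgrades non-interactively.
# write_auto_upgrades is an illustrative name (assumption).

# Write the APT periodic settings to the given path
# (defaults to the conventional 20auto-upgrades location; needs root for that path).
write_auto_upgrades() {
  local target="${1:-/etc/apt/apt.conf.d/20auto-upgrades}"
  cat > "$target" << 'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
}
```

Run as root with no argument on bms-2, bms-3, and bms-4, then confirm with the dry-run command above. Passing a path argument writes the file elsewhere, which is useful for reviewing the content first.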

Task 3.3 — bms-3: Deploy staging containers to bms-4

Priority: P3-06 (disk fill risk) + P1-05 mitigation (RAM)
Effort: 4 hours
Risk: Medium — requires updating all staging URLs and testing

Moving Pinbox24 staging containers (v31/v32/v41/v42) from bms-3 to bms-4 would:

  • Free ~10+ GB RAM on bms-3 for MongoDB
  • Use bms-4’s 1.8 TB disk (currently only 1% used)
  • Separate MongoDB PRIMARY from staging workloads

# 1. On bms-3: export each staging container's environment variables
docker inspect v42-stage --format '{{json .Config.Env}}' > /tmp/v42-stage-env.json
docker inspect v41-stage --format '{{json .Config.Env}}' > /tmp/v41-stage-env.json
# ... repeat for all staging containers
 
# 2. Create bms-4/staging-docker-compose.yml in the repo
# 3. Deploy on bms-4
# 4. Update DNS: *.staging.pinbox24.com → 54.36.123.110
# 5. Verify staging works on new server
# 6. Stop staging containers on bms-3 after verification period

Task 3.4 — bms-1: Configure logrotate

Priority: P3-10
Effort: 30 minutes
Risk: None

ssh root@94.23.26.113
 
# Check current logrotate config
logrotate -d /etc/logrotate.conf 2>&1 | tail -20
 
# Fix large /var/log
cat > /etc/logrotate.d/bms1-nginx << 'EOF'
/var/log/nginx/*.log {
  daily
  rotate 14
  compress
  delaycompress
  missingok
  notifempty
  sharedscripts
  postrotate
    docker exec nginx-proxy nginx -s reopen 2>/dev/null || true
  endscript
}
EOF
 
# Force immediate rotation
logrotate -f /etc/logrotate.d/bms1-nginx

Task 3.5 — ALL: Docker image CVE scanning with Trivy

Priority: P3-08, P3-09, P3-12
Effort: 1 day (follows spec docs/improvements/08-image-cve-scanning.md)

For bms-1 specifically — the v3.x images are 4–5 years old and will have critical CVEs. The scan results inform whether to emergency-upgrade or migrate clients off v3.x.

# Install Trivy on bms-1
ssh root@94.23.26.113
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh \
  | sh -s -- -b /usr/local/bin v0.51.0
 
# Scan all running container images
docker ps --format '{{.Image}}' | sort -u | while read -r img; do
  echo "=== Scanning: $img ==="
  trivy image --severity HIGH,CRITICAL "$img" 2>/dev/null | tail -20
done

Task 3.6 — bms-2 and bms-3: Create claude-admin user

Priority: P3-13, P3-14
Effort: 30 minutes per server

Follow the claude-admin setup procedure already documented in each server’s operations workbook. This enables Claude agent access for automated tasks.


Phase 3 Completion Criteria

  • fail2ban active and configured on bms-2, bms-3, bms-4
  • unattended-upgrades enabled on bms-2, bms-3, bms-4
  • bms-2, bms-3, and bms-4 each have a claude-admin user
  • Trivy scan complete on bms-1; CVE report reviewed; CRITICAL-severity CVEs addressed
  • logrotate configured on bms-1; /var/log < 5 GB
  • Staging containers migrated to bms-4 OR bms-3 MongoDB cache limited (at least one RAM remedy deployed)
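
Two of the criteria above can be checked per server with a short script. This is a sketch under assumptions: it presumes fail2ban runs as a systemd unit named fail2ban and that unattended-upgrades is switched on through the standard /etc/apt/apt.conf.d/20auto-upgrades file. Both function names are illustrative.

```shell
# fail2ban_active: succeeds when the fail2ban systemd unit is active.
# Assumes the unit is named "fail2ban".
fail2ban_active() {
  systemctl is-active --quiet fail2ban
}

# unattended_upgrades_enabled [FILE]: succeeds when FILE sets
# APT::Periodic::Unattended-Upgrade to a non-zero value.
# FILE defaults to the standard Ubuntu location.
unattended_upgrades_enabled() {
  local conf="${1:-/etc/apt/apt.conf.d/20auto-upgrades}"
  grep -Eq '^[[:space:]]*APT::Periodic::Unattended-Upgrade[[:space:]]+"[1-9][0-9]*";' \
    "$conf" 2>/dev/null
}
```

Usage on each of bms-2, bms-3, and bms-4: fail2ban_active && unattended_upgrades_enabled && echo "Phase 3 hardening checks pass".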

7. Phase 4: Modernization

Goal: All servers on latest LTS. All software current. No legacy technical debt blocking new work.
Target completion: Before February 2027 (3 months before Ubuntu 22.04 EOL in April 2027).


Task 4.1 — bms-1: Complete OS migration to Ubuntu 24.04

Priority: P4 (planning in Phase 1, execution here)
Effort: 8–16 hours depending on complexity
Risk: HIGH — live production migration

This is the migration designed in Phase 1 Task 1.4. Key steps:

  1. Provision new OVH server with Ubuntu 24.04
  2. Install Docker CE (current), nginx-proxy, AWS ECR authentication
  3. Deploy v4.x stack (containers from ECR — cleanest)
  4. Deploy v3.x stack (from exported images — Task 1.1 export)
  5. Test all services in staging configuration
  6. Schedule maintenance window with Pinbox24 operators
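  7. DNS cut-over with short TTL (60 seconds; lower the TTL at least one full old-TTL period before the window so resolvers have already picked it up)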
  8. Monitor for 24 hours on new server
  9. Decommission old bms-1 (keep for 7 days as fallback)
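
Before the maintenance window in step 6, it is worth confirming that the 60-second TTL from step 7 is actually being served. A rough sketch using dig follows; the function names are hypothetical, and a recursive resolver reports the remaining cache time, so query an authoritative nameserver (dig @ns ...) when the configured value is needed.

```shell
# dns_ttl NAME: print the TTL (seconds) of the first answer record for NAME.
# Hypothetical helper; parses the second column of `dig +noall +answer`.
dns_ttl() {
  dig +noall +answer "$1" | awk 'NR == 1 { print $2 }'
}

# ttl_at_most NAME MAX: succeeds when the served TTL is MAX seconds or less.
ttl_at_most() {
  local ttl
  ttl=$(dns_ttl "$1")
  [ -n "$ttl" ] && [ "$ttl" -le "$2" ]
}
```

Usage: ttl_at_most some-production-domain 60 || echo "TTL still too high; postpone cut-over" (the domain name here is a placeholder).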

Zero-downtime strategy:

  • Use nginx-proxy on both old and new server simultaneously
  • Cut over one domain at a time, verifying each before proceeding
  • Start with lowest-traffic service (v3.x staging variants)
  • End with v42-prod (highest traffic, last to switch)
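
"Verifying each before proceeding" can be done before DNS changes at all by forcing a request for the domain to the new server. The sketch below uses curl's --resolve option; the function names are hypothetical, the new server's IP is a parameter because it is not yet known, and treating any 2xx or 3xx status as healthy is an assumption that may be too lenient for some services.

```shell
# http_status_via DOMAIN IP: print the HTTP status of https://DOMAIN/
# when DOMAIN is forced to resolve to IP (DNS is bypassed).
http_status_via() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    --resolve "$1:443:$2" "https://$1/"
}

# healthy_on DOMAIN IP: succeeds on a 2xx or 3xx response.
# Assumption: 2xx/3xx means "healthy"; adjust per service.
healthy_on() {
  case "$(http_status_via "$1" "$2")" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: healthy_on DOMAIN NEW_SERVER_IP && echo "safe to switch DNS for DOMAIN", repeated per domain in the cut-over order above.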

Task 4.2 — bms-3 and bms-4: Plan Ubuntu 22.04 → 24.04 upgrade

Priority: P4-01, P4-02
Effort: 4 hours planning; execution timed around MongoDB maintenance
Timeline: Must complete before April 2027

For bms-3 (MongoDB PRIMARY), the upgrade strategy requires careful rs0 coordination:

Task 4.2 procedure:

  1. Promote bms-2 to PRIMARY (rs.stepDown() on bms-3)
  2. Perform do-release-upgrade on bms-3 (Ubuntu 22.04 → 24.04), or provision a new bms-3-replacement and migrate
  3. Re-join bms-3 to rs0 as SECONDARY
  4. Verify rs0 healthy for 24h
  5. Perform same upgrade on bms-4 (lower risk: arbiter only)

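# Step 1: Force PRIMARY stepdown to bms-2 (run from bms-3)
# Note: bms-2 is an observer (non-voting, priority 0). MongoDB rejects a
# priority above 0 on a member with votes: 0, so raise votes and priority
# together. Check that the number of voting members stays odd afterwards.
mongosh -u admin -p "$MONGODB_ADMIN_PASSWORD" --authenticationDatabase admin --eval '
  const cfg = rs.conf();
  cfg.members[1].votes = 1;     // bms-2 index in members array - verify first
  cfg.members[1].priority = 2;  // higher than bms-3 so bms-2 wins the election
  rs.reconfig(cfg);
  rs.stepDown();
'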

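Note: bms-2 is configured as a non-voting observer (votes 0, priority 0), and a non-voting member cannot be elected PRIMARY. With bms-2 as the only other data-bearing member plus one arbiter, the failover only works once bms-2 has been given a vote and a non-zero priority. Verify the rs0 configuration with rs.conf() before executing, and decide in advance whether bms-2 returns to its observer settings after bms-3 rejoins.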
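
Because the reconfiguration addresses bms-2 by its position in the members array, it is safer to look the index up than to assume it. The sketch below lists each member's index, host, votes, priority, and arbiter flag, then finds the index by host pattern. The function names are hypothetical, and the pattern passed to member_index must match how bms-2 is actually named in rs.conf(), which this document does not state.

```shell
# rs_members: print one tab-separated line per rs0 member:
# index, host, votes, priority, arbiterOnly.
rs_members() {
  mongosh --quiet -u admin -p "$MONGODB_ADMIN_PASSWORD" \
    --authenticationDatabase admin --eval '
      rs.conf().members.forEach((m, i) =>
        print([i, m.host, m.votes, m.priority, m.arbiterOnly].join("\t")));
    '
}

# member_index PATTERN: print the array index of the first member whose
# host matches PATTERN (an awk regular expression).
member_index() {
  rs_members | awk -F'\t' -v pat="$1" '$2 ~ pat { print $1; exit }'
}
```

Usage: run rs_members first and read the output, then use the value from member_index in place of the hard-coded 1 in the reconfiguration.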


Task 4.3 — bms-2: Install Claude Code (AI-Dev-OV1)

Priority: P4-05
Effort: 2 hours

ssh ubuntu@145.239.133.104
 
# Install Claude Code
curl -fsSL https://claude.ai/install.sh | bash
# Or follow provision-vps playbook
 
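# Copy OAuth credentials (run these three commands from the local machine,
# not from the bms-2 session); avoid staging the file in /tmp
ssh ubuntu@145.239.133.104 "mkdir -p ~/.claude && chmod 700 ~/.claude"
scp ~/.claude/.credentials.json ubuntu@145.239.133.104:.claude/.credentials.json
ssh ubuntu@145.239.133.104 "chmod 600 ~/.claude/.credentials.json"
 
# Verify (on bms-2)
claude --version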

Task 4.4 — bms-1 (new server): Docker version upgrade

Priority: P4-06
Effort: 0 additional hours (handled during Task 4.1 new server provisioning)

New bms-1 server will be provisioned with Docker CE current (29.x). The legacy Docker 24 on old bms-1 does not need upgrading — the server is being replaced.


Task 4.5 — bms-1: v3.x sunset planning

Priority: P4-03, P4-04
Effort: 2 hours planning + ongoing client migration

# Identify active v3.x clients (check nginx logs)
ssh root@94.23.26.113
docker exec nginx-proxy tail -1000 /var/log/nginx/access.log \
  | grep 'w3.pinbox24.com' | awk '{print $7}' | sort | uniq -c | sort -rn | head -20

Output informs whether v3.x still has active traffic. If yes, a client migration plan is required. If no, v3.x containers can be stopped during the bms-1 OS migration.
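
The command above counts request paths; to see which clients are still calling v3.x, the same pipeline can count a different log field. A small sketch follows, with the caveat that which whitespace-separated field holds the client address depends on the nginx log format in use (field 1 in the stock combined format, which is an assumption here). The function name top_values is hypothetical.

```shell
# top_values FIELD [N]: read log lines on stdin and print the N most
# frequent values of whitespace-separated FIELD (default N = 20),
# most frequent first, in `uniq -c` format.
top_values() {
  awk -v f="$1" '{ print $f }' | sort | uniq -c | sort -rn | head -n "${2:-20}"
}
```

Usage: pipe the grep output from the command above into top_values 1 to rank client addresses, or top_values 7 to reproduce the per-path count.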


Phase 4 Completion Criteria

  • bms-1 running on Ubuntu 24.04; all production services migrated and stable
  • Old bms-1 (94.23.26.113) decommissioned
  • bms-3 running on Ubuntu 24.04 or upgrade scheduled with rs0 maintenance plan
  • bms-4 running on Ubuntu 24.04 or upgrade scheduled
  • bms-2 Claude Code installed; AI-Dev-OV1 operational
  • v3.x sunset plan documented and client migration in progress

8. Server-by-Server Modernization Checklist

bms-1 (94.23.26.113) — Pinbox24 Production

| Phase | Task | Status |
| --- | --- | --- |
| P1 | Export untagged Docker images to Wasabi | [ ] Pending |
| P1 | Set up off-server backup (PostgreSQL, Redis, Docker config) | [ ] Pending |
| P1 | Disk cleanup, target below 70% | [ ] Pending |
| P1 | OS migration plan written | [ ] Pending |
| P2 | PM2 NodeChat investigated and documented | [ ] Pending |
| P2 | Port 8081 identified and documented | [ ] Pending |
| P2 | Portainer v1 replaced | [ ] Pending |
| P2 | Deprecated container s3-v42-prod-02-25-old removed | [ ] Pending |
| P3 | logrotate configured for /var/log | [ ] Pending |
| P3 | Trivy CVE scan run; results reviewed | [ ] Pending |
| P4 | OS migration to Ubuntu 24.04 (new server) | [ ] Pending |
| P4 | v3.x sunset plan completed | [ ] Pending |

bms-2 (145.239.133.104) — MongoDB Observer + AI-Dev-OV1

| Phase | Task | Status |
| --- | --- | --- |
| P2 | node_exporter installed and in Prometheus | [ ] Pending |
| P2 | UFW firewall configured (port 27017 restricted) | [ ] Pending |
| P3 | fail2ban installed and active | [ ] Pending |
| P3 | unattended-upgrades enabled | [ ] Pending |
| P3 | claude-admin user created | [ ] Pending |
| P4 | Claude Code installed (AI-Dev-OV1 active) | [ ] Pending |
| P4 | OS currency confirmed (already on Ubuntu 24.04 LTS per the Security Scorecard; no 22.04 → 24.04 upgrade required) | [ ] Pending |

bms-3 (51.68.155.224) — MongoDB PRIMARY + Staging

| Phase | Task | Status |
| --- | --- | --- |
| P1 | MongoDB memory alert added to Prometheus rules | [ ] Pending |
| P2 | node_exporter installed and in Prometheus | [ ] Pending |
| P2 | UFW firewall configured (port 27017 restricted) | [ ] Pending |
| P2 | MongoDB WiredTiger cache limited to 8 GB | [ ] Pending |
| P3 | fail2ban installed and active | [ ] Pending |
| P3 | unattended-upgrades enabled | [ ] Pending |
| P3 | claude-admin user created | [ ] Pending |
| P3 | Staging containers migrated to bms-4 (RAM relief) | [ ] Pending |
| P4 | Ubuntu 22.04 → 24.04 upgrade (with rs0 maintenance) | [ ] Pending |

bms-4 (54.36.123.110) — MongoDB Arbiter + Docker Host

| Phase | Task | Status |
| --- | --- | --- |
| P2 | rs.addArb completed (HUMAN ACTION REQUIRED) | [ ] Pending |
| P2 | n8n migrated from vps-h1 | [ ] Pending |
| P2 | docker-compose.yml deployed to server | [ ] Pending |
| P3 | fail2ban installed and active | [ ] Pending |
| P3 | unattended-upgrades enabled | [ ] Pending |
| P3 | claude-admin user created | [ ] Pending |
| P4 | Ubuntu 22.04 → 24.04 upgrade | [ ] Pending |

9. Risk Register

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| bms-1 disk reaches 100% before cleanup completes | High | Critical: production outage | Emergency: docker system prune -f frees image cache immediately (~20 GB typical). It also deletes stopped containers and dangling images, so first confirm the three untagged-image containers are running or already exported (Task 1.1). Then execute Task 1.3. |
| bms-3 MongoDB OOM kill during staging load test | Medium | High: rs0 election, potential write downtime | WiredTiger cache limit (Task 2.7) reduces this. Alert (Task 1.5) provides warning. |
| Untagged image container (v32-prod-socket etc.) stops before export | Medium | Critical: cannot restart | Freeze policy: do not restart bms-1 for any reason until Task 1.1 is complete. |
| bms-1 OS migration fails mid-flight | Low | Critical: production down | Use new-server migration (not in-place). Rollback = DNS repoint back to old bms-1 (TTL 60s). |
| rs.addArb (bms-4) fails due to keyFile mismatch | Low | Medium: arbiter not joining | keyFile md5 verified before attempt; fallback: rs0 runs with 2 data members and no arbiter (no immediate data risk). |
| Wasabi credentials unavailable for image export | Low | High: Task 1.1 blocked | Use SCP to local machine as backup: scp 'root@94.23.26.113:/tmp/*.tar' /tmp/ |
| SSH lockout during UFW hardening | Low | High: server inaccessible | Keep active session open; OVH IPMI/KVM access (server ID 1823494) as recovery path. |
| Docker pull fails on v3.x images from private registry | Medium | High: v3.x cannot be recovered if containers stop | Complete Task 1.1 export before any v3.x container restarts. |
| bms-3 Ubuntu 22.04 EOL surprise (patches stop April 2027) | Certain (known date) | Medium: same as bms-1 today | Phase 4 planned before that date. |
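
The emergency docker system prune -f in the first row interacts with the freeze policy in the third: prune removes stopped containers and dangling (untagged, unreferenced) images, so an untagged-image container that happens to be stopped would lose its only image. A guard along these lines could be run first; it is a sketch, and the helper name all_running is hypothetical.

```shell
# all_running NAME...: succeeds only if every named container appears in
# `docker ps` (running containers). Prints the first missing name to stderr.
all_running() {
  local running name
  running=$(docker ps --format '{{.Names}}')
  for name in "$@"; do
    if ! printf '%s\n' "$running" | grep -qxF -- "$name"; then
      echo "not running: $name" >&2
      return 1
    fi
  done
}
```

Usage: all_running v32-prod-socket v32-prod-reso v41-prod && docker system prune -f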

10. Success Criteria

Phase 1 Complete

  • All P1 issues resolved on all servers
  • bms-1 has verified off-server backup in Wasabi
  • bms-1 disk below 70%
  • bms-1 untagged images safe in Wasabi
  • bms-3 MongoDB memory alert active in Grafana

Phase 2 Complete

  • All P2 issues resolved on all servers
  • bms-2 and bms-3 visible in Prometheus and Grafana
  • MongoDB port 27017 not accessible from internet on any server
  • rs0 has 3 members including bms-4 arbiter
  • n8n running on bms-4, stable for 7 days
  • bms-3 has > 15 GB free RAM with MongoDB running

Phase 3 Complete

  • fail2ban active on all servers (bms-2, bms-3, bms-4; bms-1 new server)
  • unattended-upgrades enabled on all servers
  • Trivy scan complete; no unmitigated CRITICAL CVEs in production containers
  • claude-admin user set up on bms-2, bms-3, bms-4
  • All P3 issues resolved

Phase 4 Complete

  • bms-1 running on Ubuntu 24.04 (new server)
  • bms-3 running on Ubuntu 24.04 or upgrade scheduled with written plan
  • bms-4 running on Ubuntu 24.04 or upgrade scheduled with written plan
  • bms-2 Claude Code installed and AI-Dev-OV1 operational
  • All BMS servers on supported OS versions
  • No server within 6 months of OS EOL without a migration plan
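
The last criterion is simple date arithmetic and can be checked mechanically. The sketch below needs GNU date (for -d); the function names are hypothetical, and because this plan gives only "April 2027" for Ubuntu 22.04, the exact day used in the usage line (2027-04-30) is an assumption.

```shell
# days_until DATE [TODAY]: print whole days from TODAY (default: now) to DATE.
# Requires GNU date; dates are interpreted in UTC.
days_until() {
  local target today
  target=$(date -u -d "$1" +%s) || return 1
  today=$(date -u -d "${2:-today}" +%s) || return 1
  echo $(( (target - today) / 86400 ))
}

# within_days DATE N [TODAY]: succeeds when DATE is N days away or less.
within_days() {
  local d
  d=$(days_until "$1" "$3") || return 1
  [ "$d" -le "$2" ]
}
```

Usage: within_days 2027-04-30 183 && echo "inside the 6-month window: migration plan required". Measured from this document's date (2026-06-14), days_until 2027-04-30 2026-06-14 prints 320.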

Overall Modernization Complete

  • infra_docs_check audit action passes for all 4 BMS servers
  • All servers have compliance_workbook = 'yes' in dev_r_services
  • Zero P1 or P2 issues open in the priority issues register above

Appendix: Effort Summary

| Phase | Estimated effort | Key dependencies | Human action required |
| --- | --- | --- | --- |
| Phase 1 | 2 days | None | Pinbox24 maintenance window coordination |
| Phase 2 | 3 days | Phase 1 complete | MongoDB admin password for rs.addArb |
| Phase 3 | 2 days | Phase 2 complete | None |
| Phase 4 | 3–5 days | Phase 3 complete; Task 1.4 design approved | DNS cut-over decision for bms-1 migration |
| Total | 10–12 days | | |

Related Documents

  • docs/servers/p4-ovh-bms-1-ns367522-operations.md — bms-1 operations workbook
  • docs/servers/p4-ovh-bms-2-ns3087638-operations.md — bms-2 operations workbook
  • docs/servers/p4-ovh-bms-3-ns3129867-operations.md — bms-3 operations workbook
  • docs/servers/p4-ovh-bms-4-ns3101999-operations.md — bms-4 operations workbook
  • docs/improvements/01-backups.md — backup spec (applies to BMS servers)
  • docs/improvements/08-image-cve-scanning.md — CVE scanning spec
  • docs/improvements/09-ssh-hardening.md — SSH hardening spec (adapt for BMS)
  • bms-4/docker-compose.yml — bms-4 Docker Compose configuration