BMS Modernization Plan
Date: 2026-06-14
Scope: All OVH Kimsufi bare metal servers — bms-1 through bms-4
Author: Claude Code (p24-infra admin role)
Status: Active planning document — update as phases complete
1. Executive Summary
Five critical findings require immediate action before any new work proceeds:
- bms-1: Ubuntu 20.04 has been EOL since April 2025 and is running live production. No security patches for 14+ months. Every known CVE published since April 2025 is unmitigated on the server hosting Pinbox24 production traffic.
- bms-1: No off-server backup exists. 24 production Docker containers, a PostgreSQL database, a Redis instance, and a PM2 service have zero off-site backup. A single disk failure, ransomware event, or OVH hardware incident would cause unrecoverable data loss.
- bms-1: Three containers (v32-prod-socket, v32-prod-reso, v41-prod) run from locally stored untagged images with no registry source. If these containers stop for any reason (host reboot, OOM kill, Docker daemon restart), they cannot be restarted. They are live production services.
- bms-3: MongoDB is using 21.7 GB RAM on a 32 GB server shared with 11 Docker containers. Free RAM is approximately 10 GB. Any memory spike (Docker build, staging load test, container restart) risks OOM-killing the MongoDB primary, triggering an unplanned rs0 election.
- Monitoring gap: bms-2 and bms-3 have no node_exporter and are not in Prometheus. Issues on these servers (disk fill, RAM exhaustion, high CPU) are invisible until they cause an outage.
Immediate actions (before any phase work):
- Freeze bms-1: no restarts, no new deploys until Phase 1 is complete
- Export and safely store the three untagged Docker images from bms-1
- Install node_exporter on bms-2 and bms-3 and connect to Prometheus
2. Current State Assessment
Security Scorecard
| Category | bms-1 | bms-2 | bms-3 | bms-4 |
|---|---|---|---|---|
| OS currency | CRITICAL (Ubuntu 20.04 EOL Apr 2025) | Good (Ubuntu 24.04 LTS) | Fair (Ubuntu 22.04, EOL Apr 2027) | Fair (Ubuntu 22.04, EOL Apr 2027) |
| SSH security | Poor (root login, no fail2ban, no ufw) | Fair (ubuntu user, no fail2ban) | Fair (ubuntu user, no fail2ban) | Poor (root-only login, no fail2ban) |
| Firewall | Poor (ufw inactive, iptables partial) | Unknown | Unknown | Unknown |
| Docker security | Poor (Docker 24, old; Portainer v1 legacy) | N/A (no Docker) | Fair (Docker present, Portainer legacy) | Good (Docker CE 29.5.3 current) |
| MongoDB security | N/A | Good (auth + keyFile, port 0.0.0.0 concern) | Good (auth + keyFile, port 0.0.0.0 concern) | Good (auth + keyFile, arbiter only) |
| Off-server backup | CRITICAL (none) | Good (no stateful data) | Fair (MongoDB not explicitly backed up) | Good (no production data yet) |
| Monitoring coverage | Partial (node_exporter installed 2026-06-14) | None | None | Good (node_exporter + cadvisor) |
| Secrets management | Unknown (env vars in containers, not inventoried) | Fair (.env.local reference) | Unknown | Fair (.env on server) |
| Unattended upgrades | Off (EOL OS, moot) | Unknown | Unknown | Unknown |
| Disk encryption | None (OVH bare metal, no LUKS) | None | None | None |
| Log aggregation | None (Netdata local only) | None | None | None |
| Image tagging | CRITICAL (3 untagged images) | N/A | Fair (all from ECR) | Good (all from registries) |
Summary Score (1 = critical, 5 = good)
| Server | Score | Primary concern |
|---|---|---|
| bms-1 | 1/5 | EOL OS + no backups + untagged images |
| bms-2 | 3/5 | Monitoring gap + pending claude-admin setup |
| bms-3 | 2/5 | MongoDB OOM risk + no monitoring + Ubuntu 22.04 lifecycle |
| bms-4 | 4/5 | Newly provisioned — pending n8n migration + hardening |
3. Priority Issues Register
All issues across all servers sorted by priority.
P1 — Critical: Risk of data loss or unrecoverable service failure
| ID | Server | Issue | Impact |
|---|---|---|---|
| P1-01 | bms-1 | Ubuntu 20.04 EOL (April 2025) — no security patches for 14 months | Active CVE exposure on live production |
| P1-02 | bms-1 | No off-server backup of any service | Total data loss on hardware failure |
| P1-03 | bms-1 | Three untagged Docker images in use (v32-prod-socket, v32-prod-reso, v41-prod) | Unrecoverable service failure on any container stop |
| P1-04 | bms-1 | Disk at 85% used — ~170 GB unreviewed backups in /root | Will reach 100% again; production containers stop when disk is full |
| P1-05 | bms-3 | MongoDB PRIMARY using 21.7 GB of 32 GB RAM shared with 11 containers | OOM kill of MongoDB PRIMARY → unplanned rs0 election |
P2 — High: Risk of unplanned downtime or significant security exposure
| ID | Server | Issue | Impact |
|---|---|---|---|
| P2-01 | bms-1 | Portainer v1 (5 years old) exposed on port 49154 | Unpatched management UI; known CVEs in Portainer v1 |
| P2-02 | bms-1 | s3-v42-prod-02-25-old deprecated container still running | Resource waste; attack surface |
| P2-03 | bms-1 | PM2 NodeChat v1.0.0 at :3001 — 4 years running, unknown traffic | If serving live traffic, no recovery plan if it crashes |
| P2-04 | bms-1 | Port :8081 — unknown Node.js process | Unknown service; potential security exposure |
| P2-05 | bms-1 | PostgreSQL and Redis running host-native, not backed up | Silent data loss path |
| P2-06 | bms-1 | ufw inactive, no firewall policy | All ports reachable from internet by default |
| P2-07 | bms-2 | No node_exporter / Prometheus monitoring | Blind to disk fill, RAM, CPU on MongoDB observer |
| P2-08 | bms-3 | No node_exporter / Prometheus monitoring | Blind to impending OOM on MongoDB PRIMARY |
| P2-09 | bms-3 | MongoDB port 27017 bound to 0.0.0.0 — firewall state unknown | Potential direct internet exposure of database port |
| P2-10 | bms-2 | MongoDB port 27017 bound to 0.0.0.0 — firewall state unknown | Potential direct internet exposure of database port |
| P2-11 | bms-4 | n8n migration from vps-h1 pending — workflows running on unrelated host | Split risk: n8n failure on vps-h1 affects bms-4 plans |
| P2-12 | bms-4 | rs.addArb (MongoDB arbiter join) pending — rs0 has no quorum node | rs0 is running without its intended 3-node quorum |
P3 — Medium: Security hardening not yet implemented
| ID | Server | Issue | Impact |
|---|---|---|---|
| P3-01 | ALL | No fail2ban on any server | Brute-force SSH attempts unchecked |
| P3-02 | ALL | No unattended-upgrades configured | Security patches require manual application |
| P3-03 | bms-1 | Root SSH login with direct key access | Weaker access model; violates principle of least privilege |
| P3-04 | bms-4 | Root-only SSH, no claude-admin user | Inconsistent with bms-2/bms-3 access model |
| P3-05 | bms-1 | GitLab runner active — unclear if used alongside git-deploy container | Parallel deploy paths; audit and consolidation needed |
| P3-06 | bms-3 | Disk at 44% — no automated alert below 70% | Risk of undetected disk fill as staging workloads grow |
| P3-07 | ALL | No centralized log aggregation (Loki/Promtail) | Post-incident forensics impossible; no log-based alerts |
| P3-08 | bms-1 | v3.x containers using 4–5-year-old images from private registry | Potentially unreachable registry; severe unpatched vulnerabilities |
| P3-09 | bms-3 | Staging containers with ECR images up to 6 months since pull | Silent vulnerability drift |
| P3-10 | bms-1 | /var/log at 16 GB — no logrotate configured | Will contribute to disk fill |
| P3-11 | bms-1 | Netdata running locally but not integrated with Prometheus | Wasted monitoring resource; double agent overhead |
| P3-12 | ALL | No Docker image CVE scanning (Trivy/Grype) | Vulnerabilities in deployed containers undiscovered |
| P3-13 | bms-2 | claude-admin user not yet created | Claude agent access not configured |
| P3-14 | bms-3 | claude-admin user not yet created | Claude agent access not configured |
P4 — Low: Modernization and lifecycle management
| ID | Server | Issue | Impact |
|---|---|---|---|
| P4-01 | bms-3 | Ubuntu 22.04 LTS — EOL April 2027 (10 months away) | Plan migration before EOL window; MongoDB downtime risk if rushed |
| P4-02 | bms-4 | Ubuntu 22.04 LTS — EOL April 2027 | Same as P4-01 |
| P4-03 | bms-1 | Pinbox24 v3.x (legacy stack) — clients still on v3.x | Technical debt; v3.x images 4–5 years old, EOL stack |
| P4-04 | bms-1 | Registry consolidation — v3.x from private-registry.dev.pinbox24.com (possibly unreachable) vs v4.x from ECR | Risk of being unable to pull images for v3.x recovery |
| P4-05 | bms-2 | Claude Code not yet installed (AI-Dev-OV1 pending) | bms-2 not operational as AI dev environment |
| P4-06 | bms-1 | Docker 24 (old) — current is 29.x (bms-4 runs 29.5.3) | Old Docker misses security fixes and compose improvements |
| P4-07 | bms-3 | Portainer v1 (legacy) | Same as bms-1 |
| P4-08 | bms-1 | OpenBao secrets migration — env vars in containers not managed | No rotation capability; credential sprawl |
| P4-09 | ALL | Disk encryption (LUKS at rest) not implemented | Physical server access bypasses all OS security |
4. Phase 1: Critical Security
Goal: Eliminate all P1 issues across all servers before proceeding.
Target completion: Within 2 weeks of plan approval.
Prerequisite: Maintenance window agreement with Pinbox24 operators for bms-1 work.
Task 1.1 — bms-1: Export and re-tag untagged Docker images
Priority: P1-03
Effort: 2 hours
Risk: Low (read-only operation, no service changes)
Rollback: N/A (no changes to running services)
Three containers run from locally-stored image IDs with no registry source. They cannot be recovered after a container stop or host reboot.
ssh root@94.23.26.113
# Identify image IDs for the three untagged containers
docker inspect v32-prod-socket v32-prod-reso v41-prod \
--format '{{.Name}}: Image={{.Config.Image}} ID={{.Image}}'
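# Export each image to a tar archive
# (an image saved by ID is restored untagged by docker load; keep the IDs printed above so it can be tagged after a restore)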
docker save $(docker inspect v32-prod-socket --format '{{.Image}}') \
-o /tmp/v32-prod-socket-image-$(date +%Y%m%d).tar
docker save $(docker inspect v32-prod-reso --format '{{.Image}}') \
-o /tmp/v32-prod-reso-image-$(date +%Y%m%d).tar
docker save $(docker inspect v41-prod --format '{{.Image}}') \
-o /tmp/v41-prod-image-$(date +%Y%m%d).tar
# Upload to Wasabi for safekeeping
# (requires AWS CLI configured with p24-infra Wasabi credentials)
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
/tmp/v32-prod-socket-image-*.tar s3://p24-infra/docker-image-exports/bms-1/
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
/tmp/v32-prod-reso-image-*.tar s3://p24-infra/docker-image-exports/bms-1/
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
/tmp/v41-prod-image-*.tar s3://p24-infra/docker-image-exports/bms-1/
Acceptance criteria:
- All three tar files confirmed in Wasabi under docker-image-exports/bms-1/
- Verify images can be restored: docker load -i /tmp/v41-prod-image-*.tar
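A docker load test confirms that the local tar is readable, but not that the copy stored in Wasabi matches it. One way to check this, sketched here under the assumption that sha256sum is available on bms-1, is to download the uploaded archive again and compare digests. The function name checksum_matches and the paths in the usage note are illustrative and not part of the existing tooling.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if both files exist and have
# identical SHA-256 digests.
checksum_matches() {
  local original="$1" copy="$2"
  local sum_a sum_b
  sum_a=$(sha256sum "$original" 2>/dev/null | awk '{print $1}')
  sum_b=$(sha256sum "$copy" 2>/dev/null | awk '{print $1}')
  # An empty digest means the file could not be read; treat that as a mismatch.
  [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}
```

Possible usage after re-downloading an archive into a scratch directory (paths are placeholders): checksum_matches /tmp/v41-prod-image-20260614.tar /tmp/verify/v41-prod-image-20260614.tar && echo "upload verified".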
Task 1.2 — bms-1: Install automated off-server backup
Priority: P1-02
Effort: 4 hours
Risk: Medium (requires PostgreSQL and Redis audit to understand what data needs backing up)
Rollback: Stop the backup cron job; no service changes
Before any OS migration, an off-server backup must exist. This is a prerequisite for all other bms-1 Phase 1 work.
ssh root@94.23.26.113
# Audit what data exists
# PostgreSQL
sudo -u postgres psql -c '\l' # list databases
sudo -u postgres psql -c '\du' # list users
# Redis
redis-cli info keyspace # which databases have keys
# PM2 NodeChat - what data does it use?
pm2 list
cat /temp/p24-v-3.2/ecosystem.config.js 2>/dev/null || ls /temp/p24-v-3.2/
# MongoDB dumps in /root - verify they are current and complete
ls -lh /root/w3-2026-02-05/ /root/w4-2026-02-23/ /root/w4-2026-02-24/
Backup script (/opt/p24-infra/scripts/backup-bms1.sh), to be created:
#!/bin/bash
set -euo pipefail
BACKUP_DATE=$(date +%Y-%m-%d)
BACKUP_DIR="/tmp/bms1-backup-${BACKUP_DATE}"
S3_PREFIX="s3://p24-infra/bms-1-backups/${BACKUP_DATE}"
mkdir -p "$BACKUP_DIR"
# 1. PostgreSQL dump
sudo -u postgres pg_dumpall | gzip > "${BACKUP_DIR}/postgres-all.sql.gz"
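# 2. Redis dump (BGSAVE, wait for it to finish, then copy RDB)
LAST_SAVE=$(redis-cli LASTSAVE)
redis-cli BGSAVE
# LASTSAVE changes once the background save has completed; give up after 5 minutes
for _ in $(seq 1 300); do
  if [ "$(redis-cli LASTSAVE)" != "$LAST_SAVE" ]; then break; fi
  sleep 1
done
cp /var/lib/redis/dump.rdb "${BACKUP_DIR}/redis-dump.rdb"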
# 3. Docker volumes for nginx-proxy certs (Let's Encrypt)
docker cp nginx-proxy:/etc/nginx/certs "${BACKUP_DIR}/nginx-certs" 2>/dev/null || true
# 4. Container environment files (redacted - just keys, no values)
for ctr in v42-prod s3-v42-prod mailgun-v42-prod; do
docker inspect "$ctr" --format '{{range .Config.Env}}{{println .}}{{end}}' \
| grep -oP '^[A-Z0-9_]+=' > "${BACKUP_DIR}/${ctr}-env-keys.txt" 2>/dev/null || true
done
# 5. Upload to Wasabi
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 sync \
"${BACKUP_DIR}/" "${S3_PREFIX}/" --sse AES256
# 6. Cleanup
rm -rf "${BACKUP_DIR}"
echo "Backup complete: ${S3_PREFIX}"
# Install cron (daily at 01:00 UTC): keep the existing crontab and append the backup entry
( crontab -l 2>/dev/null; echo "0 1 * * * /opt/p24-infra/scripts/backup-bms1.sh >> /var/log/bms1-backup.log 2>&1" ) \
| crontab -
Acceptance criteria:
- First manual backup run completes without error
- Files visible in Wasabi p24-infra/bms-1-backups/
- PostgreSQL can be restored from dump in a test environment
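Installing the cron entry more than once would schedule the backup several times per night. A minimal sketch of an idempotent install follows; the helper name ensure_cron_line is hypothetical, and it only transforms text, so it can be tried without touching the real crontab.

```shell
#!/bin/bash
# Hypothetical helper: read an existing crontab on stdin and print it with
# the given entry appended only if that exact line is not already present.
ensure_cron_line() {
  local entry="$1" current
  current=$(cat)
  if [ -n "$current" ]; then
    printf '%s\n' "$current"
  fi
  # -F: fixed string, -x: whole-line match, so "*" in the schedule is literal
  if ! printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
    printf '%s\n' "$entry"
  fi
}
```

On bms-1 this could be wired up as: crontab -l 2>/dev/null | ensure_cron_line "0 1 * * * /opt/p24-infra/scripts/backup-bms1.sh >> /var/log/bms1-backup.log 2>&1" | crontab -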
Task 1.3 — bms-1: Emergency disk cleanup
Priority: P1-04
Effort: 2 hours
Risk: Medium — must not delete backups that are still the only copy of live data
Rollback: Cleanup is irreversible; verify each item before deletion
Disk is at 85% (354 GB of 440 GB). At 100%, running containers will crash on any write attempt.
Step 1 — Identify candidates (no deletions yet):
ssh root@94.23.26.113
# Largest directories
du -sh /root/*/ | sort -rh | head -20
du -sh /var/lib/docker/ | head -5
docker system df # Docker image/volume space usage
# Log files
du -sh /var/log
ls -lh /var/log/
# Old Docker images not used by any container
docker images --filter "dangling=true" --format "{{.ID}} {{.Size}}"
Step 2 — Safe cleanup (after Task 1.2 backup is confirmed):
# Remove deprecated container
docker stop s3-v42-prod-02-25-old
docker rm s3-v42-prod-02-25-old
# Remove dangling images
docker image prune -f
# Remove unused volumes
docker volume prune -f
# Archive /root MongoDB dumps to Wasabi, then delete local copies
# (each dump goes to its own prefix so files with the same name do not overwrite each other)
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
/root/w3-2026-02-05/ s3://p24-infra/bms-1-mongo-dumps/w3-2026-02-05/ --recursive
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
/root/w4-2026-02-23/ s3://p24-infra/bms-1-mongo-dumps/w4-2026-02-23/ --recursive
aws --endpoint-url https://s3.eu-central-2.wasabisys.com s3 cp \
/root/w4-2026-02-24/ s3://p24-infra/bms-1-mongo-dumps/w4-2026-02-24/ --recursive
# After confirming upload:
rm -rf /root/w3-2026-02-05 /root/w4-2026-02-23 /root/w4-2026-02-24
# Configure logrotate for /var/log
cat > /etc/logrotate.d/bms1-custom << 'EOF'
/var/log/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
}
EOF
logrotate -f /etc/logrotate.d/bms1-custom
Target state: Disk below 60% (264 GB used). This leaves 176 GB of headroom.
Acceptance criteria:
- df -h /dev/md127 shows < 70% used
- All production containers still running after cleanup: docker ps | grep -v Exited
- MongoDB dumps confirmed in Wasabi before local deletion
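To size the cleanup, it helps to know how much space has to go. The sketch below is a hypothetical helper (gb_to_free is not an existing script) that uses integer arithmetic on the figures quoted in this task. With 354 GB used of 440 GB, it reports 90 GB to reach the 60% target and 46 GB to reach the 70% acceptance threshold. df may compute its percentage against a slightly different base because of reserved blocks, so treat the result as an estimate.

```shell
#!/bin/bash
# Hypothetical helper: how many GB must be freed so that used space
# drops to target_pct of total. Integer arithmetic; prints 0 if already below.
gb_to_free() {
  local used_gb="$1" total_gb="$2" target_pct="$3"
  local target_used=$(( total_gb * target_pct / 100 ))
  local excess=$(( used_gb - target_used ))
  if [ "$excess" -lt 0 ]; then
    excess=0
  fi
  echo "$excess"
}
```

Example: gb_to_free 354 440 60 prints 90, which can be compared against the sizes reported by du and docker system df in Step 1.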
Task 1.4 — bms-1: OS Migration Planning (Ubuntu 20.04 → 24.04)
Priority: P1-01
Effort: Planning 4 hours; execution 8+ hours with maintenance window
Risk: CRITICAL — this is live production. In-place upgrade is NOT recommended.
Ubuntu 20.04 reached end-of-life in April 2025. An in-place do-release-upgrade on a server running 24 production containers with complex dependencies carries high risk of failure. The recommended approach is a parallel migration:
Recommended strategy: New server migration (not in-place upgrade)
- Provision a new OVH Kimsufi server with Ubuntu 24.04
- Migrate services one-by-one using blue-green DNS switching
- Keep bms-1 running as fallback for 2 weeks after migration
- Decommission bms-1 once all traffic confirmed on new server
Task 1.4 delivers: A detailed migration plan (tracked as a separate document). The actual migration executes in Phase 4 (Task 4.1).
Migration prerequisites checklist:
- Task 1.1 complete (untagged images exported and safe)
- Task 1.2 complete (backup running)
- Task 1.3 complete (disk < 70%)
- Port :8081 identified (Task 2.3)
- PM2 NodeChat traffic audit complete (Task 2.3)
- PostgreSQL and Redis data fully understood and backed up
- GitLab runner → git-deploy pipeline documented
- AWS ECR credentials confirmed available on new server
- Private registry (private-registry.dev.pinbox24.com) accessibility verified for v3.x images
Key decisions required from Pinbox24 operators before migration:
- Is v3.x (w3.pinbox24.com) still serving active clients? Can it be migrated together with v4.x or after?
- Is NodeChat on :3001 serving live traffic? If yes, who owns it?
- What is the acceptable maintenance window duration for the migration?
Task 1.5 — bms-3: Add MongoDB memory alert
Priority: P1-05
Effort: 1 hour
Risk: None (monitoring addition only)
Rollback: Remove alert rule
Immediate mitigation while a longer-term RAM strategy is evaluated.
# On vps-i1, add alert to prometheus rules
ssh root@217.154.82.162
# The appended rule must sit under the rules: list of the last group in the file;
# adjust the leading indentation to match the existing entries
cat >> /opt/p24-infra/monitoring/prometheus/rules/infrastructure.yml << 'EOF'
  - alert: BMS3MongoDBHighMemory
    expr: |
      node_memory_MemAvailable_bytes{server="p4-ovh-bms-3-ns3129867"}
        / node_memory_MemTotal_bytes{server="p4-ovh-bms-3-ns3129867"} < 0.15
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "bms-3 free RAM below 15% - MongoDB OOM risk"
      description: "bms-3 has less than 15% free RAM. MongoDB PRIMARY at risk of OOM kill."
EOF
Note: This alert becomes active once bms-3 node_exporter is connected (Task 2.1). For now, manually monitor:
ssh ubuntu@51.68.155.224 "free -h && ps aux --sort=-%mem | head -5"
Phase 1 Rollback Summary
| Task | Rollback procedure |
|---|---|
| 1.1 Image export | None needed (read-only). If Wasabi upload fails, re-run upload. |
| 1.2 Backup script | crontab -e, remove backup line. Script leaves services untouched. |
| 1.3 Disk cleanup | MongoDB dumps: retrieve from Wasabi (aws s3 cp). Docker containers: already removed intentionally. |
| 1.4 OS migration | Migration is on a new server; bms-1 unchanged until DNS cut-over. DNS rollback: repoint DNS records back to bms-1 IP within 60 seconds. |
| 1.5 Alert | Remove the BMS3MongoDBHighMemory rule from infrastructure.yml and reload Prometheus (curl -X POST http://localhost:9090/-/reload). |
Phase 1 Completion Criteria
- bms-1 untagged images exported and confirmed in Wasabi
- bms-1 daily backup running; first successful backup confirmed in Wasabi
- bms-1 disk below 70%
- bms-3 MongoDB memory alert defined (fires when < 15% free RAM)
- bms-1 OS migration plan written and approved
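Until bms-3 reports to Prometheus, the 15% threshold used by the Task 1.5 alert can be approximated by hand from /proc/meminfo, which is where node_exporter takes MemAvailable and MemTotal from. The following is a minimal sketch; the function names are hypothetical, and it reads meminfo-style text on stdin so it can be fed from an SSH session.

```shell
#!/bin/bash
# Hypothetical helper: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal (rounded down).
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", (avail * 100) / total
      else exit 1
    }
  '
}

# Succeeds (exit 0) when available memory is below the given percentage.
below_threshold() {
  local pct
  pct=$(mem_available_pct) || return 2
  [ "$pct" -lt "$1" ]
}
```

Possible usage from an admin workstation: ssh ubuntu@51.68.155.224 cat /proc/meminfo | below_threshold 15 && echo "bms-3 below 15% available RAM".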
5. Phase 2: Stability and High Risk
Goal: Eliminate all P2 issues. All servers stable with no single points of failure.
Target completion: Within 6 weeks of Phase 1 completion.
Task 2.1 — bms-2 and bms-3: Install node_exporter and connect to Prometheus
Priority: P2-07, P2-08
Effort: 1 hour
Risk: Low
Rollback: Stop and disable node_exporter service
# bms-2
ssh ubuntu@145.239.133.104
sudo apt-get update && sudo apt-get install -y prometheus-node-exporter
sudo systemctl enable --now prometheus-node-exporter
# Verify
curl http://localhost:9100/metrics | head -3
# bms-3
ssh ubuntu@51.68.155.224
sudo apt-get update && sudo apt-get install -y prometheus-node-exporter
sudo systemctl enable --now prometheus-node-exporter
curl http://localhost:9100/metrics | head -3
Add to monitoring/prometheus/prometheus.yml under the node job:
- targets: ['145.239.133.104:9100'] # p4-ovh-bms-2-ns3087638 - MongoDB rs0 observer + AI-Dev-OV1
  labels: { env: production, server_type: baremetal, server: p4-ovh-bms-2-ns3087638, location: ovh-fr }
- targets: ['51.68.155.224:9100'] # p4-ovh-bms-3-ns3129867 - MongoDB rs0 PRIMARY + staging
  labels: { env: production, server_type: baremetal, server: p4-ovh-bms-3-ns3129867, location: ovh-fr }
Reload Prometheus:
ssh root@217.154.82.162 "curl -sX POST http://localhost:9090/-/reload && echo OK"
Acceptance criteria:
- up{job="node", server=~"p4-ovh-bms-[23]-.*"} shows value 1 in Prometheus
- Grafana “Servers Overview” dashboard shows bms-2 and bms-3 metrics
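Before relying on the Task 1.5 memory alert, it is worth confirming that each exporter actually exposes the two metrics the rule divides. A small sketch follows; missing_metrics is a hypothetical helper that reads the exporter's text output on stdin and prints any required metric names it cannot find.

```shell
#!/bin/bash
# Hypothetical helper: read node_exporter text output on stdin and print each
# required metric name that has no sample line. Prints nothing if all exist.
missing_metrics() {
  local payload name
  payload=$(cat)
  for name in "$@"; do
    # A sample line starts with the metric name followed by a space or "{";
    # "# HELP" and "# TYPE" comment lines do not match.
    if ! printf '%s\n' "$payload" | grep -q "^${name}[ {]"; then
      echo "$name"
    fi
  done
}
```

Possible usage on bms-3: curl -s http://localhost:9100/metrics | missing_metrics node_memory_MemAvailable_bytes node_memory_MemTotal_bytes (empty output means both are present).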
Task 2.2 — MongoDB port firewall hardening (bms-2 and bms-3)
Priority: P2-09, P2-10
Effort: 2 hours
Risk: Medium — incorrect firewall rules can break rs0 replication
Rollback: sudo ufw disable to immediately restore open access
MongoDB is bound to 0.0.0.0:27017 on both bms-2 and bms-3. Port 27017 should only accept connections from the three replica set members and the local machine.
rs0 member IPs that need MongoDB access:
| Server | IP |
|---|---|
| bms-2 | 145.239.133.104 |
| bms-3 | 51.68.155.224 |
| bms-4 | 54.36.123.110 |
# bms-3 — run as ubuntu with sudo
ssh ubuntu@51.68.155.224
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp comment 'SSH'
sudo ufw allow 80/tcp comment 'HTTP'
sudo ufw allow 443/tcp comment 'HTTPS'
# MongoDB: only from replica set members
sudo ufw allow from 145.239.133.104 to any port 27017 comment 'MongoDB bms-2'
sudo ufw allow from 54.36.123.110 to any port 27017 comment 'MongoDB bms-4'
sudo ufw allow from 127.0.0.1 to any port 27017 comment 'MongoDB localhost'
# node_exporter: only from Prometheus (vps-i1)
sudo ufw allow from 217.154.82.162 to any port 9100 comment 'Prometheus node_exporter'
# Enable (KEEP EXISTING SSH SESSION OPEN)
sudo ufw --force enable
sudo ufw status verbose
Apply the same rules on bms-2, but replace the 'MongoDB bms-2' rule with one that allows bms-3 (51.68.155.224): each server needs rules for the other two rs0 members, not for its own IP.
Critical: Keep an active SSH session open before running ufw enable. Open a second session immediately after to verify SSH still works.
Acceptance criteria:
- sudo ufw status shows active with correct rules on both servers
- rs0 replication continues: mongosh --eval 'rs.status()' on bms-3 shows all members healthy
- Port 27017 is NOT reachable from an external IP: nmap -p 27017 51.68.155.224 from a machine not in the allowlist
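Because each server's MongoDB allowlist is "every rs0 member except itself", the per-host rules can be derived from the member table rather than edited by hand. The sketch below only prints the commands so they can be reviewed before running; the helper name mongo_ufw_rules and the RS0_MEMBERS variable are hypothetical, while the IPs come from the table in this task.

```shell
#!/bin/bash
# rs0 member IPs, taken from the table in Task 2.2.
RS0_MEMBERS="bms-2=145.239.133.104 bms-3=51.68.155.224 bms-4=54.36.123.110"

# Hypothetical helper: print (not run) the MongoDB allow rules for one host,
# covering every rs0 member except the host itself, plus localhost.
mongo_ufw_rules() {
  local self="$1" entry name ip
  for entry in $RS0_MEMBERS; do
    name=${entry%%=*}
    ip=${entry#*=}
    if [ "$name" = "$self" ]; then
      continue
    fi
    echo "sudo ufw allow from $ip to any port 27017 comment 'MongoDB $name'"
  done
  echo "sudo ufw allow from 127.0.0.1 to any port 27017 comment 'MongoDB localhost'"
}
```

Running mongo_ufw_rules bms-2 prints the rules for bms-3, bms-4, and localhost; the output can be pasted into the bms-2 session after review.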
Task 2.3 — bms-1: Investigate and document PM2 NodeChat and port 8081
Priority: P2-03, P2-04
Effort: 2 hours
Risk: Low (investigation only)
ssh root@94.23.26.113
# PM2 investigation
pm2 list
pm2 show 0 # details for NodeChat
cat /temp/p24-v-3.2/package.json
# Check what is listening on :3001 and whether any clients are currently connected
netstat -tlnp | grep ':3001'
netstat -tnp | grep ':3001'
# Check nginx-proxy config for :3001
docker exec nginx-proxy cat /etc/nginx/conf.d/default.conf | grep -i 3001 || echo "Not in nginx"
# Port 8081 investigation
netstat -tlnp | grep 8081
lsof -i :8081
# Identify the process
ls -la /proc/$(lsof -ti :8081)/exe 2>/dev/null
Expected outcome: A documented entry in the bms-1 operations workbook for both services with:
- What the service does
- Whether it serves live traffic (via nginx-proxy or direct)
- Whether it can be safely stopped
- Migration plan if it is live traffic
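For the workbook entry, the owning process of each port can be pulled out of the netstat output mechanically. This is a minimal sketch that assumes the usual net-tools column layout (State in column 6, PID/Program name in column 7); listener_for_port is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read `netstat -tlnp` output on stdin and print the
# PID/program column for sockets listening on exactly the given port.
listener_for_port() {
  local port="$1"
  awk -v port="$port" '
    $6 == "LISTEN" && $4 ~ (":" port "$") { print $7 }
  ' | sort -u
}
```

Possible usage on bms-1: netstat -tlnp | listener_for_port 8081, then the same for 3001.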
Task 2.4 — bms-1: Set up Portainer v1 replacement or upgrade
Priority: P2-01
Effort: 2 hours
Risk: Low — Portainer is a management tool, not a production service
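Rollback: Re-create the Portainer v1 container with docker run; docker start portainer-pinbox24 only works if the old container was stopped but not yet removed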
Portainer v1 is over 5 years old and has known CVEs. Port 49154 is exposed to the internet.
ssh root@94.23.26.113
# Stop and remove Portainer v1
docker stop portainer-pinbox24
docker rm portainer-pinbox24
# Option A: Install Portainer CE (v2+, free)
docker volume create portainer_data
docker run -d -p 49154:9000 --name portainer \
--restart=unless-stopped \
-v /var/run/docker.sock:/var/run/docker.sock \
-v portainer_data:/data \
portainer/portainer-ce:latest
# Option B: Add nginx-proxy auth and restrict to localhost only (more secure)
# Then access via SSH tunnel: ssh -L 49154:localhost:49154 root@94.23.26.113
Recommended: Option B (localhost only with SSH tunnel) provides better security while preserving access. Option A upgrades functionality but keeps the port open.
Task 2.5 — bms-4: Complete n8n migration from vps-h1
Priority: P2-11
Effort: 4 hours
Risk: Medium — n8n workflows are live automation; must verify all workflows intact
Rollback: Repoint WAHA webhook back to n8n.vps-h1.infra.zintegrowana.online
Follow the migration procedure documented in docs/servers/p4-ovh-bms-4-ns3101999-operations.md under “n8n Migration from vps-h1”.
Pre-migration checklist:
- bms-4 docker-compose.yml deployed to server
- .env from vps-h1 copied to bms-4
- bms-4 volumes created and data restored
- n8n accessible at https://n8n.bms-4.infra.zintegrowana.online
- All workflows visible and enabled in n8n UI
- Test execution of at least one workflow manually
Post-migration:
- WAHA webhook URL updated
- n8n on vps-h1 stopped (NOT removed for 7 days as fallback)
- Monitoring: n8n health check added to Prometheus/Grafana
Task 2.6 — bms-4: Complete MongoDB arbiter join (HUMAN ACTION REQUIRED)
Priority: P2-12
Effort: 30 minutes
Risk: Low — adding an arbiter does not affect data, only quorum
Rollback: rs.remove("54.36.123.110:27017") on bms-3
# Connect to bms-3 (holds MongoDB admin credentials)
ssh ubuntu@51.68.155.224
mongosh -u admin -p "$MONGODB_ADMIN_PASSWORD" --authenticationDatabase admin
# In mongosh:
rs.addArb("54.36.123.110:27017")
rs.remove("51.83.132.99:27017") # remove dead arbiter
rs.status()
# Expected: 3 members — 1 PRIMARY (bms-3), 1 SECONDARY (bms-2, non-voting), 1 ARBITER (bms-4)
exit
Note: MongoDB admin password must be obtained from Pinbox24 application secrets or AWS Secrets Manager. This task requires human action (Claude does not have the MongoDB admin password).
Task 2.7 — bms-3: Evaluate MongoDB RAM usage and capacity planning
Priority: P1-05 mitigation
Effort: 2 hours analysis
Risk: None for the analysis; applying the WiredTiger cache limit requires a mongod restart on the rs0 PRIMARY
With MongoDB using 21.7 GB of 32 GB on bms-3, the remaining 10.3 GB is shared between the OS, Docker daemon, and 11 containers.
ssh ubuntu@51.68.155.224
free -h
docker stats --no-stream --format "table {{.Container}}\t{{.MemUsage}}"
Options to evaluate:
| Option | Pros | Cons |
|---|---|---|
| Set WiredTiger cache limit | Immediate RAM reduction | May impact MongoDB query performance |
| Move staging containers to bms-4 | Separates workloads cleanly | Requires staging URLs reconfiguration |
| Upgrade bms-3 to higher-RAM OVH server | No application changes | OVH migration cost and downtime |
| Promote bms-2 to PRIMARY, move workload | Distributes load | Complex rs0 reconfiguration |
Recommended: Set WiredTiger cache limit on bms-3 as immediate relief, then migrate staging containers to bms-4 (which has 1.8 TB disk and 32 GB RAM with only ~75 MB used by MongoDB arbiter).
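# Immediate: limit WiredTiger cache to 8 GB (default is 50% of (RAM - 1 GB), about 15.5 GB on this server)
# Edit /etc/mongod.conf on bms-3; keep a copy of the current config first
sudo cp /etc/mongod.conf /etc/mongod.conf.bak
# If cacheSizeGB is already set, change it in place:
sudo sed -i '/wiredTiger:/,/^[^ ]/{s/cacheSizeGB:.*/cacheSizeGB: 8/}' /etc/mongod.conf
# If it is not present, add these keys under the existing storage: section with an editor
# (do not append a second top-level storage: key, which would duplicate the existing one):
#   storage:
#     wiredTiger:
#       engineConfig:
#         cacheSizeGB: 8
# Restarting mongod on the PRIMARY briefly interrupts rs0; schedule accordingly
sudo systemctl restart mongod
Acceptance criteria:
- db.serverStatus().wiredTiger.cache["maximum bytes configured"] in mongosh shows ~8 GB
- Free RAM on bms-3 increases to > 15 GB after restart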
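MongoDB documents the default WiredTiger internal cache as the larger of 50% of (RAM minus 1 GB) or 256 MB. The sketch below works that out for a given RAM size; wt_default_cache_gb is a hypothetical helper name, and it assumes mongod sees the host's full memory rather than a container limit. For the 32 GB bms-3 host it gives 15.5 GB, so an 8 GB cap lowers the cache ceiling by about 7.5 GB. Total mongod memory also includes connections and other allocations outside this cache, so the freed RAM may differ.

```shell
#!/bin/bash
# Default WiredTiger internal cache size in GB for a host with the given RAM:
# the larger of 50% of (RAM - 1 GB) or 0.25 GB (256 MB).
wt_default_cache_gb() {
  awk -v ram="$1" 'BEGIN {
    c = 0.5 * (ram - 1)
    if (c < 0.25) c = 0.25
    printf "%.2f\n", c
  }'
}
```

Example: wt_default_cache_gb 32 prints 15.50.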
Phase 2 Rollback Summary
| Task | Rollback procedure |
|---|---|
| 2.1 node_exporter | systemctl stop prometheus-node-exporter |
| 2.2 UFW hardening | ufw disable restores open access immediately |
| 2.3 PM2 / port 8081 | Investigation only, no rollback needed |
| 2.4 Portainer upgrade | docker stop portainer && docker run ... portainer/portainer:latest (v1 image) |
| 2.5 n8n migration | Update WAHA webhook back to vps-h1; restart n8n on vps-h1 |
| 2.6 MongoDB arbiter | rs.remove("54.36.123.110:27017") from bms-3 mongosh |
| 2.7 WiredTiger limit | Remove cacheSizeGB from mongod.conf and restart |
Phase 2 Completion Criteria
- bms-2 and bms-3 visible in Prometheus and Grafana
- Port 27017 not reachable from internet on bms-2 and bms-3
- PM2 NodeChat service fully documented and decision made (keep/decommission)
- Port 8081 identified and documented
- Portainer v1 replaced on bms-1
- n8n migrated to bms-4 and stable for 7 days
- rs0 has 3 functioning members including bms-4 arbiter
- bms-3 MongoDB cache limited to 8 GB; free RAM > 15 GB
6. Phase 3: Hardening
Goal: Implement security best practices across all servers. No blocking P3 issues.
Target completion: Within 12 weeks of Phase 1 completion.
Task 3.1 — ALL: Deploy fail2ban and SSH hardening via Ansible
Priority: P3-01, P3-03, P3-04
Effort: 1 day (follows spec docs/improvements/09-ssh-hardening.md)
The SSH hardening spec is already written. For BMS servers, adapt the VPS procedure:
For bms-3 and bms-2 (ubuntu user access):
# Install fail2ban
ssh ubuntu@51.68.155.224
sudo apt-get update && sudo apt-get install -y fail2ban
# sudo does not apply to a shell redirect, so write the file through sudo tee
sudo tee /etc/fail2ban/jail.local > /dev/null << 'EOF'
[sshd]
enabled = true
port = 22
filter = sshd
logpath = /var/log/auth.log
maxretry = 5
bantime = 3600
findtime = 600
EOF
sudo systemctl enable --now fail2ban
sudo fail2ban-client status sshd
For bms-1 (root-only access), apply after Phase 2 (migration planned):
Given bms-1 is being migrated, defer SSH hardening to the new server setup.
For bms-4 (root-only access):
ssh root@54.36.123.110
apt-get update && apt-get install -y fail2ban
# Create claude-admin user for non-root access
useradd -m -s /bin/bash claude-admin
mkdir -p /home/claude-admin/.ssh
# Install VPS_SSH_PRIVATE_KEY public key
echo "<VPS_SSH_PRIVATE_KEY public part>" > /home/claude-admin/.ssh/authorized_keys
chmod 700 /home/claude-admin/.ssh && chmod 600 /home/claude-admin/.ssh/authorized_keys
chown -R claude-admin:claude-admin /home/claude-admin/.ssh
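echo "claude-admin ALL=(ALL) NOPASSWD: /usr/bin/docker, /bin/systemctl, /bin/mkdir, /bin/chown, /bin/cp, /usr/bin/tee" \
| tee /etc/sudoers.d/claude-admin
# sudoers drop-ins should be mode 0440; check syntax before relying on the file
chmod 440 /etc/sudoers.d/claude-admin
visudo -cf /etc/sudoers.d/claude-admin
# Test claude-admin access BEFORE disabling root
Task 3.2 — ALL: Configure unattended-upgrades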
Priority: P3-02
Effort: 30 minutes per server
Risk: Low — unattended-upgrades is configured to only apply security updates, not dist-upgrades
# bms-2, bms-3, bms-4 (Ubuntu)
sudo apt-get install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades # interactive: select Yes
# Verify
sudo unattended-upgrades --dry-run --debug 2>&1 | head -20
Task 3.3 — bms-3: Deploy staging containers to bms-4
Priority: P3-06 (disk fill risk) + P1-05 mitigation (RAM)
Effort: 4 hours
Risk: Medium — requires updating all staging URLs and testing
Moving Pinbox24 staging containers (v31/v32/v41/v42) from bms-3 to bms-4 would:
- Free ~10+ GB RAM on bms-3 for MongoDB
- Use bms-4’s 1.8 TB disk (currently only 1% used)
- Separate MongoDB PRIMARY from staging workloads
# 1. On bms-3: export compose configuration
docker inspect v42-stage --format '{{json .Config.Env}}' > /tmp/v42-stage-env.json
docker inspect v41-stage --format '{{json .Config.Env}}' > /tmp/v41-stage-env.json
# ... repeat for all staging containers
# 2. Create bms-4/staging-docker-compose.yml in the repo
# 3. Deploy on bms-4
# 4. Update DNS: *.staging.pinbox24.com → 54.36.123.110
# 5. Verify staging works on new server
# 6. Stop staging containers on bms-3 after verification period
Task 3.4 — bms-1: Configure logrotate
Priority: P3-10
Effort: 30 minutes
Risk: None
ssh root@94.23.26.113
# Check current logrotate config
logrotate -d /etc/logrotate.conf 2>&1 | tail -20
# Fix large /var/log
cat > /etc/logrotate.d/bms1-nginx << 'EOF'
/var/log/nginx/*.log {
daily
rotate 14
compress
delaycompress
missingok
notifempty
sharedscripts
postrotate
docker exec nginx-proxy nginx -s reopen 2>/dev/null || true
endscript
}
EOF
# Force immediate rotation
logrotate -f /etc/logrotate.d/bms1-nginx
Task 3.5 — ALL: Docker image CVE scanning with Trivy
Priority: P3-08, P3-09, P3-12
Effort: 1 day (follows spec docs/improvements/08-image-cve-scanning.md)
For bms-1 specifically — the v3.x images are 4–5 years old and will have critical CVEs. The scan results inform whether to emergency-upgrade or migrate clients off v3.x.
# Install Trivy on bms-1
ssh root@94.23.26.113
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh \
| sh -s -- -b /usr/local/bin v0.51.0
# Scan all running container images
docker ps --format '{{.Image}}' | sort -u | while read -r img; do
  echo "=== Scanning: $img ==="
  trivy image --severity HIGH,CRITICAL "$img" 2>/dev/null | tail -20
done
Task 3.6 — bms-2 and bms-3: Create claude-admin user
Priority: P3-13, P3-14
Effort: 30 minutes per server
Follow the claude-admin setup procedure already documented in each server’s operations workbook. This enables Claude agent access for automated tasks.
Phase 3 Completion Criteria
- fail2ban active and configured on bms-2, bms-3, bms-4
- unattended-upgrades enabled on bms-2, bms-3, bms-4
- claude-admin user exists on bms-2, bms-3, and bms-4
- Trivy scan complete on bms-1; CVE report reviewed; P1 CVEs addressed
- logrotate configured on bms-1; /var/log < 5 GB
- Staging containers migrated to bms-4 OR bms-3 MongoDB cache limited (at least one RAM remedy deployed)
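Several of these criteria can be checked mechanically on each server. This is a minimal sketch, assuming systemd manages fail2ban under the unit name fail2ban and using du to measure /var/log against the 5 GB target. The helper names dir_below_gb, unit_active, and report are hypothetical.

```shell
# Succeeds when the directory's total size is below the limit (in GB).
dir_below_gb() {
  used_kb=$(du -sk "$1" 2>/dev/null | awk '{print $1}')
  [ -n "$used_kb" ] && [ "$used_kb" -lt $(( $2 * 1024 * 1024 )) ]
}

# Succeeds when the named systemd unit is active.
unit_active() {
  systemctl is-active --quiet "$1"
}

# Print PASS or FAIL for a labelled check.
report() {
  label=$1; shift
  if "$@"; then echo "PASS $label"; else echo "FAIL $label"; fi
}
```

Example calls, run as root: report "fail2ban active" unit_active fail2ban, and on bms-1: report "/var/log below 5 GB" dir_below_gb /var/log 5.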
7. Phase 4: Modernization
Goal: All servers on latest LTS. All software current. No legacy technical debt blocking new work.
Target completion: Before February 2027 (3 months before Ubuntu 22.04 EOL in April 2027).
Task 4.1 — bms-1: Complete OS migration to Ubuntu 24.04
Priority: P4 (planning in Phase 1, execution here)
Effort: 8–16 hours depending on complexity
Risk: HIGH — live production migration
This is the migration designed in Phase 1 Task 1.4. Key steps:
- Provision new OVH server with Ubuntu 24.04
- Install Docker CE (current), nginx-proxy, AWS ECR authentication
- Deploy v4.x stack (containers from ECR — cleanest)
- Deploy v3.x stack (from exported images — Task 1.1 export)
- Test all services in staging configuration
- Schedule maintenance window with Pinbox24 operators
- DNS cut-over with short TTL (60 seconds)
- Monitor for 24 hours on new server
- Decommission old bms-1 (keep for 7 days as fallback)
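The testing step can exercise the new server under its production hostnames before any DNS change. This is a minimal sketch using curl's --resolve option, which pins a hostname to a chosen IP for a single request. The new server's address is not known yet, so 203.0.113.10 in the usage line is a placeholder, and smoke_test is a hypothetical helper name.

```shell
# Request https://HOST/ from a specific IP, bypassing DNS, and print the
# HTTP status code. Certificate validation still uses HOST, so a failure
# here can also mean the new server has no valid certificate for that name.
smoke_test() {
  host=$1; ip=$2
  curl --silent --show-error --max-time 10 \
       --resolve "$host:443:$ip" \
       --output /dev/null --write-out '%{http_code}\n' \
       "https://$host/"
}
```

Usage: smoke_test w3.pinbox24.com 203.0.113.10. Comparing the status code against the same request sent to the old server (94.23.26.113) gives a quick like-for-like check per domain.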
Zero-downtime strategy:
- Use nginx-proxy on both old and new server simultaneously
- Cut over one domain at a time, verifying each before proceeding
- Start with lowest-traffic service (v3.x staging variants)
- End with v42-prod (highest traffic, last to switch)
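For the one-domain-at-a-time cut-over, each step can be gated on the resolver actually returning the new address. This is a minimal sketch using dig +short. The domain and new IP passed in are placeholders to be replaced with real values, and dns_points_to and wait_for_cutover are hypothetical names.

```shell
# Succeeds when DOMAIN currently resolves to IP (A record, via dig).
dns_points_to() {
  dig +short "$1" A | grep -qxF "$2"
}

# Poll until the domain resolves to the new IP or the attempts run out.
# Defaults: 10 attempts, 30 seconds apart.
wait_for_cutover() {
  domain=$1; ip=$2; attempts=${3:-10}; delay=${4:-30}
  while [ "$attempts" -gt 0 ]; do
    dns_points_to "$domain" "$ip" && return 0
    attempts=$((attempts - 1))
    if [ "$attempts" -gt 0 ]; then sleep "$delay"; fi
  done
  return 1
}
```

Usage: wait_for_cutover w3.pinbox24.com 203.0.113.10 && echo "proceed to next domain". The check reflects only the resolver used by the machine running it, so it complements rather than replaces watching traffic arrive on the new server.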
Task 4.2 — bms-3 and bms-4: Plan Ubuntu 22.04 → 24.04 upgrade
Priority: P4-01, P4-02
Effort: 4 hours planning; execution timed around MongoDB maintenance
Timeline: Must complete before April 2027
For bms-3 (MongoDB PRIMARY), the upgrade strategy requires careful rs0 coordination:
Phase 4.2 procedure:
1. Promote bms-2 to PRIMARY (rs.stepDown() on bms-3)
2. Perform do-release-upgrade on bms-3 (Ubuntu 22.04 → 24.04)
OR provision new bms-3-replacement and migrate
3. Re-join bms-3 to rs0 as SECONDARY
4. Verify rs0 healthy for 24h
5. Perform same upgrade on bms-4 (lower risk: arbiter only)
# Step 1: Force PRIMARY stepdown to bms-2 (run from bms-3)
# Note: bms-2 is an observer (non-voting, priority 0). MongoDB rejects a priority
# above 0 on a member with votes: 0, so votes and priority must be raised together.
mongosh -u admin -p "$MONGODB_ADMIN_PASSWORD" --authenticationDatabase admin --eval '
cfg = rs.conf();
cfg.members[1].votes = 1;     // index 1 is assumed to be bms-2: verify with rs.conf() first
cfg.members[1].priority = 2;
rs.reconfig(cfg);
rs.stepDown();
'
Note: bms-2 is configured as a non-voting observer (priority 0). A PRIMARY failover with only one other data member (bms-2) and one arbiter requires careful configuration. Verify the rs0 configuration before executing.
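The verification called for in the note can be sketched as follows. This is an illustrative check rather than an established procedure: rs_members prints one line per replica set member from rs.conf(), and electable_data_members counts the members that could become PRIMARY (voting, priority above 0, not an arbiter). Both function names are hypothetical; the mongosh connection flags mirror the command used in this task.

```shell
# Print "index host priority votes arbiterOnly" for each rs0 member.
rs_members() {
  mongosh --quiet -u admin -p "$MONGODB_ADMIN_PASSWORD" \
    --authenticationDatabase admin --eval '
      rs.conf().members.forEach((m, i) =>
        print([i, m.host, m.priority, m.votes, m.arbiterOnly === true].join(" ")));
    '
}

# Count members that are eligible to be elected PRIMARY.
electable_data_members() {
  awk '$3 > 0 && $4 > 0 && $5 != "true" { n++ } END { print n + 0 }'
}
```

Usage: rs_members | electable_data_members. Under the configuration described in this plan, the count should be 1 before the reconfig (bms-3 only) and 2 after it. A count of 1 at stepdown time means no member can take over as PRIMARY.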
Task 4.3 — bms-2: Install Claude Code (AI-Dev-OV1)
Priority: P4-05
Effort: 2 hours
ssh ubuntu@145.239.133.104
# Install Claude Code
curl -fsSL https://claude.ai/install.sh | bash
# Or follow provision-vps playbook
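# Copy OAuth credentials (run these two commands from the local machine, not the bms-2 session)
scp ~/.claude/.credentials.json ubuntu@145.239.133.104:/tmp/
ssh ubuntu@145.239.133.104 "mkdir -p ~/.claude && mv /tmp/.credentials.json ~/.claude/ && chmod 600 ~/.claude/.credentials.json"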
# Verify
claude --version
Task 4.4 — bms-1 (new server): Docker version upgrade
Priority: P4-06
Effort: 0 additional hours (handled during Task 4.1 new server provisioning)
New bms-1 server will be provisioned with Docker CE current (29.x). The legacy Docker 24 on old bms-1 does not need upgrading — the server is being replaced.
Task 4.5 — bms-1: v3.x sunset planning
Priority: P4-03, P4-04
Effort: 2 hours planning + ongoing client migration
# Identify active v3.x clients (check nginx logs)
ssh root@94.23.26.113
docker exec nginx-proxy tail -1000 /var/log/nginx/access.log \
| grep 'w3.pinbox24.com' | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
Output informs whether v3.x still has active traffic. If yes, a client migration plan is required. If no, v3.x containers can be stopped during the bms-1 OS migration.
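Because the last 1000 log lines may cover only a short window, a per-day breakdown can make the active-or-not call less ambiguous. This is a minimal sketch, assuming each access log line contains the hostname and a standard nginx timestamp of the form [dd/Mon/yyyy:...]. The helper name hits_per_day is hypothetical.

```shell
# Count log lines per calendar day for requests that mention HOST.
# Reads access log lines on stdin; prints "dd/Mon/yyyy COUNT" in the
# order the days first appear. HOST is matched as a fixed string.
hits_per_day() {
  awk -v host="$1" '
    index($0, host) && match($0, /\[[0-9][0-9]\/[A-Za-z][A-Za-z][A-Za-z]\/[0-9][0-9][0-9][0-9]/) {
      day = substr($0, RSTART + 1, RLENGTH - 1)
      if (!(day in n)) order[++k] = day
      n[day]++
    }
    END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }
  '
}
```

Usage on bms-1: docker exec nginx-proxy cat /var/log/nginx/access.log | hits_per_day w3.pinbox24.com. Steady daily counts suggest live clients; a handful of hits on isolated days is more likely scanners or health checks and is worth checking by client IP.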
Phase 4 Completion Criteria
- bms-1 running on Ubuntu 24.04; all production services migrated and stable
- Old bms-1 (94.23.26.113) decommissioned
- bms-3 running on Ubuntu 24.04 or upgrade scheduled with rs0 maintenance plan
- bms-4 running on Ubuntu 24.04 or upgrade scheduled
- bms-2 Claude Code installed; AI-Dev-OV1 operational
- v3.x sunset plan documented and client migration in progress
8. Server-by-Server Modernization Checklist
bms-1 (94.23.26.113) — Pinbox24 Production
| Phase | Task | Status |
|---|---|---|
| P1 | Export untagged Docker images to Wasabi | [ ] Pending |
| P1 | Set up off-server backup (PostgreSQL, Redis, Docker config) | [ ] Pending |
| P1 | Disk cleanup — target below 70% | [ ] Pending |
| P1 | OS migration plan written | [ ] Pending |
| P2 | PM2 NodeChat investigated and documented | [ ] Pending |
| P2 | Port 8081 identified and documented | [ ] Pending |
| P2 | Portainer v1 replaced | [ ] Pending |
| P2 | Deprecated container s3-v42-prod-02-25-old removed | [ ] Pending |
| P3 | logrotate configured for /var/log | [ ] Pending |
| P3 | Trivy CVE scan run; results reviewed | [ ] Pending |
| P4 | OS migration to Ubuntu 24.04 (new server) | [ ] Pending |
| P4 | v3.x sunset plan completed | [ ] Pending |
bms-2 (145.239.133.104) — MongoDB Observer + AI-Dev-OV1
| Phase | Task | Status |
|---|---|---|
| P2 | node_exporter installed and in Prometheus | [ ] Pending |
| P2 | UFW firewall configured (port 27017 restricted) | [ ] Pending |
| P3 | fail2ban installed and active | [ ] Pending |
| P3 | unattended-upgrades enabled | [ ] Pending |
| P3 | claude-admin user created | [ ] Pending |
| P4 | Claude Code installed (AI-Dev-OV1 active) | [ ] Pending |
| P4 | Ubuntu 22.04 → 24.04 upgrade plan written | [ ] Pending |
bms-3 (51.68.155.224) — MongoDB PRIMARY + Staging
| Phase | Task | Status |
|---|---|---|
| P1 | MongoDB memory alert added to Prometheus rules | [ ] Pending |
| P2 | node_exporter installed and in Prometheus | [ ] Pending |
| P2 | UFW firewall configured (port 27017 restricted) | [ ] Pending |
| P2 | MongoDB WiredTiger cache limited to 8 GB | [ ] Pending |
| P3 | fail2ban installed and active | [ ] Pending |
| P3 | unattended-upgrades enabled | [ ] Pending |
| P3 | claude-admin user created | [ ] Pending |
| P3 | Staging containers migrated to bms-4 (RAM relief) | [ ] Pending |
| P4 | Ubuntu 22.04 → 24.04 upgrade (with rs0 maintenance) | [ ] Pending |
bms-4 (54.36.123.110) — MongoDB Arbiter + Docker Host
| Phase | Task | Status |
|---|---|---|
| P2 | rs.addArb completed (HUMAN ACTION REQUIRED) | [ ] Pending |
| P2 | n8n migrated from vps-h1 | [ ] Pending |
| P2 | docker-compose.yml deployed to server | [ ] Pending |
| P3 | fail2ban installed and active | [ ] Pending |
| P3 | unattended-upgrades enabled | [ ] Pending |
| P3 | claude-admin user created | [ ] Pending |
| P4 | Ubuntu 22.04 → 24.04 upgrade | [ ] Pending |
9. Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
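| bms-1 disk reaches 100% before cleanup completes | High | Critical — production outage | Emergency: docker system prune -f frees image cache immediately (~20 GB typical). Run it only after the Task 1.1 export is complete, because it also deletes stopped containers and dangling (untagged) images. Then execute Task 1.3. |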
| bms-3 MongoDB OOM kill during staging load test | Medium | High — rs0 election, potential write downtime | WiredTiger cache limit (Task 2.7) reduces this. Alert (Task 1.5) provides warning. |
| Untagged image container (v32-prod-socket etc.) stops before export | Medium | Critical — cannot restart | Freeze policy: do not restart bms-1 for any reason until Task 1.1 complete. |
| bms-1 OS migration fails mid-flight | Low | Critical — production down | Use new-server migration (not in-place). Rollback = DNS repoint back to old bms-1 (TTL 60s). |
| rs.addArb (bms-4) fails due to keyFile mismatch | Low | Medium — arbiter not joining | keyFile md5 verified before attempt; fallback: rs0 runs with 2 data members + no arbiter (no immediate data risk). |
| Wasabi credentials unavailable for image export | Low | High — Task 1.1 blocked | Use SCP to local machine as backup: scp 'root@94.23.26.113:/tmp/*.tar' /tmp/ (quoted so the wildcard expands on the server) |
| SSH lockout during UFW hardening | Low | High — server inaccessible | Keep active session open; OVH IPMI/KVM access (server ID 1823494) as recovery path. |
| Docker pull fails on v3.x images from private registry | Medium | High — v3.x cannot be recovered if containers stop | Complete Task 1.1 export before any v3.x container restarts. |
| bms-3 Ubuntu 22.04 EOL surprise (patches stop April 2027) | Certain (known date) | Medium — same as bms-1 today | Phase 4 planned before that date. |
10. Success Criteria
Phase 1 Complete
- All P1 issues resolved on all servers
- bms-1 has verified off-server backup in Wasabi
- bms-1 disk below 70%
- bms-1 untagged images safe in Wasabi
- bms-3 MongoDB memory alert active in Grafana
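The 70% disk target can be verified from a shell on bms-1. This is a minimal sketch using df -P (the POSIX output format); the function names disk_used_pct and disk_below are hypothetical.

```shell
# Print the used-space percentage (without the % sign) of the filesystem
# that holds PATH, using POSIX df output.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when usage is strictly below LIMIT percent.
disk_below() {
  [ "$(disk_used_pct "$1")" -lt "$2" ]
}
```

Usage: disk_below / 70 && echo "disk target met". If Docker data lives on a separate filesystem, the same check should be pointed at that mount instead of /.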
Phase 2 Complete
- All P2 issues resolved on all servers
- bms-2 and bms-3 visible in Prometheus and Grafana
- MongoDB port 27017 not accessible from internet on any server
- rs0 has 3 members including bms-4 arbiter
- n8n running on bms-4, stable for 7 days
- bms-3 has more than 15 GB of free RAM with MongoDB running
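The port 27017 criterion needs to be checked from a machine that is not on the firewall allow list, since a check from a replica set member would be permitted by design. This is a minimal sketch using nc in scan mode (-z) with a timeout (-w). The names port_open and check_mongo_exposure are hypothetical; the IPs in the usage line are the server addresses listed in section 8.

```shell
# Succeeds when a TCP connection to HOST:PORT can be opened within 3 seconds.
port_open() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Report whether MongoDB's port is reachable on each given host.
# "closed" covers both refused and filtered (timed out) connections.
check_mongo_exposure() {
  for host in "$@"; do
    if port_open "$host" 27017; then
      echo "EXPOSED $host"
    else
      echo "closed $host"
    fi
  done
}
```

Usage: check_mongo_exposure 145.239.133.104 51.68.155.224 54.36.123.110. Any EXPOSED line means the criterion is not yet met for that server.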
Phase 3 Complete
- fail2ban active on all servers (bms-2, bms-3, bms-4; bms-1 new server)
- unattended-upgrades enabled on all servers
- Trivy scan complete; no unmitigated CRITICAL CVEs in production containers
- claude-admin user set up on bms-2, bms-3, bms-4
- All P3 issues resolved
Phase 4 Complete
- bms-1 running on Ubuntu 24.04 (new server)
- bms-3 running on Ubuntu 24.04 or upgrade scheduled with written plan
- bms-4 running on Ubuntu 24.04 or upgrade scheduled with written plan
- bms-2 Claude Code installed and AI-Dev-OV1 operational
- All BMS servers on supported OS versions
- No server within 6 months of OS EOL without a migration plan
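The six-month rule in the last criterion is simple date arithmetic. This is a minimal sketch in portable shell, using dates stated in this plan as example inputs (plan date June 2026, Ubuntu 22.04 EOL April 2027, Ubuntu 20.04 EOL April 2025). The names months_between and needs_migration_plan are hypothetical.

```shell
# Whole months from the first YYYY-MM to the second (negative when the
# second date is earlier). Leading zeros are stripped so that 08 and 09
# are not misread as octal.
months_between() {
  y1=${1%-*}; m1=${1#*-}; m1=${m1#0}
  y2=${2%-*}; m2=${2#*-}; m2=${m2#0}
  echo $(( (y2 - y1) * 12 + (m2 - m1) ))
}

# Succeeds when EOL is fewer than 6 months after TODAY, or already past.
needs_migration_plan() {
  [ "$(months_between "$1" "$2")" -lt 6 ]
}
```

With the plan's dates, months_between 2026-06 2027-04 prints 10, so bms-3 and bms-4 are outside the window today and enter it in November 2026. For bms-1, months_between 2026-06 2025-04 prints -14, which matches the 14 months without patches noted in the executive summary.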
Overall Modernization Complete
- infra_docs_check audit action passes for all 4 BMS servers
- All servers have compliance_workbook = 'yes' in dev_r_services
- Zero P1 or P2 issues open in the priority issues register above
Appendix: Effort Summary
| Phase | Estimated effort | Key dependencies | Human action required |
|---|---|---|---|
| Phase 1 | 2 days | None | Pinbox24 maintenance window coordination |
| Phase 2 | 3 days | Phase 1 complete | MongoDB admin password for rs.addArb |
| Phase 3 | 2 days | Phase 2 complete | None |
| Phase 4 | 3–5 days | Phase 3 complete; Phase 1 Task 1.4 design approved | DNS cut-over decision for bms-1 migration |
| Total | 10–12 days | | |
Appendix: Related Documents
- docs/servers/p4-ovh-bms-1-ns367522-operations.md — bms-1 operations workbook
- docs/servers/p4-ovh-bms-2-ns3087638-operations.md — bms-2 operations workbook
- docs/servers/p4-ovh-bms-3-ns3129867-operations.md — bms-3 operations workbook
- docs/servers/p4-ovh-bms-4-ns3101999-operations.md — bms-4 operations workbook
- docs/improvements/01-backups.md — backup spec (applies to BMS servers)
- docs/improvements/08-image-cve-scanning.md — CVE scanning spec
- docs/improvements/09-ssh-hardening.md — SSH hardening spec (adapt for BMS)
- bms-4/docker-compose.yml — bms-4 Docker Compose configuration