Service Distribution Evaluation — p24-infra
Date: 2026-06-14
Scope: All servers, SaaS, and services in the p24-infra ecosystem
Status: Current state as of branch fix/n8n-2.26.3-compose-cleanup; n8n migration to bms-4 is pending.
1. Executive Summary
Key findings
- vps-h1 is critically overloaded for its size (2 vCPU / 7.8 GB RAM). With n8n capped at 1.5 CPU and Traefik, PostgreSQL, WAHA, cadvisor, and promtail all running alongside it, the host sits at ~90–100% of its CPU capacity during bursts, with an estimated 5–6 GB of RAM consumed at peak. A single traffic spike or scheduled workflow burst can cause container restarts or OOM events. This is the highest-priority risk in the entire infrastructure.
- bms-4 (54.36.123.110) has enormous headroom: 8 vCPU, 32 GB RAM, and 1.8 TB disk, with only the MongoDB arbiter (~75 MB RAM) and node_exporter running. It is the correct destination for n8n and new services.
- pdf-service currently serves only internal tooling (Claude agents, audit-engine). It runs on vps-i1 and is not accessible to Pinbox24 production on bms-1. For Pinbox24 to use it, a separate instance must be deployed; bms-4 is the recommended location.
- The Convertio.ai replacement (PDF-to-JPG) should be co-located with the PDF generation service on bms-4. Because Gotenberg does not rasterise PDFs to images directly, a thin pdf2image (poppler) or ImageMagick microservice running next to it is the lowest-friction option (see §5).
- bms-2 (AI-Dev-OV1) and bms-3 are both underutilised for Docker workloads, but bms-3 is constrained by MongoDB RAM (21.7 GB). Neither should receive additional critical services at this time.
- WAHA is a critical service (the WhatsApp incident management feed for the operations team). After n8n moves to bms-4, WAHA should follow; vps-h1 would then carry no critical workloads and could be terminated to reduce cost.
Top recommendations (priority order)
| Priority | Action | Why |
|---|---|---|
| P1 | Migrate n8n from vps-h1 → bms-4 | vps-h1 is overloaded; bms-4 compose file is ready |
| P2 | Migrate WAHA from vps-h1 → bms-4 | After n8n moves, WAHA is the only remaining critical service |
| P3 | Deploy PDF generation + PDF-to-JPG on bms-4 | Pinbox24 production needs it; bms-4 has capacity |
| P4 | Install AI-Dev-BMS4 Claude agent on bms-4 | Infrastructure management agent for the new server |
| P5 | Decommission vps-h1 | Once n8n and WAHA are off, no remaining workloads justify ~10€/month |
| P6 | Add node_exporter + Prometheus scrape for bms-2 and bms-3 | Monitoring gap — neither is currently scraped |
2. Complete Service Inventory
2.1 vps-i1 — IONOS (217.154.82.162) — AlmaLinux 9.7 — 6 vCPU / 7.4 GB RAM / 239 GB
| Service | Type | Image / Source | RAM (est.) | CPU (est.) | Port(s) | Criticality |
|---|---|---|---|---|---|---|
| caddy | Docker | caddy:2.11.4-alpine | 50 MB | low | 80, 443 | Critical — TLS proxy for all infra services |
| prometheus | Docker | prom/prometheus:v3.12.0 | 300 MB | low | 127.0.0.1:9090 | High — metrics collection |
| thanos-sidecar | Docker | quay.io/thanos/thanos:v0.41.0 | 100 MB | low | 10901, 10902 | High — S3 block upload |
| thanos-query | Docker | quay.io/thanos/thanos:v0.41.0 | 100 MB | low | 127.0.0.1:10904 | Medium — unified PromQL |
| alertmanager | Docker | prom/alertmanager:v0.33.0 | 50 MB | minimal | 127.0.0.1:9093 | High — email alerts on failures |
| grafana | Docker | grafana/grafana:11.6.15 | 200 MB | low | 127.0.0.1:3000 | Medium — dashboards |
| renderer | Docker | grafana/grafana-image-renderer:v5.8.9 | 300 MB | burst | 127.0.0.1:8081 | Low — PNG for n8n daily report |
| queue-exporter | Docker (custom Python) | local build | 80 MB | minimal | 127.0.0.1:9200 | Medium — Supabase queue depths |
| cost-exporter | Docker (custom Python) | local build | 80 MB | minimal | 127.0.0.1:9210 | Low — billing tracking |
| backup-exporter | Docker (custom Python) | local build | 80 MB | minimal | 127.0.0.1:9220 | Low — backup status |
| pg-stats-exporter | Docker (custom Python) | local build | 80 MB | minimal | 127.0.0.1:9201 | Medium — slow query tracking |
| vercel-exporter | Docker (custom Python) | local build | 80 MB | minimal | 127.0.0.1:9202 | Low — Vercel metrics |
| n8n-cloud-exporter | Docker (custom Python) | local build | 80 MB | minimal | 127.0.0.1:9225 | Low — n8n.io cloud metrics |
| credential-exporter | Docker (custom Python) | local build | 80 MB | minimal | 127.0.0.1:9230 | Medium — rotation age tracking |
| blackbox-exporter | Docker | prom/blackbox-exporter:v0.28.0 | 50 MB | minimal | 127.0.0.1:9115 | Medium — endpoint probing |
| gotenberg | Docker | gotenberg/gotenberg:8.34.0 | 400 MB | burst | internal | High — PDF rendering engine |
| pdf-service | Docker (custom Python) | infra-src/gotenberg/pdf-service | 100 MB | low | 127.0.0.1:8100 | High — PDF API for agents/audit |
| p24-infra-mcp | Docker (custom Python) | infra-src/p24-infra-mcp | 80 MB | low | 127.0.0.1:8101 | High — MCP server for Claude agents |
| audit-engine | Docker (custom Python) | ../audit-engine | 150 MB | burst | 127.0.0.1:8200 | Medium — AI audit pipeline |
| uptime-kuma | Docker | louislam/uptime-kuma:1.23.16 | 200 MB | low | 127.0.0.1:3001 | Medium — endpoint uptime monitoring |
| loki | Docker | grafana/loki:3.7.2 | 200 MB | low | 127.0.0.1:3100 | Medium — log aggregation |
| promtail | Docker | grafana/promtail:3.6.11 | 50 MB | minimal | — | Low — log shipping from vps-i1 |
| traccar | Docker | traccar/traccar:latest | 500 MB | low | 8082, 5027/UDP | High — GPS fleet tracking |
| traccar-db | Docker | MySQL 8.0 | 400 MB | low | internal | High — Traccar data persistence |
| openclaw-gateway | Docker | custom | 200 MB | low | 18789-18790 | High — WhatsApp issue intake |
| node_exporter | systemd | system package | 20 MB | minimal | 9100 | Medium |
| GitHub Actions runner (ionos) | systemd | /opt/actions-runner | 200 MB | burst | — | High — CI/CD for et-operational-platform |
| GitHub Actions runner (KDP) | systemd | /opt/actions-runner-kdp | 200 MB | burst | — | Medium — CI for amazon-kdp-tango |
| claude-proxy.py | systemd (python3) | claude-proxy.py | 50 MB | minimal | 8765 | Medium — OpenClaw Claude proxy |
| claude-runner (agent) | user process | /usr/bin/claude | 300 MB | burst | — | Medium — autonomous agent (nightly) |
Estimated total: ~4.5–5 GB RAM, peaks to ~6 GB with Gotenberg/rendering bursts. Disk: ~239 GB — Prometheus data + Docker images are the main consumers. Needs monitoring.
2.2 vps-h1 — Hostinger (72.60.32.61) — Ubuntu 24.04 — 2 vCPU / 7.8 GB RAM / 96 GB
| Service | Type | Image | RAM (est.) | CPU limit | Port(s) | Criticality |
|---|---|---|---|---|---|---|
| traefik | Docker | traefik:v3.7.5 | 80 MB | none | 80, 443 | Critical — TLS proxy |
| n8n-postgres | Docker | postgres:16.9-alpine | 300 MB | none | internal | Critical — n8n DB |
| n8n | Docker | docker.n8n.io/n8nio/n8n:2.26.3 | 1.2 GB | 1.5 CPU | 127.0.0.1:5678 | Critical — automation engine |
| waha | Docker | devlikeapro/waha:noweb-2026.5.1 | 800 MB | none | 127.0.0.1:13000 | Critical — WhatsApp incidents |
| node-exporter | Docker | prom/node-exporter:v1.11.1 | 20 MB | none | host:9100 | Low |
| cadvisor | Docker | ghcr.io/google/cadvisor:v0.57.0 | 100 MB | none | 0.0.0.0:8080 | Low |
| promtail | Docker | grafana/promtail:3.6.11 | 50 MB | none | 9080 | Low |
| claude-runner (agent) | user process | /usr/bin/claude | 300 MB | none | — | Medium — nightly automation |
| GitHub Actions runner (hstgr) | systemd | /opt/actions-runner-hstgr | 200 MB | none | — | Medium — CI for et-operational-platform |
Estimated total: ~3.1 GB RAM baseline, peaks to 5–6 GB during n8n workflow bursts. WAHA alone uses ~800 MB. With n8n at 1.2 GB, these two services consume 2 GB of the 7.8 GB available. CPU is the critical constraint: 1.5 CPU cap on n8n leaves only 0.5 CPU for everything else — traefik, waha, cadvisor, node-exporter, OS.
2.3 bms-1 — OVH Kimsufi (94.23.26.113) — Ubuntu 20.04 EOL — 8 cores / 32 GB RAM
| Service | Type | RAM (est.) | Criticality |
|---|---|---|---|
| nginx-proxy + letsencrypt | Docker | 200 MB | Critical — TLS for all Pinbox24 services |
| portainer | Docker | 200 MB | Low — management UI |
| Pinbox24 v31-v42 (4 instances) | Docker (ECR) | ~4 GB total | Critical — production app |
| mailgun container | Docker | 100 MB | High — transactional email relay |
| pdf-gen (wkhtml) | Docker | 300 MB | High — internal PDF gen for Pinbox24 |
| git-deploy | Docker | 100 MB | Medium — deployment automation |
Known critical issue: disk 100% full. Pinbox24 production may fail on next container update. OS: Ubuntu 20.04 LTS — EOL April 2025. Security risk. No migration plan yet. No Prometheus monitoring.
2.4 bms-2 — OVH Kimsufi (145.239.133.104) — Ubuntu 24.04 — 8 vCPU / 32 GB RAM / 410 GB
| Service | Type | RAM (est.) | Criticality |
|---|---|---|---|
| mongod (rs0 observer, non-voting) | systemd | 2–4 GB | High — replica set read replica |
| claude-runner (AI-Dev-OV1 agent) | user process | 300 MB per agent | Medium — max 4 parallel |
Disk: 16% used (62/410 GB). Ample headroom. No Docker services deployed. No monitoring (node_exporter not installed).
2.5 bms-3 — OVH Kimsufi (51.68.155.224) — Ubuntu 22.04 — 8 vCPU / 32 GB RAM / 410 GB
| Service | Type | RAM (est.) | Criticality |
|---|---|---|---|
| mongod (rs0 PRIMARY) | systemd | 21.7 GB | Critical — Pinbox24 production DB |
| Pinbox24 v31-v42 staging (4 instances) | Docker (ECR) | ~3 GB | Medium — staging |
| traccar | Docker | 500 MB | Medium — GPS tracking (staging) |
| nginx-proxy + letsencrypt | Docker | 200 MB | Medium — TLS for staging |
| portainer-pinbox24 | Docker | 200 MB | Low |
| mt5 | Docker | 500 MB | Low |
WARNING: MongoDB at 21.7 GB RAM leaves only ~10 GB for all Docker workloads. At risk of OOM if staging load spikes. Disk: 44% used (170/410 GB). Monitor — staging logs can fill quickly. No Prometheus monitoring (node_exporter not installed).
2.6 bms-4 — OVH Kimsufi (54.36.123.110) — Ubuntu 22.04 — 8 vCPU / 32 GB RAM / 1.8 TB
| Service | Type | RAM (est.) | Status |
|---|---|---|---|
| mongod (rs0 arbiter) | systemd | ~75 MB | Active — awaiting rs.addArb() by human |
| prometheus-node-exporter | systemd | 20 MB | Active |
| traefik | Docker (planned) | 80 MB | In repo, not yet deployed |
| n8n-postgres | Docker (planned) | 300 MB | In repo, not yet deployed |
| n8n | Docker (planned) | 1.2 GB | Migration pending |
| node-exporter | Docker (planned) | 20 MB | In repo |
| cadvisor | Docker (planned) | 100 MB | In repo |
Disk: 1.7 TB free (8.3 GB / 1.8 TB used). No disk pressure is expected from the planned workloads. Free RAM after planned services: ~30 GB, which leaves massive headroom for additional workloads.
2.7 SaaS Services
| Service | Provider | Plan | Role | Criticality |
|---|---|---|---|---|
| Supabase | Supabase | Pro | Fleet management DB, audit engine, DevOps tables | Critical |
| Vercel | Vercel | Team | et-operational-platform, p24-nextjs-v2026, portal | Critical |
| Cloudflare | Cloudflare | Free/Pro | DNS (zintegrowana.online), WAF for Workers | Critical |
| Wasabi S3 | Wasabi | Pay-as-you-go | Thanos metrics, PDFs, backup status JSON | High |
| GitHub | GitHub | Pro | Source control, CI/CD, issue tracking | Critical |
| Mailgun EU | Mailgun | Pay-as-you-go | Alertmanager email, Pinbox24 transactional | High |
| n8n.io Cloud | n8n.io | Paid | Separate automation instance (backups etc.) | Medium |
| Convertio.ai | Convertio | SaaS | PDF-to-JPG for Pinbox24 Angular | Scheduled for replacement |
3. vps-h1 Load Analysis
Current state
| Resource | Total | n8n | WAHA | Traefik+PG | Observability | OS+misc | Headroom |
|---|---|---|---|---|---|---|---|
| RAM | 7.8 GB | ~1.2 GB | ~800 MB | ~380 MB | ~170 MB | ~400 MB | ~2.8 GB (at the ~5 GB burst peak) |
| CPU | 2.0 vCPU | cap 1.5 | ~0.1–0.3 | ~0.1 | ~0.05 | ~0.1 | ~0 (negative on burst) |
| Disk | 96 GB | ~5–10 GB | ~2 GB | ~1 GB | minimal | OS | ~75 GB |
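The CPU row can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the estimates from the table above, expressed in hundredths of a vCPU to avoid floating-point noise; the variable names are illustrative and not part of any existing tooling.

```python
# CPU demand on vps-h1 in hundredths of a vCPU, taken from the table above.
TOTAL_CPU = 200  # 2.0 vCPU

# (low, high) estimates per consumer; n8n is counted at its 1.5 CPU cap.
demand = {
    "n8n (cap)": (150, 150),
    "waha": (10, 30),
    "traefik + postgres": (10, 10),
    "observability": (5, 5),
    "os + misc": (10, 10),
}

low = sum(lo for lo, _ in demand.values())
high = sum(hi for _, hi in demand.values())

# Headroom while n8n is pegged at its cap.
best_case_headroom = TOTAL_CPU - low    # WAHA quiet
worst_case_headroom = TOTAL_CPU - high  # WAHA busy

print(f"demand: {low / 100:.2f}-{high / 100:.2f} vCPU")
print(f"headroom: {worst_case_headroom / 100:+.2f} to {best_case_headroom / 100:+.2f} vCPU")
```

With n8n at its cap, the estimates leave between 0.15 vCPU spare and a 0.05 vCPU deficit, which is what the "~0 (negative on burst)" entry summarises.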
Risk assessment
Severity: HIGH — near capacity on CPU, moderate on RAM
- CPU is the primary constraint. n8n has a hard cap at 1.5 CPU, which leaves only 0.5 CPU for the remaining 6 containers plus the OS kernel. When n8n executes multiple workflows in parallel (GPS sync, WhatsApp routing, daily report generation), it pegs the 1.5 CPU limit while everything else starves.
- WAHA is a critical single point of failure. The WhatsApp integration is the primary incident reporting channel. If vps-h1 crashes or OOMs, incident reports from WhatsApp stop flowing to n8n and Supabase. WAHA does not have automatic reconnect to WhatsApp Web; phone number re-pairing requires human action (scanning a QR code).
- n8n-postgres has no CPU cap. PostgreSQL can occasionally spike CPU during autovacuum or complex queries from n8n’s execution history pruning. This can push total CPU demand above the 2.0 vCPU available.
- No redundancy. All critical automation runs on a single 2-vCPU machine with no failover.
- The 96 GB disk is limited for a machine that also runs the GitHub Actions runner, which builds et-operational-platform.
What should move first
n8n (with n8n-postgres) is the highest priority because:
- The compose file for bms-4 is already written and committed
- n8n is capped at 1.5 CPU, so removing it frees up to 75% of vps-h1’s CPU
- The migration checklist is documented in `docs/servers/p4-ovh-bms-4-ns3101999-operations.md`
WAHA should move second because:
- After n8n leaves, WAHA is the only remaining critical workload on vps-h1
- WAHA’s webhook URL currently points to `n8n.vps-h1.infra.zintegrowana.online`; once n8n moves to bms-4, this config line must change anyway
- The bms-4 compose file already includes Traefik for TLS termination
- Keeping WAHA on vps-h1 alone maintains a second monthly VPS cost for a single container
Post-n8n-migration state on vps-h1
After n8n and n8n-postgres move to bms-4:
| Remaining service | RAM | Critical? |
|---|---|---|
| traefik | 80 MB | Yes (for WAHA TLS) |
| waha | 800 MB | Critical |
| node-exporter | 20 MB | No |
| cadvisor | 100 MB | No |
| promtail | 50 MB | No |
| claude-runner (nightly agent) | 300 MB | Medium |
| GitHub Actions runner (hstgr) | 200 MB | Medium |
Conclusion: the remaining services total ~1.55 GB, so vps-h1 would have ~6.2 GB of RAM and ~1.9 vCPU free, which is massively underutilised for ~10€/month. Recommendation: migrate WAHA to bms-4, reassign the GitHub Actions runner to bms-4, and decommission vps-h1.
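The post-migration figure follows directly from the table of remaining services. A small sketch of that sum, using only the RAM estimates listed above (the dictionary keys are shorthand labels, not container names from the compose file):

```python
# RAM estimates (MB) for services left on vps-h1 after n8n and n8n-postgres move.
TOTAL_RAM_MB = 7800  # 7.8 GB, as listed for vps-h1

remaining_mb = {
    "traefik": 80,
    "waha": 800,
    "node-exporter": 20,
    "cadvisor": 100,
    "promtail": 50,
    "claude-runner": 300,
    "actions-runner-hstgr": 200,
}

used_mb = sum(remaining_mb.values())
free_mb = TOTAL_RAM_MB - used_mb
waha_share = remaining_mb["waha"] / used_mb

print(f"used: {used_mb} MB, free: {free_mb} MB, WAHA share of usage: {waha_share:.0%}")
```

The sum excludes OS overhead (estimated at ~400 MB in the load table), so the truly idle RAM is somewhat lower, but WAHA would still account for roughly half of everything running on the host.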
4. bms-4 Expansion Plan
Current state (post-provisioning)
- MongoDB arbiter: 75 MB RAM
- node_exporter: 20 MB RAM
- Docker CE: installed, no containers running
- Free: ~31.9 GB RAM, 7.5 vCPU, 1.7 TB disk
Phase 1 — n8n migration (immediate, PENDING)
Deploy from bms-4/docker-compose.yml:
| Service | RAM (est.) | CPU | Notes |
|---|---|---|---|
| traefik | 80 MB | low | TLS for all bms-4 services |
| n8n-postgres | 300 MB | low | PostgreSQL 16 for n8n |
| n8n | 1.2 GB | cap 1.5 | Migrated from vps-h1 |
| node-exporter (Docker) | 20 MB | minimal | Duplicate of systemd exporter; needed for cadvisor compat |
| cadvisor | 100 MB | low | Container metrics |
After Phase 1: ~30.2 GB RAM free, ~6 vCPU free.
Phase 2 — WAHA migration (after n8n verified stable)
Add WAHA to bms-4 compose:
| Service | RAM (est.) | Notes |
|---|---|---|
| waha | 800 MB | Requires updating WHATSAPP_HOOK_URL → n8n.bms-4.infra.zintegrowana.online |
After Phase 2: ~29.4 GB RAM free.
Phase 3 — PDF services (new deployment, see §5)
| Service | RAM (est.) | Notes |
|---|---|---|
| gotenberg | 400 MB | Chromium PDF renderer |
| pdf-service-pinbox | 100 MB | New instance — production-facing |
| pdf-to-jpg | 200 MB | New service — replaces Convertio.ai |
After Phase 3: ~28.7 GB RAM free.
Phase 4 — AI-Dev-BMS4 agent (see §6)
| Service | RAM (est.) | Notes |
|---|---|---|
| claude-runner (user process) | 300 MB per agent | Max 4 parallel agents |
| GitHub Actions runner | 200 MB | Relocate from vps-h1 |
After Phase 4: ~27.3 GB RAM free (4 agents running simultaneously, plus the relocated runner).
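The per-phase headroom figures can be reproduced as a running subtraction. The sketch below is illustrative only: it takes the RAM estimates from the phase tables above and treats 1 GB as 1000 MB, which is the rounding those tables appear to use.

```python
# Free RAM on bms-4 after each phase, in MB (1 GB = 1000 MB).
TOTAL_MB = 32_000
baseline_mb = 75 + 20  # MongoDB arbiter + systemd node_exporter

phases = {
    "Phase 1 (n8n stack)": 80 + 300 + 1200 + 20 + 100,
    "Phase 2 (WAHA)": 800,
    "Phase 3 (PDF services)": 400 + 100 + 200,
    "Phase 4 (4 agents + runner)": 4 * 300 + 200,
}

free_mb = TOTAL_MB - baseline_mb
free_after = {}
for name, cost_mb in phases.items():
    free_mb -= cost_mb
    free_after[name] = free_mb
    print(f"{name}: {free_mb / 1000:.1f} GB free")
```

Phase 4 comes out at about 27.3 GB once all four agents and the relocated GitHub Actions runner are counted.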
bms-4 Resource Forecast (fully loaded)
| Category | RAM | CPU | Disk |
|---|---|---|---|
| MongoDB arbiter | 75 MB | negligible | minimal |
| n8n stack | 1.6 GB | 1.5 cap + overhead | 10 GB |
| WAHA | 800 MB | 0.2 | 2 GB |
| PDF services | 700 MB | burst | 5 GB |
| Observability | 240 MB | minimal | 1 GB |
| AI agents (4x) | 1.2 GB | burst | — |
| OS + kernel | ~500 MB | 0.5 | 8 GB |
| TOTAL | ~5.1 GB | ~5 vCPU peak | ~26 GB |
| Remaining headroom | ~26.9 GB | ~3 vCPU idle | ~1.77 TB |
bms-4 can comfortably absorb all planned workloads while retaining over 80% RAM headroom.
5. PDF Services Production Plan
Current state
The existing pdf-service (on vps-i1, port 8100) serves only:
- Claude Code agents via the `p24-infra-mcp` MCP server
- `audit-engine` internal workflows
It is not accessible to Pinbox24 production on bms-1, and the existing instance should remain dedicated to internal tooling.
Requirements for production PDF services
- PDF generation: replace `pdf-gen` (wkhtml container on bms-1) with a modern Gotenberg-based service callable by Pinbox24 Angular
- PDF-to-JPG conversion: replace Convertio.ai (external SaaS); must accept PDF input and return JPG/PNG output; used for document thumbnails in Pinbox24
Option analysis
| Option | PDF Generation | PDF-to-JPG | Network path from bms-1 | Disk pressure on bms-1 |
|---|---|---|---|---|
| A: Deploy on bms-4 | Gotenberg + pdf-service | pdf2image (poppler) or ImageMagick microservice | Cross-server HTTP (LAN-speed OVH internal network) | None |
| B: Deploy on bms-1 | Same | Same | localhost | CRITICAL — disk 100% full |
| C: Deploy on bms-3 | Same | Same | Cross-server HTTP | None |
Option B is eliminated: bms-1 disk is 100% full and OS is EOL. Adding containers there would immediately fail.
Option C is not recommended: bms-3 RAM is at ~25 GB/32 GB (MongoDB alone uses 21.7 GB), leaving only ~10 GB for all staging containers. Adding PDF conversion load on an already memory-constrained server is a stability risk for the MongoDB primary.
Recommendation: Option A — Deploy on bms-4.
Justification:
- bms-4 retains roughly 27 GB of free RAM and 1.7 TB of free disk after all planned services are deployed
- Traefik is already being deployed on bms-4 — PDF services get TLS endpoints for free
- OVH bare metal servers are on the same internal network — cross-server HTTP between bms-1 → bms-4 is low-latency
- bms-4 will also run AI-Dev-BMS4 agent, which can monitor and restart PDF services autonomously
- Separating PDF infrastructure from vps-i1 (monitoring stack) removes a cross-concern dependency
- A dedicated `pdf.bms-4.infra.zintegrowana.online` endpoint can be secured with API key auth (same pattern as existing pdf-service)
Deployment plan for bms-4 PDF services
Service 1: pdf-service-p24 (PDF generation for Pinbox24)
Reuse existing infra-src/gotenberg/pdf-service/ codebase with a new Gotenberg instance.
# Add to bms-4/docker-compose.yml (under services:)
  gotenberg-p24:
    image: gotenberg/gotenberg:8.34.0
    restart: unless-stopped
    command:
      - gotenberg
      - --chromium-disable-javascript=false  # Pinbox24 Angular uses JS-rendered pages
      - --api-timeout=60s
  pdf-service-p24:
    build: ../infra-src/gotenberg/pdf-service
    restart: unless-stopped
    environment:
      - PDF_SERVICE_API_KEY=${PDF_SERVICE_API_KEY}
      - GOTENBERG_URL=http://gotenberg-p24:3000
      - SUPABASE_URL=${SUPABASE_URL}
      - SUPABASE_SERVICE_KEY=${SUPABASE_SERVICE_KEY}
      - WASABI_ACCESS_KEY=${WASABI_ACCESS_KEY}
      - WASABI_SECRET_KEY=${WASABI_SECRET_KEY}
      - WASABI_BUCKET=p24-infra
      - WASABI_ENDPOINT=https://s3.eu-central-2.wasabisys.com
      - WASABI_REGION=eu-central-2
    labels:
      - traefik.enable=true
      - traefik.http.routers.pdf-p24.rule=Host(`pdf.bms-4.infra.zintegrowana.online`)
      - traefik.http.routers.pdf-p24.tls=true
      - traefik.http.routers.pdf-p24.entrypoints=websecure
      - traefik.http.routers.pdf-p24.tls.certresolver=mytlschallenge
Service 2: pdf-to-jpg (Convertio.ai replacement)
Gotenberg supports POST /forms/chromium/convert/url and POST /forms/libreoffice/convert but not direct PDF-to-image rasterisation. The correct approach is a thin Python microservice using pdf2image (wraps pdftoppm) or ImageMagick via Ghostscript.
Recommended implementation: pdf-to-jpg microservice using pdf2image Python library.
- Input: PDF file (multipart upload)
- Output: JPG bytes (single page or ZIP of all pages)
- Auth: same `PDF_SERVICE_API_KEY` pattern
- Base image: `python:3.12-slim` with `poppler-utils` installed
- RAM: ~150–200 MB per conversion, ~50 MB idle
  pdf-to-jpg:
    build: ../infra-src/pdf-to-jpg  # new microservice to be created
    restart: unless-stopped
    environment:
      - PDF_SERVICE_API_KEY=${PDF_SERVICE_API_KEY}
      - DPI=${PDF_TO_JPG_DPI:-150}
      - MAX_FILE_MB=${PDF_TO_JPG_MAX_MB:-20}
    labels:
      - traefik.enable=true
      - traefik.http.routers.pdf-to-jpg.rule=Host(`pdf.bms-4.infra.zintegrowana.online`) && PathPrefix(`/v1/pdf-to-jpg`)
      - traefik.http.routers.pdf-to-jpg.tls=true
      - traefik.http.routers.pdf-to-jpg.entrypoints=websecure
      - traefik.http.routers.pdf-to-jpg.tls.certresolver=mytlschallenge
Pinbox24 migration path:
- Deploy `pdf-service-p24` + `pdf-to-jpg` on bms-4
- Test endpoints from bms-1: `curl -X POST https://pdf.bms-4.infra.zintegrowana.online/v1/md-render ...`
- Update Pinbox24 Angular app config to use new endpoints (remove Convertio.ai API key)
- Remove Convertio.ai subscription once verified
- Remove `pdf-gen` + `wkhtml` containers from bms-1 (helps reclaim the critical disk space)
Implementation note on pdf-to-jpg microservice: The infra-src/pdf-to-jpg/ directory needs to be created with:
- `app.py` (FastAPI, similar pattern to pdf-service)
- `Dockerfile` (python:3.12-slim, installs `poppler-utils` via apt, `pdf2image` via pip)
- `tests/` directory
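As a starting point for the conversion core of that microservice, the sketch below calls `pdftoppm` from poppler-utils directly through the standard library, which is the same tool that pdf2image wraps. It is a hedged illustration, not the planned implementation: the function names, the 60-second timeout, and the temporary file names are assumptions, and the FastAPI routing and API key check are deliberately left out.

```python
import subprocess
import tempfile
from pathlib import Path

DEFAULT_DPI = 150  # mirrors the DPI default in the compose snippet
MAX_FILE_MB = 20   # mirrors the MAX_FILE_MB default


def check_size(pdf_bytes: bytes, max_mb: int = MAX_FILE_MB) -> None:
    """Reject uploads larger than max_mb megabytes."""
    if len(pdf_bytes) > max_mb * 1024 * 1024:
        raise ValueError(f"PDF exceeds {max_mb} MB limit")


def build_pdftoppm_cmd(pdf_path: str, out_prefix: str, dpi: int = DEFAULT_DPI) -> list[str]:
    """Build the pdftoppm command line that renders every page to JPEG."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, out_prefix]


def pdf_to_jpgs(pdf_bytes: bytes, dpi: int = DEFAULT_DPI) -> list[bytes]:
    """Rasterise a PDF and return one JPEG per page, in page order."""
    check_size(pdf_bytes)
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = Path(tmp) / "input.pdf"
        pdf_path.write_bytes(pdf_bytes)
        cmd = build_pdftoppm_cmd(str(pdf_path), str(Path(tmp) / "page"), dpi)
        # Requires poppler-utils on PATH; the timeout value is an assumption.
        subprocess.run(cmd, check=True, capture_output=True, timeout=60)
        # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
        return [p.read_bytes() for p in sorted(Path(tmp).glob("page-*.jpg"))]
```

Working in a temporary directory keeps uploaded documents off persistent disk, which matters here because the main reason for replacing Convertio.ai is document exposure.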
6. AI-Dev-BMS4 Agent Setup Plan
Overview
bms-4 should have a Claude Code autonomous agent (AI-Dev-BMS4) for:
- Local Docker operations on bms-4 itself
- Managing n8n, WAHA, PDF services
- Running scheduled tasks and issue implementation
User setup
Follow the same pattern as AI-Dev-OV1 on bms-2 and claude-admin on bms-3:
# Run as root on bms-4 (54.36.123.110)
# 1. Create claude-runner user
useradd -m -s /bin/bash claude-runner
mkdir -p /home/claude-runner/workspace
chown claude-runner:claude-runner /home/claude-runner/workspace
# 2. Create claude-admin user for SSH access from GitHub Actions / remote ops
useradd -m -s /bin/bash claude-admin
mkdir -p /home/claude-admin/.ssh
# 3. Install VPS_SSH_PRIVATE_KEY public part for claude-admin
echo "<VPS_SSH_PRIVATE_KEY public part>" > /home/claude-admin/.ssh/authorized_keys
chmod 700 /home/claude-admin/.ssh && chmod 600 /home/claude-admin/.ssh/authorized_keys
chown -R claude-admin:claude-admin /home/claude-admin/.ssh
# 4. Grant scoped sudo to claude-admin, then lock down and validate the file
echo "claude-admin ALL=(ALL) NOPASSWD: /usr/bin/docker, /bin/systemctl, /bin/mkdir, /bin/chown, /bin/cp, /usr/bin/tee" \
  > /etc/sudoers.d/claude-admin
chmod 440 /etc/sudoers.d/claude-admin
visudo -cf /etc/sudoers.d/claude-admin
# 5. Install Claude Code CLI
curl -fsSL https://raw.githubusercontent.com/anthropics/claude-code/refs/heads/main/install.sh | bash
# or: copy from bms-2 if network is slow
# 6. Copy OAuth credentials from local workstation
#    (create the target directory first, copy, then fix ownership)
# mkdir -p /home/claude-runner/.claude
# scp C:\Users\konar\.claude\.credentials.json root@54.36.123.110:/home/claude-runner/.claude/
# chown -R claude-runner:claude-runner /home/claude-runner/.claude
# 7. Clone infra repo
git clone https://github.com/radieu/p24-infra /opt/p24-infra
chown -R claude-runner:claude-runner /opt/p24-infra
# 8. Set up .claude-env for secrets
# Copy from vps-h1 or create fresh; contains GITHUB_TOKEN, PDF_SERVICE_API_KEY, etc.
GitHub setup
- Label: `AI-Dev-BMS4`
- GitHub user: `AI-Dev-BMS4` (to be created using `ai-dev-bms4@zintegrowana.online`)
- Cloudflare email route: Add routing rule for `ai-dev-bms4@zintegrowana.online` → `radieu@gmail.com`
- Repository access: Add as collaborator to `radieu/p24-infra` (write) and `radieu/et-operational-platform` (write)
Resource limits
| Resource | Limit | Rationale |
|---|---|---|
| Parallel agents | max 4 | bms-4 has 8 vCPU, matches AI-Dev-OV1 on bms-2 |
| RAM per agent | ~300 MB (unbounded) | 32 GB RAM — no pressure |
| Disk for workspace | 50 GB | /opt/ on 1.8 TB RAID1 |
Capabilities
- Full `docker` access (via claude-admin scoped sudo)
- `systemctl` for mongod management
- SSH to other servers using `VPS_SSH_PRIVATE_KEY` (via GitHub Actions)
- Access to `p24-infra-mcp` MCP server for PDF operations
7. Service Distribution Matrix
Security risk
| Service | SPoF Risk | Security Risk | Action |
|---|---|---|---|
| WAHA (vps-h1) | HIGH — only 1 node, no failover | Medium — exposed WhatsApp session | Migrate to bms-4; consider session backup |
| n8n (vps-h1) | HIGH — only 1 node | High — holds all automation secrets | Migrate to bms-4 (larger, more stable) |
| Supabase | Low — managed HA | Low — managed service | None |
| Prometheus (vps-i1) | Medium — single node | Low — internal only | Thanos provides remote backup |
| Traccar (vps-i1) | Medium — single node | Low — internal only | Low priority |
| mongod PRIMARY (bms-3) | HIGH — if bms-3 fails, writes stop | Medium — RS election needed | bms-4 arbiter provides election quorum |
| pdf-service (vps-i1) | Medium — mixed with monitoring stack | Low — API key protected | Deploy dedicated instance on bms-4 |
| Convertio.ai | Medium — external SaaS dependency | HIGH — sends Pinbox24 documents to 3rd party | Replace with self-hosted pdf-to-jpg ASAP |
| bms-1 disk | CRITICAL — 100% full | High — EOL OS | Emergency: prune Docker images, plan migration |
| bms-3 RAM | High — MongoDB at 21.7 GB | Low | Monitor for OOM |
| openclaw gateway (vps-i1) | Medium — no failover | Low — HMAC verified | Acceptable |
Single Point of Failure analysis
| Component | SPoF? | Mitigation |
|---|---|---|
| n8n workflows | Yes — single instance | n8n volume backed by PostgreSQL; data survives container restart |
| WAHA WhatsApp session | Yes | Session stored in Docker volume (waha_sessions); requires QR re-pairing if session corrupted |
| MongoDB PRIMARY (bms-3) | Partial: the bms-4 arbiter contributes an election vote | bms-2 observer is non-voting, so promoting it requires a manual replica set reconfiguration; the bms-4 arbiter only adds a vote toward quorum |
| Vercel (et-operational-platform) | No — Vercel HA | None needed |
| Supabase | No — managed HA | None needed |
| PDF generation (vps-i1) | Yes — Gotenberg is single instance | Acceptable for internal tooling; production instance on bms-4 adds redundancy |
| Prometheus metrics | Partial — 15d local + Wasabi via Thanos | Thanos provides durable history |
| GitHub Actions runner | Partial — only 1 ionos runner | hstgr runner on vps-h1 is backup |
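The MongoDB row depends on how replica set elections work: a primary can only be elected when a majority of all voting members is reachable and at least one reachable member is electable (arbiters and priority-0 members are not). The sketch below models that rule. It assumes, for illustration only, that rs0 consists of just the three members named in this document; the real membership should be confirmed with `rs.conf()`, and the `Member` class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    votes: int       # 1 for voting members and arbiters, 0 for non-voting members
    electable: bool  # False for arbiters and priority-0 members


def can_elect_primary(members: list[Member], down: set[str]) -> bool:
    """True if reachable members hold a vote majority and include a candidate."""
    total_votes = sum(m.votes for m in members)
    majority = total_votes // 2 + 1
    up = [m for m in members if m.name not in down]
    reachable_votes = sum(m.votes for m in up)
    has_candidate = any(m.electable and m.votes > 0 for m in up)
    return reachable_votes >= majority and has_candidate


# ASSUMPTION: rs0 contains only the three members named in this document.
rs0 = [
    Member("bms-3", votes=1, electable=True),   # current PRIMARY
    Member("bms-2", votes=0, electable=False),  # non-voting observer
    Member("bms-4", votes=1, electable=False),  # arbiter
]
print(can_elect_primary(rs0, down={"bms-3"}))

# Hypothetical variant: bms-2 reconfigured as a voting, electable secondary.
rs0_voting = [
    Member("bms-3", 1, True),
    Member("bms-2", 1, True),
    Member("bms-4", 1, False),
]
print(can_elect_primary(rs0_voting, down={"bms-3"}))
```

Under the stated assumption, losing bms-3 leaves one vote out of two and no electable member, so no automatic failover happens; the variant with bms-2 as a voting secondary would fail over. This is only a model of the election rule, not a statement about the actual rs0 configuration.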
8. Recommended Target Architecture
Desired final state (after all migrations)
+─────────────────────────────────────────────────────────────────────────────+
| COMPUTE INFRASTRUCTURE |
+─────────────────────────────────────────────────────────────────────────────+
vps-i1 (IONOS) — 6 vCPU / 7.4 GB — MONITORING HUB
├── Caddy (TLS)
├── Prometheus + Thanos + Alertmanager
├── Grafana + Image Renderer
├── Loki + Promtail
├── Uptime Kuma
├── All custom Python exporters (7x)
├── Blackbox exporter
├── pdf-service + p24-infra-mcp (internal tooling only)
├── audit-engine
├── Traccar + MySQL (GPS)
├── OpenClaw WhatsApp gateway
├── GitHub Actions runner (ionos) — CI/CD
└── claude-runner autonomous agent (nightly)
bms-4 (OVH) — 8 vCPU / 32 GB / 1.8 TB — AUTOMATION + PDF HUB
├── MongoDB arbiter (rs0 quorum)
├── Traefik (TLS)
├── n8n + n8n-postgres (migrated from vps-h1) [Phase 1]
├── WAHA WhatsApp gateway (migrated from vps-h1) [Phase 2]
├── gotenberg-p24 + pdf-service-p24 (Pinbox24 PDF gen) [Phase 3]
├── pdf-to-jpg microservice (Convertio.ai replacement) [Phase 3]
├── node-exporter + cadvisor
├── AI-Dev-BMS4 Claude agent (max 4 parallel) [Phase 4]
└── GitHub Actions runner (relocate from vps-h1) [Phase 4]
bms-2 (OVH) — 8 vCPU / 32 GB / 410 GB — CLAUDE DEV ENV
├── MongoDB observer (rs0 non-voting read replica)
└── AI-Dev-OV1 Claude agent (max 4 parallel)
bms-3 (OVH) — 8 vCPU / 32 GB / 410 GB — MONGODB PRIMARY + STAGING
├── MongoDB 7.0 PRIMARY (rs0)
├── Pinbox24 staging (v31/v32/v41/v42)
├── Traccar (GPS staging)
├── nginx-proxy + Let's Encrypt
└── mt5
bms-1 (OVH) — 8 cores / 32 GB — PINBOX24 PRODUCTION [EOL — plan migration]
├── nginx-proxy + Let's Encrypt
├── Pinbox24 v31/v32/v41/v42 (production)
├── mailgun relay
└── git-deploy
NOTE: pdf-gen + wkhtml removable once bms-4 PDF services are live
vps-h1 (Hostinger) — DECOMMISSION after Phase 2
└── (empty — all workloads migrated to bms-4)
+─────────────────────────────────────────────────────────────────────────────+
| SAAS |
+─────────────────────────────────────────────────────────────────────────────+
Supabase — fleet management DB, audit control plane, DevOps tables
Vercel — et-operational-platform (prod + staging), portal, et-lager
Cloudflare — DNS zintegrowana.online, WAF
Wasabi S3 — Thanos blocks, PDFs, backup status
GitHub — source control, CI/CD
Mailgun EU — alerts + transactional email
n8n.io Cloud — secondary automation instance
Migration sequence
Step 1 (IMMEDIATE): rs.addArb("54.36.123.110:27017") + rs.remove dead arbiter
└── HUMAN ACTION — requires MongoDB admin password from bms-3
Step 2 (WEEK 1): Migrate n8n from vps-h1 → bms-4 (compose file ready)
Step 3 (WEEK 1): Update WAHA webhook URL → n8n.bms-4.infra.zintegrowana.online
Step 4 (WEEK 2): Migrate WAHA from vps-h1 → bms-4
Step 5 (WEEK 2): Decommission vps-h1 (cancel subscription) after relocating or retiring the hstgr runner and nightly agent
Step 6 (WEEK 3): Create infra-src/pdf-to-jpg microservice
Step 7 (WEEK 3): Deploy pdf-service-p24 + pdf-to-jpg on bms-4
Step 8 (WEEK 3): Update Pinbox24 Angular to use new PDF endpoints
Step 9 (WEEK 4): Remove Convertio.ai subscription
Step 10 (WEEK 4): Install AI-Dev-BMS4 agent + GitHub user
Step 11 (ONGOING): Monitor bms-1 disk (CRITICAL), plan OS migration Ubuntu 20.04 → 24.04
Step 12 (ONGOING): Install node_exporter on bms-2 and bms-3 for Prometheus coverage
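For Step 12, once node_exporter is installed on bms-2 and bms-3, Prometheus on vps-i1 needs scrape targets for them. One option is file-based service discovery; the sketch below generates a target file in the Prometheus `file_sd` JSON format. The IPs come from the inventory above and 9100 is node_exporter's default port, while the label names and the idea of using `file_sd_configs` at all are assumptions rather than the repository's existing convention.

```python
import json

# node_exporter targets for the two hosts that are not yet scraped.
HOSTS = {
    "bms-2": "145.239.133.104",
    "bms-3": "51.68.155.224",
}
NODE_EXPORTER_PORT = 9100  # node_exporter default


def file_sd_targets(hosts: dict[str, str], port: int) -> list[dict]:
    """Return target groups in Prometheus file_sd JSON format, one per host."""
    return [
        # Label names below are hypothetical; align them with existing jobs.
        {"targets": [f"{ip}:{port}"], "labels": {"instance_name": name, "provider": "ovh"}}
        for name, ip in sorted(hosts.items())
    ]


groups = file_sd_targets(HOSTS, NODE_EXPORTER_PORT)
print(json.dumps(groups, indent=2))
```

The printed JSON can be saved to a file listed under `file_sd_configs` in a scrape job, so that adding a further host later only means extending `HOSTS`.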
Appendix: Open Issues
| Issue | Server | Severity | Action required |
|---|---|---|---|
| bms-1 disk 100% full | bms-1 | CRITICAL | docker system prune -f, identify largest directories, plan migration |
| bms-1 Ubuntu 20.04 EOL | bms-1 | HIGH | Plan in-place upgrade or migration to bms-4-like server |
| WAHA at risk on overloaded vps-h1 | vps-h1 | HIGH | Migrate to bms-4 (Phase 2) |
| MongoDB arbiter not yet added to rs0 | bms-4 | HIGH | Human action: rs.addArb() from bms-3 |
| n8n migration not yet executed | vps-h1/bms-4 | HIGH | Execute migration checklist |
| bms-2 + bms-3 not in Prometheus | bms-2, bms-3 | MEDIUM | Install node_exporter, add scrape targets |
| bms-3 MongoDB RAM at 21.7 GB | bms-3 | MEDIUM | Monitor; alert when the non-MongoDB workloads approach the ~10 GB of RAM that remains |
| n8n bms-4 compose has deprecated N8N_RUNNERS_ENABLED=false | bms-4 | LOW | Remove from bms-4/docker-compose.yml (current branch fixes it on vps-h1) |
| Convertio.ai — external SaaS with document exposure risk | Pinbox24 | MEDIUM | Phase 3: deploy pdf-to-jpg on bms-4 |
| claude-admin not set up on bms-2 and bms-3 | bms-2, bms-3 | LOW | Follow setup instructions in ops workbooks |