MongoDB rs0 — Operations Guide

Replica set: rs0
Members: 3 (bms-2 PRIMARY · bms-3 SECONDARY · bms-4 ARBITER)
Version: MongoDB 7.0
Data: Pinbox24 SaaS production + staging databases
Last verified: 2026-06-17


CRITICAL — Backup Gap

No automated backup exists for MongoDB rs0. The last known manual dumps were taken on bms-1 roughly four months ago (w3-2026-02-05 and w4-2026-02-23/24, in /root on bms-1). This is the highest-priority infrastructure gap on the platform.

Until automated backups are implemented, any data-loss event (disk failure, accidental drop, ransomware) will result in irreversible production data loss.

Action required: see 01-backups.md for the planned solution.
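
The age of the newest dump can be checked mechanically, for example from a monitoring job. The sketch below is illustrative only: the function names and the one-day staleness threshold are assumptions, not existing tooling. The dates are the ones quoted above.

```javascript
// Hypothetical helper: whole days between two ISO dates (YYYY-MM-DD, parsed as UTC).
function daysBetween(fromIso, toIso) {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.floor((Date.parse(toIso) - Date.parse(fromIso)) / msPerDay);
}

// Assumed policy for illustration: a backup older than maxAgeDays counts as stale.
function isBackupStale(lastBackupIso, todayIso, maxAgeDays = 1) {
  return daysBetween(lastBackupIso, todayIso) > maxAgeDays;
}

// Newest known dump (2026-02-24) versus the "last verified" date of this guide.
const ageDays = daysBetween("2026-02-24", "2026-06-17");
const stale = isBackupStale("2026-02-24", "2026-06-17");
console.log(`newest dump is ${ageDays} days old, stale: ${stale}`);
```

With the dates above this reports an age of 113 days.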


Architecture

                    ┌─────────────────────────────────────────┐
                    │           rs0 replica set                │
                    │                                          │
        ┌───────────┤  bms-2 — ns3087638                      │
        │  PRIMARY  │  145.239.133.104:27017                   │
        │  priority=1, votes=1                                 │
        │  Ubuntu 24.04 · 32 GB RAM                           │
        └───────────┤                                          │
                    │                                          │
        ┌───────────┤  bms-3 — ns3129867                      │
        │ SECONDARY │  51.68.155.224:27017                     │
        │  priority=1, votes=1                                 │
        │  Ubuntu 22.04 · 32 GB RAM                           │
        └───────────┤  ⚠️ ~21.7 GB RAM used by MongoDB        │
                    │                                          │
        ┌───────────┤  bms-4 — ns3101999                      │
        │  ARBITER  │  54.36.123.110:27017                     │
        │  priority=0, votes=1                                 │
        │  Ubuntu 22.04 · no data stored                      │
        └───────────┴─────────────────────────────────────────┘

  Election quorum = 2 votes (out of 3)
  bms-4 arbiter added 2026-06-10 (replaced dead arbiter at 51.83.132.99)

Key properties:

  • bms-4 stores no data — it participates in elections only
  • bms-2 and bms-3 both carry full data copies
  • If bms-2 fails, bms-3 (+ bms-4 arbiter vote) has quorum to elect a new primary automatically
  • If bms-4 fails, bms-2 + bms-3 still have 2 votes and remain operational
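
The quorum reasoning in the properties above can be written out as a small calculation. This is a simplified sketch: the `dataBearing` flag and the function names are illustrative, and member priorities are ignored.

```javascript
// Voting layout taken from the diagram above.
const members = [
  { host: "bms-2", votes: 1, dataBearing: true },
  { host: "bms-3", votes: 1, dataBearing: true },
  { host: "bms-4", votes: 1, dataBearing: false }, // arbiter
];

// Majority of all configured votes: floor(n / 2) + 1.
function majority(members) {
  const total = members.reduce((sum, m) => sum + m.votes, 0);
  return Math.floor(total / 2) + 1;
}

// A primary can exist only if the reachable members hold a majority of the votes
// and at least one of them carries data.
function canHavePrimary(members, downHosts) {
  const up = members.filter(m => !downHosts.includes(m.host));
  const upVotes = up.reduce((sum, m) => sum + m.votes, 0);
  return upVotes >= majority(members) && up.some(m => m.dataBearing);
}

console.log(majority(members));                           // 2
console.log(canHavePrimary(members, ["bms-2"]));          // true
console.log(canHavePrimary(members, ["bms-4"]));          // true
console.log(canHavePrimary(members, ["bms-2", "bms-4"])); // false
```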

SSH Access

Server   Role        SSH Command
bms-2    PRIMARY     ssh ubuntu@145.239.133.104
bms-3    SECONDARY   ssh ubuntu@51.68.155.224
bms-4    ARBITER     ssh root@54.36.123.110

SSH key: C:\Users\konar\.ssh\id_ed25519 (local) — installed on all three servers.

Claude agent on bms-2 and bms-3: ssh claude-admin@<host> (uses VPS_SSH_PRIVATE_KEY).


Day-to-Day Operations

Connect to MongoDB shell

# On any data-bearing member (bms-2 or bms-3):
mongosh --port 27017
 
# With authentication:
mongosh --port 27017 -u <admin-user> -p --authenticationDatabase admin

Credentials: Infisical CE → bms-servers project. Never commit passwords to git.

Check replica set status

// Inside mongosh. Quick status:
rs.status()

// Compact member summary
db.adminCommand({ replSetGetStatus: 1 }).members.forEach(m =>
  print(m.name, m.stateStr, "optime:", m.optimeDate))

// Replication lag of each secondary relative to the PRIMARY
rs.printSecondaryReplicationInfo()

Verify which node is PRIMARY

// From any member (isWritablePrimary is true on the PRIMARY; the "primary" field names it):
db.hello()
// or
db.adminCommand({ hello: 1 })

Check replica lag manually

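// On any member: prints state, lag behind the PRIMARY's optime, and last heartbeat message
const s = rs.status()
const primary = s.members.find(m => m.stateStr === "PRIMARY")
s.members.filter(m => m.stateStr !== "ARBITER").forEach(m => {
  const lagSecs = primary ? (primary.optimeDate - m.optimeDate) / 1000 : "n/a (no primary)"
  print(m.name, m.stateStr, "lag (s):", lagSecs, "lastHeartbeat:", m.lastHeartbeatMessage)
})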

Force step-down of PRIMARY (planned maintenance)

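// Connect to current PRIMARY (bms-2 normally), then:
rs.stepDown(60)  // 60 = seconds the old primary stays ineligible for re-election
// Before stepping down, the primary waits for an electable secondary to catch up
// (second argument, secondaryCatchUpPeriodSecs, default 10 s).

After stepDown, bms-3 will be elected PRIMARY, since it is the only other data-bearing member. Verify with rs.status().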
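
The two step-down timings are easy to confuse, so the sketch below builds the underlying replSetStepDown command document with both named explicitly. The helper name is hypothetical. The check reflects the documented requirement that the step-down period be longer than the catch-up period.

```javascript
// Build the replSetStepDown command document.
// stepDownSecs: how long the old primary stays ineligible for re-election.
// catchUpSecs:  how long it first waits for an electable secondary to catch up.
function buildStepDownCommand(stepDownSecs = 60, catchUpSecs = 10) {
  if (!(stepDownSecs > catchUpSecs)) {
    throw new RangeError("stepDownSecs must be greater than secondaryCatchUpPeriodSecs");
  }
  return { replSetStepDown: stepDownSecs, secondaryCatchUpPeriodSecs: catchUpSecs };
}

const cmd = buildStepDownCommand(60, 10);
console.log(JSON.stringify(cmd));
```

In mongosh, on the PRIMARY, the resulting document would be passed to db.adminCommand(cmd), which is equivalent to rs.stepDown(60, 10).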


Replica Set Member Management

Add a new full voting member

// On current PRIMARY:
rs.add({ host: "<ip>:27017", priority: 1, votes: 1 })

Add an arbiter

// On current PRIMARY:
rs.addArb("<ip>:27017")

Remove a member

// On current PRIMARY:
rs.remove("<ip>:27017")

Re-add bms-4 arbiter after outage

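// If bms-4 was only down, it is still in the replica set config and rejoins on its own
// once mongod is running; no re-add is needed. Check with rs.conf().members.
// Only if it was removed from the config (rs.remove), on PRIMARY (bms-2):
rs.addArb("54.36.123.110:27017")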
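
Adding a host that is already in the config fails, so it helps to check membership first. The helper below is a hypothetical sketch that works on an object shaped like rs.conf() output. The sample config and its _id values are illustrative, not copied from the live set.

```javascript
// Returns true if `host` ("ip:port") is already a member of the given config,
// where `conf` has the shape returned by rs.conf().
function isMember(conf, host) {
  return conf.members.some(m => m.host === host);
}

// Sample config shaped like rs.conf() output; the _id values are illustrative.
const sampleConf = {
  _id: "rs0",
  members: [
    { _id: 0, host: "145.239.133.104:27017" },
    { _id: 1, host: "51.68.155.224:27017" },
    { _id: 2, host: "54.36.123.110:27017", arbiterOnly: true },
  ],
};

console.log(isMember(sampleConf, "54.36.123.110:27017")); // true: no rs.addArb needed
console.log(isMember(sampleConf, "51.83.132.99:27017"));  // false: not in the config
```

In mongosh the same function could guard the call, for example: if (!isMember(rs.conf(), "54.36.123.110:27017")) rs.addArb("54.36.123.110:27017").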

Failover Scenarios

Scenario 1: bms-2 (PRIMARY) fails

  1. bms-3 detects heartbeat loss after ~10s
  2. bms-3 requests election — bms-4 arbiter provides the second vote (quorum = 2)
  3. bms-3 becomes PRIMARY automatically
  4. Application connections fail over to bms-3 (connection string must list multiple hosts)
  5. Verify: ssh ubuntu@51.68.155.224 then mongosh --eval "rs.status()"
  6. When bms-2 comes back: it rejoins as SECONDARY automatically, syncs data
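
Step 4 depends on the application's connection string listing both data-bearing members. A minimal sketch of building one follows. The helper name and the database name `exampledb` are hypothetical, and credentials are deliberately left out (they would be supplied separately at runtime).

```javascript
// Build a replica-set connection string that lists both data-bearing members.
// The arbiter holds no data, so it is left out of the seed list.
function buildReplicaSetUri(hosts, dbName, replicaSet) {
  const params = new URLSearchParams({ replicaSet });
  return `mongodb://${hosts.join(",")}/${encodeURIComponent(dbName)}?${params}`;
}

const uri = buildReplicaSetUri(
  ["145.239.133.104:27017", "51.68.155.224:27017"],
  "exampledb", // hypothetical database name
  "rs0"
);
console.log(uri);
```

With a seed list like this, a driver discovers the current PRIMARY itself and follows it after an election.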

Scenario 2: bms-3 (SECONDARY) fails

  1. rs0 continues operating — bms-2 (PRIMARY) + bms-4 (arbiter) = 2 votes, quorum maintained
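  2. No automatic promotion needed. Note that writes using w: "majority" cannot be acknowledged while bms-3 is down, because the arbiter holds no data
  3. Monitor: when bms-3 returns it will lag and must catch up from the oplog; if it was down longer than the oplog window, it needs a full resync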
  4. Verify: rs.status() on bms-2 — bms-3 state shows UNREACHABLE or DOWN

Scenario 3: bms-4 (ARBITER) fails

  1. rs0 still has 2 votes: bms-2 (1) + bms-3 (1) = quorum maintained
  2. No immediate impact on reads or writes
  3. Risk: if bms-2 also fails while bms-4 is down, bms-3 cannot reach quorum (1 vote) and rs0 goes read-only
  4. Restore bms-4 promptly. It rejoins on its own if it is still in the config; re-add it with rs.addArb("54.36.123.110:27017") only if it was removed

Scenario 4: Split network / no quorum

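If fewer than 2 votes are reachable, no PRIMARY can be elected: all reachable data-bearing members stay SECONDARY and rs0 accepts no writes (reads still work with a secondary read preference). Recovery requires restoring the network or, as a last resort, reconfiguring rs from a surviving data-bearing member with rs.reconfig(..., {force: true}).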
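
For the last-resort forced reconfiguration, the new config is usually the current one reduced to the members that are still reachable. The sketch below shows only that filtering step. The helper name and the sample config are hypothetical, and a forced reconfig can cause rollbacks, so treat this as an illustration rather than a procedure.

```javascript
// Given a config shaped like rs.conf() output, keep only the members that are
// still reachable. Member _id values are preserved, and the input is not modified.
function buildSurvivorConfig(conf, survivingHosts) {
  const members = conf.members.filter(m => survivingHosts.includes(m.host));
  if (!members.some(m => !m.arbiterOnly)) {
    throw new Error("refusing: no data-bearing member would remain");
  }
  return { ...conf, members };
}

// Sample config shaped like rs.conf() output; the _id values are illustrative.
const currentConf = {
  _id: "rs0",
  members: [
    { _id: 0, host: "145.239.133.104:27017" },
    { _id: 1, host: "51.68.155.224:27017" },
    { _id: 2, host: "54.36.123.110:27017", arbiterOnly: true },
  ],
};

const forcedConf = buildSurvivorConfig(currentConf, ["51.68.155.224:27017"]);
console.log(forcedConf.members.map(m => m.host));
```

The result is the document that would be passed as the first argument of rs.reconfig with {force: true} on the surviving member.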


Maintenance — Rolling Restart

Rolling restart allows MongoDB upgrades or config changes with near-zero downtime: writes pause only for the few seconds of the election triggered by the step-down.

1. Restart SECONDARY (bms-3) first:
   ssh ubuntu@51.68.155.224
   sudo systemctl restart mongod
   # Wait for state to return to SECONDARY: rs.status()

2. Step down PRIMARY (bms-2):
   # On bms-2 primary:
   mongosh --eval "rs.stepDown(60)"
   # bms-3 becomes PRIMARY

3. Restart old primary (now SECONDARY, bms-2):
   ssh ubuntu@145.239.133.104
   sudo systemctl restart mongod
   # Verify it rejoins as SECONDARY: rs.status()

4. (Optional) Transfer primary back to bms-2:
   # On bms-3 (current primary), once bms-2 has caught up:
   mongosh --eval "rs.stepDown(60)"
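
The ordering rule behind these steps (non-primaries first, the PRIMARY last and only after a step-down) can be derived from rs.status() output. The helper below is a hypothetical sketch. It includes the arbiter for completeness, although the steps above restart only the data-bearing members.

```javascript
// Order members for a rolling restart: secondaries and arbiters first, PRIMARY last.
// `status` has the shape of rs.status() output (only name and stateStr are used).
function rollingRestartPlan(status) {
  const isPrimary = m => m.stateStr === "PRIMARY";
  return [...status.members]
    .sort((a, b) => Number(isPrimary(a)) - Number(isPrimary(b)))
    .map(m => ({ host: m.name, stepDownFirst: isPrimary(m) }));
}

const sampleStatus = {
  members: [
    { name: "145.239.133.104:27017", stateStr: "PRIMARY" },
    { name: "51.68.155.224:27017", stateStr: "SECONDARY" },
    { name: "54.36.123.110:27017", stateStr: "ARBITER" },
  ],
};

for (const step of rollingRestartPlan(sampleStatus)) {
  console.log(step.host, step.stepDownFirst ? "(step down, then restart)" : "(restart)");
}
```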

KeyFile Rotation

The replica set uses a shared keyFile for inter-member authentication. The keyFile is stored at /etc/mongodb-keyfile on each member.

Never commit the keyFile content to git. Store the value in Infisical bms-servers project.

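Rolling rotation procedure. Members can only authenticate to each other while they share a key, so the rotation goes through a transitional keyFile that holds both keys (MongoDB 4.2+ accepts several keys in one keyFile, written as a YAML sequence):

  1. Generate new keyFile content (random bytes, base64-encoded)
  2. Store in Infisical bms-servers project → MONGO_REPLICA_SET_KEY
  3. Place a transitional keyFile containing both the old and the new key on bms-3 (SECONDARY), restart mongod, verify it rejoins
  4. Place it on bms-4 (ARBITER), restart mongod, verify it rejoins
  5. Place it on bms-2 (PRIMARY), step down, restart mongod, verify it rejoins
  6. Repeat steps 3 to 5 with a keyFile containing only the new key
  7. Update rotation log in docs/secrets-rotation-log.md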
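
Step 1 can be done with any source of cryptographically secure random bytes. The sketch below uses Node's built-in crypto module. The function name is hypothetical, and the 756-byte size is a choice that yields a key within the 6 to 1024 character limit MongoDB places on keys.

```javascript
const crypto = require("node:crypto");

// 756 random bytes encode to exactly 1008 base64 characters (no padding),
// which fits the 6 to 1024 character range MongoDB accepts for a key.
function generateKey() {
  return crypto.randomBytes(756).toString("base64");
}

const key = generateKey();
// Print only the length; the key itself should never appear in logs or terminals.
console.log(`generated key of ${key.length} characters`);
```

On each member the keyFile must be readable only by the user that runs mongod (for example mode 400), otherwise mongod refuses to start.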

Monitoring

Exporter           Host    Port   Metrics scraped
node_exporter      bms-2   9100   Host CPU/RAM/disk metrics
node_exporter      bms-3   9100   Host CPU/RAM/disk metrics
mongodb_exporter   bms-2   9216   MongoDB internals (opcounters, connections, replication lag)
mongodb_exporter   bms-3   9217   MongoDB internals

Prometheus on vps-i1 scrapes both exporters on each host. Grafana dashboard: MongoDB rs0 Overview.

Alerts configured:

  • MongoDBReplicationLag > 30s → warning
  • MongoDBMemberDown → critical
  • MongoDBNoPrimary → critical (P1)
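
The logic behind the three alerts can be illustrated against rs.status() output. This is a sketch of the conditions only, not the actual Prometheus rule expressions. The function name and the sample data are hypothetical.

```javascript
const LAG_WARNING_SECS = 30; // threshold of the MongoDBReplicationLag alert above

// `status` has the shape of rs.status() output. Returns the alerts that would fire.
function evaluateAlerts(status) {
  const alerts = [];
  const primary = status.members.find(m => m.stateStr === "PRIMARY");
  if (!primary) alerts.push({ alert: "MongoDBNoPrimary", severity: "critical" });
  for (const m of status.members) {
    if (m.health !== 1) {
      alerts.push({ alert: "MongoDBMemberDown", severity: "critical", member: m.name });
    } else if (primary && m.stateStr === "SECONDARY") {
      const lagSecs = (primary.optimeDate - m.optimeDate) / 1000;
      if (lagSecs > LAG_WARNING_SECS) {
        alerts.push({ alert: "MongoDBReplicationLag", severity: "warning", member: m.name });
      }
    }
  }
  return alerts;
}

// Illustrative sample: the secondary is 45 s behind the primary.
const laggingStatus = {
  members: [
    { name: "bms-2", stateStr: "PRIMARY", health: 1, optimeDate: new Date("2026-06-17T12:00:45Z") },
    { name: "bms-3", stateStr: "SECONDARY", health: 1, optimeDate: new Date("2026-06-17T12:00:00Z") },
    { name: "bms-4", stateStr: "ARBITER", health: 1 },
  ],
};
console.log(evaluateAlerts(laggingStatus));
```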

Password and Secret Management

Secret                        Location
MongoDB admin password        Infisical bms-servers project → MONGO_ADMIN_PASSWORD
MongoDB keyFile               Infisical bms-servers project → MONGO_REPLICA_SET_KEY
bms-2 root/ubuntu password    Infisical bms-servers project
bms-3 root/ubuntu password    Infisical bms-servers project

Never hardcode or display these values. Fetch from Infisical at runtime.
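
A minimal sketch of the fetch-at-runtime rule, assuming the secrets are injected into the process as environment variables (how Infisical delivers them is not specified here, so that part is an assumption). The helper name is hypothetical. The error names the missing variable but never includes a value.

```javascript
// Read a secret from the environment and fail fast if it is missing.
// The error message names the variable but never contains a secret value.
function requireSecret(env, name) {
  const value = env[name];
  if (!value) throw new Error(`missing secret: ${name}`);
  return value;
}

// Demo with a stand-in environment object; real code would pass process.env.
const demoEnv = { MONGO_ADMIN_PASSWORD: "placeholder-not-a-real-password" };
const password = requireSecret(demoEnv, "MONGO_ADMIN_PASSWORD");
console.log(`MONGO_ADMIN_PASSWORD loaded (${password.length} characters)`);
```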