Spec 04 — IaC for VPS state (Ansible)

Purpose

scripts/install-ionos.sh and scripts/bootstrap-new-vps.sh are imperative bash. They work for the initial install but offer no drift detection, no idempotency guarantees, and no auditable history of what was installed, configured, or changed on each VPS. Six months from now, a rebuilt VPS will contain only what we remembered to script.

Ansible is the lightest-weight fit: agentless (just SSH), no daemon to maintain, plays read top-to-bottom like docs. We don’t need Terraform because we don’t programmatically provision the VPSes themselves — IONOS and Hostinger control panels handle that.

Depends on spec 03 (secrets) so Ansible can pull decrypted values via sops at run-time rather than carrying its own vault.
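
A minimal sketch of how a role task might read a sops-encrypted file at run time, assuming the community.sops collection is installed and that spec 03 keeps secrets as sops-encrypted files in the repo. The task file location, the secret path, and the destination path are all hypothetical.

```yaml
# roles/monitoring-stack/tasks/secrets.yml (hypothetical location)
- name: Render the stack env file from a sops-encrypted source
  ansible.builtin.copy:
    # The lookup decrypts on the control node at run time.
    # The source path is an assumption, not something spec 03 defines here.
    content: "{{ lookup('community.sops.sops', 'secrets/monitoring.sops.env') }}"
    dest: /opt/monitoring/.env        # assumed destination
    owner: root
    group: root
    mode: "0600"
  no_log: true                        # keep decrypted values out of task output
  tags: [secrets]
```

Because the value is decrypted only in memory on the control node, nothing needs to be stored in an Ansible vault. Note that no_log also suppresses the --diff output for this task.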


Rulebook

  1. All persistent server state is declared in Ansible. Manual changes are emergencies only and must be back-ported within 24 h.
  2. ansible-playbook --check --diff runs weekly in CI. Any reported change opens a drift-detected issue.
  3. Roles, not playbooks. Group reusable units (docker, claude-runner, prometheus) into roles; playbooks compose roles per host.
  4. Inventory is in git. ansible/inventory/hosts.yml lists every VPS with role tags.
  5. Tags for safety. --tags secrets deploys only env files; --tags compose restarts services only. Never run untagged in production unless a full converge is intended.
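
As an illustration of rules 3 and 5, a per-host playbook might compose roles roughly as follows. This is a sketch only: which roles belong on vps-i1 is an assumption, and it presumes roles_path is configured so that playbooks under playbooks/ can find roles/.

```yaml
# playbooks/vps-i1.yml (sketch; the role-to-host mapping is assumed)
- name: Converge vps-i1
  hosts: vps-i1
  become: true
  roles:
    - role: common
    - role: docker
    - role: node-exporter
    - role: cadvisor
    - role: monitoring-stack
      # A tag set here is inherited by every task in the role.
      # Finer-grained tags such as "secrets" would sit on individual tasks.
      tags: [compose]
```

With this layout, ansible-playbook playbooks/vps-i1.yml --tags compose --check --diff would limit a dry run to the tasks carrying the compose tag.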

Architecture

ansible/
├── inventory/
│   └── hosts.yml             # vps-i1, vps-h1, future ovh-f
├── group_vars/
│   ├── all.yml               # shared (DNS zone, alert recipient)
│   └── monitoring_hosts.yml
├── host_vars/
│   ├── vps-i1.yml
│   └── vps-h1.yml
├── roles/
│   ├── common/               # users, hostname, fail2ban, ufw, ntp
│   ├── docker/
│   ├── claude-runner/
│   ├── claude-admin-user/
│   ├── github-runner/
│   ├── monitoring-stack/     # the existing compose stack
│   ├── n8n-stack/            # the Hostinger compose stack
│   ├── node-exporter/
│   └── cadvisor/
└── playbooks/
    ├── site.yml              # everything
    ├── vps-i1.yml
    ├── vps-h1.yml
    └── provision-new-vps.yml # replaces scripts/bootstrap-new-vps.sh
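
For orientation, inventory/hosts.yml could look something like the sketch below. The group monitoring_hosts is taken from the group_vars file name; placing vps-i1 in it, the n8n_hosts group, and the addresses (RFC 5737 documentation ranges) are assumptions.

```yaml
# inventory/hosts.yml (sketch; addresses are documentation placeholders)
all:
  children:
    monitoring_hosts:
      hosts:
        vps-i1:
          ansible_host: 192.0.2.10      # placeholder address
    n8n_hosts:                          # hypothetical group name
      hosts:
        vps-h1:
          ansible_host: 198.51.100.20   # placeholder address
  vars:
    ansible_user: root
    ansible_ssh_private_key_file: ~/.ssh/id_ed25519
```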

Implementation plan

Phase 1 — inventory the IONOS VPS (1 d)

  1. Run ansible vps-i1 -m setup against the live server to gather ansible_facts and capture the current state.
  2. Author roles by reverse-engineering existing scripts (install-ionos.sh, setup-claude-env.sh, etc.).
  3. Run ansible-playbook playbooks/vps-i1.yml --check --diff repeatedly until it reports zero changes.
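
The state capture in step 1 could also be done with a small read-only playbook, sketched below. The playbook name and output path are hypothetical, and package_facts assumes the target has the Python bindings for its package manager (python3-apt on Debian/Ubuntu).

```yaml
# playbooks/capture-state.yml (hypothetical helper, not part of the tree above)
- name: Snapshot installed packages and services
  hosts: vps-i1
  become: true
  gather_facts: true
  tasks:
    - name: Collect installed packages
      ansible.builtin.package_facts:
        manager: auto

    - name: Collect service states
      ansible.builtin.service_facts:

    - name: Write the snapshot to the control node
      ansible.builtin.copy:
        content: "{{ {'packages': ansible_facts.packages, 'services': ansible_facts.services} | to_nice_yaml }}"
        dest: "{{ playbook_dir }}/state-{{ inventory_hostname }}.yml"
        mode: "0600"
      delegate_to: localhost   # the file lands on the dev machine, not the VPS
      become: false
```

The resulting file gives a checklist to compare against while authoring roles; it changes nothing on the server.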

Phase 2 — Hostinger parity (1 d)

  1. Same exercise for vps-h1.
  2. Move hostinger/docker-compose.yml template into the n8n-stack role.

Phase 3 — CI drift detection (0.5 d)

  1. New workflow .github/workflows/ansible-drift.yml — weekly, runs --check --diff against all hosts.
  2. On diff, opens a drift-detected issue with the diff embedded.
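
A minimal sketch of what the drift workflow might contain. The schedule, step names, and the grep-based change detection are assumptions; it also presumes that SSH known_hosts and any sops decryption key are provisioned separately, and that a drift-detected label already exists in the repository.

```yaml
# .github/workflows/ansible-drift.yml (sketch, not a tested workflow)
name: ansible-drift
on:
  schedule:
    - cron: "0 6 * * 1"     # weekly; day and hour are arbitrary
  workflow_dispatch: {}     # allows manual runs via gh workflow run

permissions:
  contents: read
  issues: write

jobs:
  drift:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ansible
    steps:
      - uses: actions/checkout@v4

      - name: Install Ansible and requirements
        run: |
          pip install 'ansible-core>=2.16' jmespath
          ansible-galaxy install -r requirements.yml

      - name: Load SSH key
        env:
          VPS_ROOT_SSH_KEY: ${{ secrets.VPS_ROOT_SSH_KEY }}
        run: |
          install -m 700 -d ~/.ssh
          printf '%s\n' "$VPS_ROOT_SSH_KEY" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519

      - name: Dry run against all hosts
        run: |
          set -o pipefail   # fail the step if ansible-playbook itself fails
          ansible-playbook playbooks/site.yml --check --diff | tee drift.log

      - name: Open issue on drift
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # --check exits 0 even when changes are pending, so inspect the recap.
          if grep -Eq 'changed=[1-9]' drift.log; then
            gh issue create --title "drift-detected" \
              --label drift-detected --body-file drift.log
          fi
```

Parsing the play recap is the simplest option but is somewhat brittle; a JSON stdout callback would be a sturdier alternative if the log format becomes a problem.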

Phase 4 — replace bootstrap script (0.5 d)

  1. Update the provision-new-vps.yml CI workflow to run ansible-playbook playbooks/provision-new-vps.yml instead of scripts/bootstrap-new-vps.sh.
  2. Document in .claude/commands/provision-vps.md.
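
One possible shape for the replacement playbook, shown as a sketch. The target variable, the role list, and the wait timeout are assumptions; the new host must already be present in the inventory.

```yaml
# playbooks/provision-new-vps.yml (sketch; role order and `target` are assumed)
- name: Bring a fresh VPS to the monitoring-host baseline
  hosts: "{{ target }}"       # hypothetical extra var, passed with -e target=<host>
  become: true
  gather_facts: false         # the host may not be reachable yet
  pre_tasks:
    - name: Wait until SSH is reachable
      ansible.builtin.wait_for_connection:
        timeout: 300

    - name: Gather facts once the host responds
      ansible.builtin.setup:
  roles:
    - common
    - docker
    - node-exporter
    - cadvisor
    - monitoring-stack
```

Leaving hosts undefined by default means the play fails fast if no target is passed, which avoids provisioning the wrong machine by accident.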

Acceptance criteria

  • ansible-playbook playbooks/site.yml --check --diff reports zero changes immediately after a manually triggered deploy (i.e. the playbooks are idempotent)
  • Drift CI workflow runs successfully on schedule
  • Provisioning a scratch VPS with provision-new-vps.yml produces a working monitoring host that passes health-check.yml probes
  • All previously-imperative scripts under scripts/ either deleted or marked deprecated with pointers to the Ansible role
  • docs/runbook.md updated: every “ssh + apt install” recipe replaced with “edit role + run playbook”

Cost impact

0 €. Ansible is free; CI minutes stay within the free tier (one run per week, a few minutes each).

Back-out plan

Ansible doesn’t actively change anything if we don’t run it. Removing the workflows + folder leaves the VPSes exactly as they are. Old shell scripts remain in git history.

Risks / open questions

  • Risk: Reverse-engineering the current state imperfectly means drift on the first run. Mitigation: run --check --diff repeatedly before any real run (one without --check).
  • Q: Why not Terraform with the remote-exec (SSH) provisioner? A: Terraform is for provisioning cloud resources; we use control panels. Ansible is purpose-built for the config-management half of the job.
  • Q: NixOS? A: Too steep a curve for one operator. Ansible’s plain YAML is more legible to future-Claude and future-you.

Bootstrap

The PR that lands the scaffolding produces artifacts only — no playbook has been applied to a live VPS. The human operator must run the following sequence once to declare the playbook authoritative.

  1. Install Ansible on the dev machine:

    pip install 'ansible-core>=2.16' 'jmespath'

  2. Install collections:

    cd ansible
    ansible-galaxy install -r requirements.yml

  3. Verify connectivity (root SSH key required: ~/.ssh/id_ed25519):

    ansible all -m ping

    Expected: each host reports SUCCESS with a result containing "ping": "pong", e.g. vps-i1 | SUCCESS => { "changed": false, "ping": "pong" }, and likewise for vps-h1.

  4. First dry-run against each VPS — read the diff carefully:

    ansible-playbook playbooks/vps-i1.yml --check --diff
    ansible-playbook playbooks/vps-h1.yml --check --diff

    Expect a non-empty diff on the first run: some tasks will report changes because the roles express state differently from how the scripts left it (e.g. Docker apt-keyring fingerprint, cron entry format). Do not apply yet.

  5. If the diff is large, edit the relevant role to converge with reality (this is the “reverse-engineer” phase). Repeat step 4 until the diff is zero or limited to items you intentionally want to converge.

  6. Once the diff is acceptable, the playbook is declared faithful. From this point, manual changes on the VPS must be back-ported to Ansible within 24 h (rulebook §1).

  7. Validate weekly drift CI:

    gh workflow run ansible-drift.yml
    gh run watch

    Confirm the workflow runs end-to-end and only opens a drift issue if --check shows changes.

  8. Rotate VPS_ROOT_SSH_KEY if older than 90 days (the secret is used by the drift workflow).

  9. Future (post spec 09): when claude-admin exists on vps-h1, switch the --check SSH user from root to claude-admin to reduce blast radius.
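
For step 9, the user switch could be expressed per host, roughly as below. This is a sketch that assumes spec 09 gives claude-admin the sudo rights the roles need; check mode still has to read root-owned files, so privilege escalation stays enabled.

```yaml
# host_vars/vps-h1.yml (sketch for the post-spec-09 state)
ansible_user: claude-admin
ansible_become: true          # escalate per task instead of logging in as root
ansible_become_method: sudo   # assumes claude-admin has the required sudo rights
```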