Spec 04 — IaC for VPS state (Ansible)
Purpose
scripts/install-ionos.sh and bootstrap-new-vps.sh are imperative bash. They work for the initial install but offer no drift detection, no idempotency guarantees, and no auditable history of what was installed/configured/changed on each VPS. Six months from now, rebuilding a VPS will recover only what we remembered to scriptify.
Ansible is the lightest-weight fit: agentless (just SSH), no daemon to maintain, plays read top-to-bottom like docs. We don’t need Terraform because we don’t programmatically provision the VPSes themselves — IONOS and Hostinger control panels handle that.
Depends on spec 03 (secrets) so Ansible can pull decrypted values via sops at run-time rather than carrying its own vault.
Rulebook
- All persistent server state is declared in Ansible. Manual changes are emergencies only and must be back-ported within 24 h.
ansible-playbook --checkruns weekly in CI. Any diff opens adrift-detectedissue.- Roles, not playbooks. Group reusable units (docker, claude-runner, prometheus) into roles; playbooks compose roles per host.
- Inventory is in git.
ansible/inventory/hosts.ymllists every VPS with role tags. - Tags for safety.
--tags secretsdeploys only env files;--tags composeonly restarts services; never run untagged in production unless explicitly intended.
Architecture
ansible/
├── inventory/
│ └── hosts.yml # vps-i1, vps-h1, future ovh-f
├── group_vars/
│ ├── all.yml # shared (DNS zone, alert recipient)
│ └── monitoring_hosts.yml
├── host_vars/
│ ├── vps-i1.yml
│ └── vps-h1.yml
├── roles/
│ ├── common/ # users, hostname, fail2ban, ufw, ntp
│ ├── docker/
│ ├── claude-runner/
│ ├── claude-admin-user/
│ ├── github-runner/
│ ├── monitoring-stack/ # the existing compose stack
│ ├── n8n-stack/ # the Hostinger compose stack
│ ├── node-exporter/
│ └── cadvisor/
└── playbooks/
├── site.yml # everything
├── vps-i1.yml
├── vps-h1.yml
└── provision-new-vps.yml # replaces scripts/bootstrap-new-vps.sh
Implementation plan
Phase 1 — inventory the IONOS VPS (1 d)
- Run
ansible-inventory --listagainst the live server (usingansible_facts) to capture current state. - Author roles by reverse-engineering existing scripts (
install-ionos.sh,setup-claude-env.sh, etc.). - Run
ansible-playbook vps-i1.yml --check --diffrepeatedly until it reports zero changes.
Phase 2 — Hostinger parity (1 d)
- Same exercise for vps-h1.
- Move
hostinger/docker-compose.ymltemplate into then8n-stackrole.
Phase 3 — CI drift detection (0.5 d)
- New workflow
.github/workflows/ansible-drift.yml— weekly, runs--check --diffagainst all hosts. - On diff, opens a
drift-detectedissue with the diff embedded.
Phase 4 — replace bootstrap script (0.5 d)
provision-new-vps.ymlworkflow updated to runansible-playbook provision-new-vps.ymlinstead of the bash script.- Document in
.claude/commands/provision-vps.md.
Acceptance criteria
-
ansible-playbook site.yml --check --diffreports zero changes immediately after a manual deploy - Drift CI workflow runs successfully on schedule
- Provisioning a scratch VPS with
provision-new-vps.ymlproduces a working monitoring host that passeshealth-check.ymlprobes - All previously-imperative scripts under
scripts/either deleted or marked deprecated with pointers to the Ansible role -
docs/runbook.mdupdated: every “ssh + apt install” recipe replaced with “edit role + run playbook”
Cost impact
0 €. Ansible is free; CI minutes are within free tier (one run/week × few minutes).
Back-out plan
Ansible doesn’t actively change anything if we don’t run it. Removing the workflows + folder leaves the VPSes exactly as they are. Old shell scripts remain in git history.
Risks / open questions
- Risk: Reverse-engineering current state imperfectly = drift on first run. Mitigation:
--check --diffrepeatedly before any actual--apply. - Q: Why not Terraform with the
sshprovisioner? A: Terraform is for provisioning cloud resources; we use control panels. Ansible is purpose-built for the config-management half of the job. - Q: NixOS? A: Too steep a curve for one operator. Ansible’s plain-YAML is more legible by future-Claude and future-you.
Bootstrap
The PR that lands the scaffolding produces artifacts only — no playbook has been applied to a live VPS. The human operator must run the following sequence once to declare the playbook authoritative.
-
Install Ansible on the dev machine:
pip install 'ansible-core>=2.16' 'jmespath' -
Install collections:
cd ansible ansible-galaxy install -r requirements.yml -
Verify connectivity (root SSH key required —
~/.ssh/id_ed25519):ansible all -m pingExpected:
vps-i1 | SUCCESS => "ping": "pong"and same forvps-h1. -
First dry-run against each VPS — read the diff carefully:
ansible-playbook playbooks/vps-i1.yml --check --diff ansible-playbook playbooks/vps-h1.yml --check --diffExpect a non-zero diff on the first run because Ansible recomputes some idempotent assertions (e.g. Docker apt-keyring fingerprint, cron entry format). Do not apply yet.
-
If the diff is large, edit the relevant role to converge with reality (this is the “reverse-engineer” phase). Repeat step 4 until the diff is zero or limited to items you intentionally want to converge.
-
Once diff is acceptable, the playbook is declared faithful. From this point, manual changes on the VPS must be back-ported to Ansible within 24 h (rulebook §7).
-
Validate weekly drift CI:
gh workflow run ansible-drift.yml gh run watchConfirm the workflow runs end-to-end and only opens a drift issue if
--checkshows changes. -
Rotate
VPS_ROOT_SSH_KEYif older than 90 days (the secret is used by the drift workflow). -
Future (post spec 09): when
claude-adminexists onvps-h1, switch the--checkSSH user fromroottoclaude-adminto reduce blast radius.