Rulebook — Building Infrastructure Improvements

Conventions for designing, implementing, and operating the improvements in docs/improvements/. These rules exist because infra mistakes are expensive (lost data, leaked secrets, midnight pages) — we trade a little overhead up front for a lot of reliability later.

1. One spec, one issue, one PR (where possible)

Every spec gets a GitHub issue before any code is written.
Issues reference the spec file: docs/improvements/NN-name.md.
Default to one PR per spec. Only split if the PR exceeds ~500 LOC or crosses tier boundaries (e.g. P1 + P2 in one PR is fine; mixing infra + product code is not).
PRs target main (single-branch model — see CLAUDE.md).

2. Specs must be measurable

Every spec ends with Acceptance Criteria that a reviewer can check in <10 minutes. Examples:

✅ “After running ./test-restore.sh, n8n boots with all 12 workflows present and the test workflow executes successfully.”
❌ “Backups work.”

If you can’t write a 10-minute check, the spec isn’t done — go back to the design.

3. Every change has a back-out path

For each spec, document how to revert before merging. Specifically:

What command/PR rolls it back?
What state is left behind (volumes, cron jobs, registry entries)?
Does the back-out lose data? If yes, say so loudly.

4. Test against IONOS before OVH

We don’t have a true staging tier for infra. The convention is:

IONOS VPS = de-facto staging for the monitoring stack.
Hostinger VPS = production for n8n + WAHA (no second instance — see §6).
OVH Server F = future production for monitoring (not yet provisioned).

Until OVH exists, validate every monitoring change on IONOS first. Once OVH is live, IONOS becomes proper staging.

5. Secrets never leave their boundary

Never commit unencrypted secrets. Use sops once spec 03 lands.
Never paste credentials into Claude Code sessions, issue bodies, PR descriptions, or commit messages. Treat any value that touched an LLM session as compromised — rotate it.
Never print full secrets to CI logs. Echo only fingerprints (sha256sum | head -c 8).
The full secrets inventory lives in docs/infrastructure-overview.md §9.

6. Single-instance services need backups, not HA

For our scale (one user, ~10 daily-active services) full HA is overkill. The policy:

Single-instance services (n8n, WAHA, Traccar, Grafana) must have automated backups + a documented restore drill.
The restore drill must be run at least once per quarter. Set a calendar reminder.
A backup that hasn’t been restored is a hypothesis, not a backup.

7. Drift is the enemy

After spec 04 lands, every server state change goes through Ansible — no manual apt install, systemctl enable, or useradd on the VPS.
Manual emergency changes are allowed but must be back-ported to Ansible within 24 hours, or you’ve created a drift bug.
Weekly ansible-playbook --check runs in CI; any diff opens an issue.

8. Observability before automation

Don’t automate something you can’t observe. The order is:

Make the manual process work and document it.
Add metrics + alerts.
Then automate.

This prevents silent-failure automations (e.g. nightly backups that fail and nobody knows for months).

9. Update the runbook with every operational change

If a spec introduces a new alert, a new daily process, or a new manual recovery step, docs/runbook.md (and docs/hostinger-runbook.md once it exists — spec 13) must be updated in the same PR.

If it’s not in the runbook, future-you doesn’t know about it at 3 AM.

10. Cost-aware defaults

Prefer free-tier or already-paid services (Wasabi, Cloudflare Free, Vercel Hobby, GitHub Free + self-hosted runners).
Avoid SaaS where a single-binary container does the job (e.g. Uptime Kuma over BetterStack).
Any spec that increases monthly cost by >5€ must justify it in the Cost impact section.

11. Naming conventions

DNS: {service}.{vps-label}.infra.zintegrowana.online — see CLAUDE.md.
Docker compose service names: kebab-case, no env suffix (we differentiate by VPS, not stack).
File names in this folder: NN-kebab-case.md, two-digit prefix matches priority order.
Branch names: feat/NN-improvement-slug (e.g. feat/01-backups).

12. Don’t pre-design for hypothetical scale

We have one production fleet platform, one infra VPS, one n8n VPS, and one developer. Don’t add Kubernetes, service mesh, or distributed tracing because “we might need it someday”. If we need it, we’ll have time to design it properly then.

Three similar manual steps beats a premature abstraction.

Spec template

Every spec follows this skeleton:

---
id: NN
title: ...
tier: P1|P2|P3
effort: 0.5d
depends_on: []
status: proposed|in-progress|done
---
 
# Spec NN — <title>
 
## Purpose
Why this exists. What breaks today without it.
 
## Rulebook (operating rules)
Day-2 rules once shipped: what to do/not-do, who runs the restore drill, etc.
 
## Architecture
ASCII diagram or table. What runs where, what talks to what.
 
## Implementation plan
Numbered steps with concrete file paths and commands.
 
## Acceptance criteria
Checklist a reviewer can run in 10 min.
 
## Cost impact
€/month delta. If zero, say so.
 
## Back-out plan
How to revert. What gets left behind.
 
## Risks / open questions

Glossary

Spec — a markdown file in this folder describing one improvement.
Rulebook section — within a spec, the day-2 operating rules once it ships.
Restore drill — actual restore of a real backup into a scratch container, verifying data integrity.
Drift — divergence between declared state (git/Ansible) and actual server state.

p24-infra Docs

Explorer

rulebook