AI Batch GPU Processing — Operations Workbook

Nightly GPU inference for heavy report generation. The GPU endpoint spins up at 02:00 UTC, runs the reports, and shuts down, so billing covers only ~4 hours per night.

Architecture

02:00 UTC
├── n8n cron trigger
│   ├── scale down to 1 n8n worker (worker-1 stays for GPS cron)
│   ├── resume GPU endpoint (HF / RunPod)
│   ├── wait ~4 min (cold start)
│   ├── POST reports to model API (OpenAI-compatible)
│   ├── pause GPU endpoint (billing stops)
│   └── restore n8n workers (worker-2 + worker-3)
│
└── GPU endpoint (HF Inference Endpoint or RunPod pod)
    └── vLLM or TGI serving the model
        └── POST /v1/chat/completions  ← OpenAI-compatible

Cost at 4 hours/night × 20 nights/month (80 GPU-hours/month):

Provider	Hardware	$/hr	Monthly
HF Inference Endpoints	1× A100 80GB	~$3.50	~$280
RunPod On-Demand	1× A100 80GB	~$2.00	~$160
RunPod Spot	1× A100 80GB	~$1.30	~$104
Vast.ai	1× A100 80GB	~$1.20	~$96
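The monthly figures above are hourly rate × hours per night × nights per month. A minimal sketch of that arithmetic, using the approximate rates from the table; `monthly_cost` is a hypothetical helper name, not part of any existing script:

```bash
# Hypothetical helper: monthly cost = rate ($/hr) x hours/night x nights/month.
# Defaults (4 hours, 20 nights) match the assumptions stated above the table.
monthly_cost() {
  local rate="$1" hours="${2:-4}" nights="${3:-20}"
  LC_ALL=C awk -v r="$rate" -v h="$hours" -v n="$nights" \
    'BEGIN { printf "%.2f\n", r * h * n }'
}

monthly_cost 3.50   # HF Inference Endpoints → 280.00
monthly_cost 1.30   # RunPod Spot            → 104.00
```

Passing different hours or nights (for example `monthly_cost 2.00 6 30`) shows how quickly the bill grows if sessions run long or every night.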

Model Selection

Models that fit on 1× A100 80GB (most cost-efficient):

Model	Size	Strength	HF Hub ID
DeepSeek-R1-Distill-Qwen-32B	32B	Reasoning, analysis	`deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`
Qwen2.5-32B-Instruct	32B	General, multilingual	`Qwen/Qwen2.5-32B-Instruct`
Meta-Llama-3.1-70B-Instruct	70B	Strong general (needs ~75 GB with 8-bit quantization; tight on A100 80GB)	`meta-llama/Meta-Llama-3.1-70B-Instruct`

For 70B+ models, use 2× A100 or 1× H100. The price doubles, but the quality jump is significant for complex reports.

Recommended start: Qwen2.5-32B-Instruct — fast, fits comfortably, strong multilingual (Polish/German/English), good instruction following.
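Whether a model fits is mostly a question of weight memory: parameter count × bytes per parameter, plus headroom for the KV cache and activations. A rough sketch of that estimate; `weights_gb` is a hypothetical helper, and it counts weights only, so real usage will be higher:

```bash
# Weights only: billions of parameters x bits per parameter / 8 = GB (1 GB = 1e9 bytes).
# KV cache and activations are NOT included in this estimate.
weights_gb() {
  local params_b="$1" bits="${2:-16}"
  LC_ALL=C awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.0f\n", p * b / 8 }'
}

weights_gb 32 16   # → 64   leaves headroom on an 80 GB card
weights_gb 70 16   # → 140  does not fit on one 80 GB card
weights_gb 70 8    # → 70   fits only with very little headroom
```

This is why the 32B models are the comfortable choice on a single A100 80GB, while the 70B model is marginal even when quantized.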

Option A — Hugging Face Inference Endpoints

Initial setup (one-time, via HF UI)

1. Go to huggingface.co/inference-endpoints
2. New endpoint → pick model (e.g. Qwen/Qwen2.5-32B-Instruct)
3. Hardware: 1× NVIDIA A100 80GB
4. Framework: Text Generation Inference (TGI)
5. Scaling: min=0, max=1 (min=0 lets the endpoint scale to zero when idle; pausing is a separate, explicit action)
6. Region: eu-west-1 (closest to bms-4)
7. Note the endpoint URL and name

API — start/pause/status

# Base: https://api.endpoints.huggingface.cloud/v2/endpoint/{namespace}/{name}
# Auth: Authorization: Bearer ${HF_API_TOKEN}

# Resume (billing starts, cold start ~3-5 min)
curl -X POST "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports/resume" \
  -H "Authorization: Bearer ${HF_API_TOKEN}"

# Check status
curl "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports" \
  -H "Authorization: Bearer ${HF_API_TOKEN}" | jq '.status.state'
# Returns: "scaledToZero" | "pending" | "initializing" | "running" | "paused" | "failed"

# Pause (billing stops immediately)
curl -X POST "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports/pause" \
  -H "Authorization: Bearer ${HF_API_TOKEN}"
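Resume returns before the endpoint is ready, so a caller typically polls the status call until the state is `running`. A minimal sketch, assuming a caller-supplied command that prints the current state on stdout; `wait_for_state`, `hf_state`, and `STATUS_URL` are hypothetical names introduced here for illustration:

```bash
# Poll a state-printing command until it prints the wanted state.
# Usage: wait_for_state WANTED MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as WANTED is seen, 1 once MAX_ATTEMPTS polls have passed.
wait_for_state() {
  local wanted="$1" max_attempts="$2" interval="$3"
  shift 3
  local attempt=1 state
  while [ "$attempt" -le "$max_attempts" ]; do
    # A failing status command is treated as "unknown" and polled again.
    state="$("$@" 2>/dev/null)" || state="unknown"
    if [ "$state" = "$wanted" ]; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts}: state=${state}" >&2
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (hypothetical wrapper around the status call above):
#   hf_state() {
#     curl -fsS "$STATUS_URL" -H "Authorization: Bearer ${HF_API_TOKEN}" \
#       | jq -r '.status.state'
#   }
#   wait_for_state running 10 30 hf_state
```

The 10 × 30 s values in the example mirror the poll loop in the n8n flow; a non-zero return is the signal to pause the endpoint and alert rather than keep waiting.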

Inference call (OpenAI-compatible)

curl "https://{endpoint-url}/v1/chat/completions" \
  -H "Authorization: Bearer ${HF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Generate fleet report for..."}],
    "max_tokens": 4096,
    "temperature": 0.3
  }'

Required secrets (add to `/opt/bms4-services/.env` and GH Secrets)

Variable	Value
`HF_API_TOKEN`	HF token with `inference-endpoints:write` scope
`HF_ENDPOINT_NAMESPACE`	`radieu`
`HF_ENDPOINT_NAME`	e.g. `nightly-reports`
`HF_ENDPOINT_URL`	URL from HF dashboard

Option B — RunPod

Why RunPod over HF for batch

- ~40% cheaper on-demand, ~60% cheaper on spot
- Serverless option: pay per second of actual compute, no idle billing even while “running”
- Better for bursty workloads (reports that take varying time)

Setup (one-time)

1. Create account at runpod.io
2. Deploy → Serverless → New Endpoint
3. Select template: vLLM (pre-built, OpenAI-compatible)
4. Worker: 1× A100 80GB, max workers: 1
5. Set environment: MODEL_NAME=Qwen/Qwen2.5-32B-Instruct
6. Note the endpoint ID

API

# Serverless: no start/stop needed; it scales to zero automatically
# Just POST jobs, RunPod spins up a worker on demand

# Inference (the vLLM worker reads generation settings from "sampling_params")
curl "https://api.runpod.ai/v2/{endpoint-id}/runsync" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "messages": [{"role": "user", "content": "Generate report..."}],
      "sampling_params": {
        "max_tokens": 4096,
        "temperature": 0.3
      }
    }
  }'

Required secrets

Variable	Value
`RUNPOD_API_KEY`	RunPod API key
`RUNPOD_ENDPOINT_ID`	Serverless endpoint ID

n8n Workflow — Nightly GPU Batch

Workflow file: infra-src/n8n-workflows/nightly-gpu-batch.json

Flow

02:00 cron
  → Scale n8n: stop worker-2, worker-3
  → Resume HF endpoint (or RunPod wakes on first call)
  → Wait 4 min (cold start)
  → Poll status loop (max 10 attempts × 30s)
  → IF status == "running"
      → Run each report config (split in batches)
          → POST /v1/chat/completions
          → save result to Supabase / upload to GDrive
      → Pause HF endpoint
      → Start worker-2, worker-3
      → Discord: ✅ reports done, cost estimate
    ELSE (timeout)
      → Pause endpoint (don't leave it running)
      → Start workers
      → Discord: ❌ GPU endpoint failed to start
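Both branches of the flow end with the same cleanup: pause the endpoint and restart the workers. If the sequence is ever scripted outside n8n, it helps to make that cleanup unconditional. A minimal sketch, assuming four hypothetical hook functions (`stop_workers`, `start_workers`, `resume_endpoint`, `pause_endpoint`) that would wrap the docker compose and endpoint calls documented in this workbook; the default bodies below only echo and must be replaced before real use:

```bash
# Hypothetical hooks: replace the bodies with the real commands.
stop_workers()    { echo "stop n8n-worker-2 n8n-worker-3"; }
start_workers()   { echo "start n8n-worker-2 n8n-worker-3"; }
resume_endpoint() { echo "resume GPU endpoint"; }
pause_endpoint()  { echo "pause GPU endpoint"; }

# Run COMMAND between resume and pause; cleanup runs on success and on failure.
# Usage: with_gpu_session COMMAND [ARGS...]
# Returns COMMAND's exit status, or 1 if the session could not be started.
with_gpu_session() {
  local status=0
  if stop_workers && resume_endpoint; then
    "$@" || status=$?
  else
    status=1
  fi
  # Cleanup: never leave the endpoint billing or the workers stopped.
  pause_endpoint || echo "WARN: pause failed, check the provider dashboard" >&2
  start_workers  || echo "WARN: worker restart failed" >&2
  return "$status"
}
```

The wrapped command would cover the wait, the poll loop, and the report batch. This sketch does not handle signals or a host reboot mid-run, so the manual recovery steps under Troubleshooting still apply.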

Report config format

Each report is a row in audit.actions with action_type = 'ai_workbook'. The workflow reads the active actions, generates each report via the GPU endpoint instead of claude-proxy, and writes the result to audit.runs.

Scaling n8n from within n8n

The workflow runs these commands on the host via an SSH node or an Execute Command node:

# Scale down before the batch
cd /opt/bms4-services && docker compose stop n8n-worker-2 n8n-worker-3

# Scale up after
cd /opt/bms4-services && docker compose up -d n8n-worker-2 n8n-worker-3

Cost Monitoring

Add to Prometheus / cost-exporter:

Metric	Description
`gpu_batch_session_duration_seconds`	How long the endpoint was running
`gpu_batch_reports_total`	Reports processed per session
`gpu_batch_estimated_cost_eur`	Session duration in hours × hourly rate in EUR
`gpu_batch_last_success_timestamp`	Last successful run

The script writes a .prom file to the node_exporter textfile collector directory after each session.
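A minimal sketch of such a writer, emitting the four metrics from the table in Prometheus text format. `write_gpu_batch_metrics` is a hypothetical name, the hourly rate is assumed to be supplied in EUR by the caller, and the output path depends on how node_exporter's `--collector.textfile.directory` is configured on the host:

```bash
# Write the session metrics in Prometheus text exposition format.
# Usage: write_gpu_batch_metrics DURATION_SECONDS REPORTS RATE_EUR_PER_HOUR OUTPUT_FILE
write_gpu_batch_metrics() {
  local duration="$1" reports="$2" rate="$3" out="$4"
  local tmp="${out}.$$"
  local cost
  # Estimated cost = duration in hours x hourly rate.
  cost="$(LC_ALL=C awk -v d="$duration" -v r="$rate" \
    'BEGIN { printf "%.2f", d / 3600 * r }')"
  {
    printf 'gpu_batch_session_duration_seconds %s\n' "$duration"
    printf 'gpu_batch_reports_total %s\n' "$reports"
    printf 'gpu_batch_estimated_cost_eur %s\n' "$cost"
    printf 'gpu_batch_last_success_timestamp %s\n' "$(date +%s)"
  } > "$tmp"
  # Rename within one filesystem is atomic, so a scrape never sees a partial file.
  mv "$tmp" "$out"
}
```

For example, `write_gpu_batch_metrics 14400 12 3.50 "$TEXTFILE_DIR/gpu_batch.prom"` (with `TEXTFILE_DIR` as a placeholder) records a 4-hour session with 12 reports and an estimated cost of 14.00. The temporary file does not end in `.prom`, so the collector ignores it until the rename.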

Troubleshooting

Symptom	Cause	Fix
Endpoint stuck in “initializing”	Model too large for hardware tier	Upgrade hardware or use smaller model
503 from inference endpoint	Cold start not complete	Increase wait time in n8n (try 6 min)
Reports incomplete	Token limit hit	Increase `max_tokens` or split report into sections
n8n workers not restored	Workflow crashed mid-run	Run `testing-window-end.sh` manually
HF billing continued	Pause call failed	Check HF dashboard, pause manually
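For the 503 row, an alternative to a longer fixed wait is to retry the inference call a few times with a pause in between. A minimal sketch; `retry` is a hypothetical helper, and it assumes the wrapped command exits non-zero on HTTP errors (as `curl --fail` does):

```bash
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 once MAX_ATTEMPTS attempts have failed.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

Usage would look like `retry 5 60 curl --fail -sS ...` around the inference call. Keep the attempt count small: every retry extends the billed session.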

Password Rotation

Credential	Variable	Rotation
HF API Token	`HF_API_TOKEN`	Regenerate at huggingface.co/settings/tokens. Update `.env` on bms-4 + GH Secret.
RunPod API Key	`RUNPOD_API_KEY`	Regenerate at runpod.io/console/user/settings. Same procedure.
