AI Batch GPU Processing — Operations Workbook
Nightly GPU inference for heavy report generation. GPU server spins up at 02:00 UTC, runs reports, shuts down. Pay only for ~4 hours/night.
Architecture
02:00 UTC
├── n8n cron trigger
│   ├── scale down to 1 n8n worker (worker-1 stays for GPS cron)
│   ├── resume GPU endpoint (HF / RunPod)
│   ├── wait ~4 min (cold start)
│   ├── POST reports to model API (OpenAI-compatible)
│   ├── pause GPU endpoint (billing stops)
│   └── restore n8n workers (worker-2 + worker-3)
│
└── GPU endpoint (HF Inference Endpoint or RunPod pod)
    └── vLLM or TGI serving the model
        └── POST /v1/chat/completions ← OpenAI-compatible

Cost at 4 hours/night × 20 nights/month:
| Provider | Hardware | $/hr | Monthly |
|---|---|---|---|
| HF Inference Endpoints | 1× A100 80GB | ~$3.50 | ~$280 |
| RunPod On-Demand | 1× A100 80GB | ~$2.00 | ~$160 |
| RunPod Spot | 1× A100 80GB | ~$1.30 | ~$104 |
| Vast.ai | 1× A100 80GB | ~$1.20 | ~$96 |
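The monthly figures are the hourly rate times 80 GPU-hours (4 hours × 20 nights). A minimal sketch for re-checking them when rates or run length change; `monthly_cost` is a hypothetical helper name, and the rates are the approximate values from the table:

```shell
#!/bin/sh
# monthly_cost RATE_PER_HOUR HOURS_PER_NIGHT NIGHTS_PER_MONTH
# Prints the monthly cost rounded to two decimals.
monthly_cost() {
  awk -v rate="$1" -v hours="$2" -v nights="$3" \
    'BEGIN { printf "%.2f\n", rate * hours * nights }'
}

# Approximate USD/hour rates copied from the table above.
for entry in "HF:3.50" "RunPod-on-demand:2.00" "RunPod-spot:1.30" "Vast.ai:1.20"; do
  name="${entry%%:*}"   # text before the colon
  rate="${entry##*:}"   # text after the colon
  printf '%-18s $%s/month\n' "$name" "$(monthly_cost "$rate" 4 20)"
done
```

A run that takes 5 hours instead of 4 raises every row by 25%, so the hours argument is the one worth watching.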
Model Selection
Models that fit on 1× A100 80GB (most cost-efficient):
| Model | Size | Strength | HF Hub ID |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | 32B | Reasoning, analysis | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| Qwen2.5-32B-Instruct | 32B | General, multilingual | Qwen/Qwen2.5-32B-Instruct |
| Meta-Llama-3.1-70B-Instruct | 70B | Strong general (needs ~75 GB even with 8-bit quantization, so it is tight on A100 80GB) | meta-llama/Meta-Llama-3.1-70B-Instruct |
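For 70B+ models, use 2× A100 or 1× H100. The price roughly doubles, but the quality jump is significant for complex reports. A single H100 also has 80 GB of memory, so a 70B model still needs quantization there.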
Recommended start: Qwen2.5-32B-Instruct — fast, fits comfortably, strong multilingual (Polish/German/English), good instruction following.
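As a rough rule of thumb, the weights alone take about (parameters in billions) × (bytes per parameter) GB. The sketch below only estimates weights; KV cache and runtime overhead come on top and are not modeled. `weights_gb` is a hypothetical helper name.

```shell
#!/bin/sh
# weights_gb PARAMS_IN_BILLIONS BYTES_PER_PARAM
# Prints the approximate size of the model weights in GB (weights only).
weights_gb() {
  awk -v params="$1" -v bytes="$2" 'BEGIN { printf "%.0f\n", params * bytes }'
}

weights_gb 32 2   # 32B model in FP16/BF16 (2 bytes per parameter)
weights_gb 70 2   # 70B model in FP16/BF16
weights_gb 70 1   # 70B model with 8-bit quantization (1 byte per parameter)
```

The 32B models leave headroom on an 80 GB card, while a 70B model does not fit unquantized and leaves little room for the KV cache at 8 bits.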
Option A — Hugging Face Inference Endpoints
Initial setup (one-time, via HF UI)
- Go to huggingface.co/inference-endpoints
- New endpoint → pick model (e.g. Qwen/Qwen2.5-32B-Instruct)
- Hardware: 1× NVIDIA A100 80GB
- Framework: Text Generation Inference (TGI)
- Scaling: min=0, max=1 (no replica runs while the endpoint is paused or scaled to zero)
- Region: eu-west-1 (closest to bms-4)
- Note the endpoint URL and name
API — start/pause/status
# Base: https://api.endpoints.huggingface.cloud/v2/endpoint/{namespace}/{name}
# Auth: Authorization: Bearer ${HF_API_TOKEN}
# Resume (billing starts, cold start ~3-5 min)
curl -X POST "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports/resume" \
  -H "Authorization: Bearer ${HF_API_TOKEN}"
# Check status
curl "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports" \
  -H "Authorization: Bearer ${HF_API_TOKEN}" | jq '.status.state'
# Returns: "scaledToZero" | "pending" | "initializing" | "running" | "paused"
# Pause (billing stops immediately)
curl -X POST "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports/pause" \
  -H "Authorization: Bearer ${HF_API_TOKEN}"
Inference call (OpenAI-compatible)
curl "https://{endpoint-url}/v1/chat/completions" \
  -H "Authorization: Bearer ${HF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Generate fleet report for..."}],
    "max_tokens": 4096,
    "temperature": 0.3
  }'
Required secrets (add to /opt/bms4-services/.env and GH Secrets)
| Variable | Value |
|---|---|
| HF_API_TOKEN | HF token with inference-endpoints:write scope |
| HF_ENDPOINT_NAMESPACE | radieu |
| HF_ENDPOINT_NAME | e.g. nightly-reports |
| HF_ENDPOINT_URL | URL from HF dashboard |
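A missing secret is cheaper to catch before the endpoint is resumed than after billing has started. The following is a minimal preflight sketch, not part of the existing workflow; `require_vars` is a hypothetical helper, and it assumes the variables above have already been loaded from the `.env` file.

```shell
#!/bin/sh
# require_vars NAME...
# Returns 1 and names every listed variable that is unset or empty.
# Uses eval for portable indirect lookup, so pass only trusted variable names.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible usage before the resume call (assumes the .env path from this workbook):
#   set -a; . /opt/bms4-services/.env; set +a
#   require_vars HF_API_TOKEN HF_ENDPOINT_NAMESPACE HF_ENDPOINT_NAME HF_ENDPOINT_URL || exit 1
```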
Option B — RunPod
Why RunPod over HF for batch
- ~40% cheaper on-demand, ~60% cheaper on spot
- Serverless option: pay per second of actual compute, no idle billing even while “running”
- Better for bursty workloads (reports that take varying time)
Setup (one-time)
- Create account at runpod.io
- Deploy → Serverless → New Endpoint
- Select template: vLLM (pre-built, OpenAI-compatible)
- Worker: 1× A100 80GB, max workers: 1
- Set environment: MODEL_NAME=Qwen/Qwen2.5-32B-Instruct
- Note the endpoint ID
API
# Serverless: no start/stop needed — scales to zero automatically
# Just POST jobs, RunPod spins up a worker on demand
# Inference
curl "https://api.runpod.ai/v2/{endpoint-id}/runsync" \
-H "Authorization: Bearer ${RUNPOD_API_KEY}" \
-H "Content-Type: application/json" \
  -d '{
    "input": {
      "messages": [{"role": "user", "content": "Generate report..."}],
      "sampling_params": {"max_tokens": 4096, "temperature": 0.3}
    }
  }'
Required secrets
| Variable | Value |
|---|---|
| RUNPOD_API_KEY | RunPod API key |
| RUNPOD_ENDPOINT_ID | Serverless endpoint ID |
n8n Workflow — Nightly GPU Batch
Workflow file: infra-src/n8n-workflows/nightly-gpu-batch.json
Flow
02:00 cron
→ Scale n8n: stop worker-2, worker-3
→ Resume HF endpoint (or RunPod wakes on first call)
→ Wait 4 min (cold start)
→ Poll status loop (max 10 attempts × 30s)
→ IF status == "running"
    → Run each report config (split in batches)
        → POST /v1/chat/completions
        → save result to Supabase / upload to GDrive
    → Pause HF endpoint
    → Start worker-2, worker-3
    → Discord: ✅ reports done, cost estimate
  ELSE (timeout)
    → Pause endpoint (don't leave it running)
    → Start workers
    → Discord: ❌ GPU endpoint failed to start
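The poll step in the flow can be sketched as a bounded retry loop. This is an illustration of the "max 10 attempts × 30s" logic, not the actual n8n node configuration. `wait_for_running` and `hf_state` are hypothetical names, and `HF_API_BASE` is an assumed variable holding the API base URL from the HF section.

```shell
#!/bin/sh
# wait_for_running STATUS_CMD [MAX_ATTEMPTS] [INTERVAL_SECONDS]
# STATUS_CMD is any command or function that prints the current endpoint state.
# Returns 0 as soon as the state is "running", 1 once the attempts are used up.
wait_for_running() {
  status_cmd="$1"
  max_attempts="${2:-10}"
  interval="${3:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    state="$($status_cmd)"
    if [ "$state" = "running" ]; then
      return 0
    fi
    echo "attempt $attempt/$max_attempts: state=$state" >&2
    # No point sleeping after the final attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Assumed status command; jq -r strips the JSON quotes so the comparison works.
hf_state() {
  curl -s "${HF_API_BASE}/endpoint/${HF_ENDPOINT_NAMESPACE}/${HF_ENDPOINT_NAME}" \
    -H "Authorization: Bearer ${HF_API_TOKEN}" | jq -r '.status.state'
}

# Possible usage:
#   if wait_for_running hf_state 10 30; then echo "ready"; else echo "timeout"; fi
```

Passing the status command as an argument keeps the loop provider-agnostic and lets it be exercised with a stub instead of a live endpoint.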
Report config format
Each report is a row in audit.actions with action_type = 'ai_workbook'. The workflow reads the active actions, generates each report via the GPU endpoint instead of claude-proxy, and writes the result to audit.runs.
Scaling n8n from within n8n
The workflow calls a local script via SSH or an Execute Command node:
# Scale down before the GPU batch
cd /opt/bms4-services && docker compose stop n8n-worker-2 n8n-worker-3
# Scale up after
cd /opt/bms4-services && docker compose up -d n8n-worker-2 n8n-worker-3
Cost Monitoring
Add to Prometheus / cost-exporter:
| Metric | Description |
|---|---|
| gpu_batch_session_duration_seconds | How long endpoint was running |
| gpu_batch_reports_total | Reports processed per session |
| gpu_batch_estimated_cost_eur | duration × hourly rate |
| gpu_batch_last_success_timestamp | Last successful run |
After each session, a script writes a .prom file to the node_exporter textfile collector directory.
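One way such a script could look is sketched below. The function name `write_gpu_metrics`, the file name `gpu_batch.prom`, and the argument order are assumptions. The hourly rate must be passed in EUR to match the metric name, whereas the provider table in this workbook lists USD.

```shell
#!/bin/sh
# write_gpu_metrics DIR DURATION_SECONDS REPORTS HOURLY_RATE_EUR
# Writes the four metrics to DIR/gpu_batch.prom for the textfile collector.
write_gpu_metrics() {
  dir="$1"
  duration="$2"
  reports="$3"
  rate="$4"
  # Estimated cost = hours running × hourly rate.
  cost="$(awk -v d="$duration" -v r="$rate" 'BEGIN { printf "%.2f", d / 3600 * r }')"
  # The temp name does not end in .prom, so the collector ignores it.
  tmp="$dir/gpu_batch.prom.$$"
  {
    echo "gpu_batch_session_duration_seconds $duration"
    echo "gpu_batch_reports_total $reports"
    echo "gpu_batch_estimated_cost_eur $cost"
    echo "gpu_batch_last_success_timestamp $(date +%s)"
  } > "$tmp"
  # A rename within one filesystem is atomic, so a scrape never sees a partial file.
  mv "$tmp" "$dir/gpu_batch.prom"
}

# Possible usage (the directory is whatever node_exporter's textfile flag points at):
#   write_gpu_metrics /path/to/textfile_collector 14400 12 3.20
```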
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Endpoint stuck in “initializing” | Model too large for hardware tier | Upgrade hardware or use smaller model |
| 503 from inference endpoint | Cold start not complete | Increase wait time in n8n (try 6 min) |
| Reports incomplete | Token limit hit | Increase max_tokens or split report into sections |
| n8n workers not restored | Workflow crashed mid-run | Run testing-window-end.sh manually |
| HF billing continued | Pause call failed | Check HF dashboard, pause manually |
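The last two rows share a cause: cleanup was skipped because the run ended early. If the batch is ever driven from a shell script rather than purely from n8n nodes, an EXIT trap is one way to make cleanup run on both success and failure. This is a sketch under that assumption; `run_with_cleanup`, `cleanup`, and `pause_endpoint` are hypothetical names.

```shell
#!/bin/sh
# run_with_cleanup BATCH_CMD CLEANUP_CMD
# Runs BATCH_CMD in a subshell whose EXIT trap always invokes CLEANUP_CMD,
# then returns BATCH_CMD's exit status.
run_with_cleanup() {
  batch_cmd="$1"
  cleanup_cmd="$2"
  (
    trap "$cleanup_cmd" EXIT
    $batch_cmd
  )
}

# Example cleanup (defined here but not called by this sketch).
# pause_endpoint stands for the HF pause call from the API section.
cleanup() {
  # Pause first so billing stops even if restoring the workers fails.
  pause_endpoint || echo "pause failed: check the HF dashboard" >&2
  cd /opt/bms4-services && docker compose up -d n8n-worker-2 n8n-worker-3
}

# Possible usage:
#   run_with_cleanup run_reports cleanup
```

A trap does not help if the host itself goes down mid-run, so the manual fixes in the table remain the fallback.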
Password Rotation
| Credential | Variable | Rotation |
|---|---|---|
| HF API Token | HF_API_TOKEN | Regenerate at huggingface.co/settings/tokens. Update .env on bms-4 + GH Secret. |
| RunPod API Key | RUNPOD_API_KEY | Regenerate at runpod.io/console/user/settings. Same procedure. |