AI Batch GPU Processing — Operations Workbook

Nightly GPU inference for heavy report generation. The GPU server spins up at 02:00 UTC, runs the reports, and shuts down, so billing covers only ~4 hours/night.


Architecture

02:00 UTC
├── n8n cron trigger
│   ├── scale down to 1 n8n worker (worker-1 stays for GPS cron)
│   ├── resume GPU endpoint (HF / RunPod)
│   ├── wait ~4 min (cold start)
│   ├── POST reports to model API (OpenAI-compatible)
│   ├── pause GPU endpoint (billing stops)
│   └── restore n8n workers (worker-2 + worker-3)

└── GPU endpoint (HF Inference Endpoint or RunPod pod)
    └── vLLM or TGI serving the model
        └── POST /v1/chat/completions  ← OpenAI-compatible

Cost at 4 hours/night × 20 nights/month:

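Provider               | Hardware     | $/hr   | Monthly
-----------------------|--------------|--------|--------
HF Inference Endpoints | 1× A100 80GB | ~$3.50 | ~$280
RunPod On-Demand       | 1× A100 80GB | ~$2.00 | ~$160
RunPod Spot            | 1× A100 80GB | ~$1.30 | ~$104
Vast.ai                | 1× A100 80GB | ~$1.20 | ~$96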
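
The monthly figures follow from hours per night × nights per month × hourly rate. A minimal sketch of that arithmetic, where monthly_cost is a hypothetical helper and the rates are the approximate ones from the table:

```shell
#!/usr/bin/env bash
# Monthly cost = hourly rate (USD) x hours per night x nights per month.
# monthly_cost is a hypothetical helper, not part of any existing script.
monthly_cost() {
  local rate="$1" hours="${2:-4}" nights="${3:-20}"
  LC_ALL=C awk -v r="$rate" -v h="$hours" -v n="$nights" \
    'BEGIN { printf "%.2f\n", r * h * n }'
}

# Approximate rates taken from the table above.
for entry in "HF:3.50" "RunPod-OnDemand:2.00" "RunPod-Spot:1.30" "Vast.ai:1.20"; do
  printf '%-16s $%s/month\n' "${entry%%:*}" "$(monthly_cost "${entry##*:}")"
done
```

Passing different hours or nights (for example monthly_cost 2.00 6 30) shows how quickly the bill grows if the batch window is extended.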

Model Selection

Models that fit on 1× A100 80GB (most cost-efficient):

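Model                        | Size | Strength              | HF Hub ID
-----------------------------|------|-----------------------|-----------------------------------------
DeepSeek-R1-Distill-Qwen-32B | 32B  | Reasoning, analysis   | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Qwen2.5-32B-Instruct         | 32B  | General, multilingual | Qwen/Qwen2.5-32B-Instruct
Meta-Llama-3.1-70B-Instruct  | 70B  | Strong general        | meta-llama/Meta-Llama-3.1-70B-Instruct

Note: the 70B model needs ~140 GB at 16-bit precision. It fits on 1× A100 80GB only with quantization (~75 GB at 8-bit), and even then it is tight.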

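For 70B+ models without aggressive quantization, use 2× A100 80GB (a single H100 also has 80 GB, so it adds speed but not memory headroom). The price roughly doubles, but the quality jump is significant for complex reports.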

Recommended start: Qwen2.5-32B-Instruct — fast, fits comfortably, strong multilingual (Polish/German/English), good instruction following.
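
A rough way to sanity-check whether a model fits: weight memory in GB is approximately parameters (in billions) × bytes per parameter. The sketch below covers weights only. The 8 GB headroom for KV cache and runtime overhead is an assumption, not a measured value, and both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Weights-only estimate: params (billions) x bytes per parameter = GB.
# 2 bytes = FP16/BF16, 1 byte = 8-bit, 0.5 byte = 4-bit quantization.
weights_gb() {
  local params_b="$1" bytes_per_param="$2"
  LC_ALL=C awk -v p="$params_b" -v b="$bytes_per_param" \
    'BEGIN { printf "%.0f\n", p * b }'
}

# Returns 0 if weights plus headroom fit on the GPU.
# headroom_gb (default 8) is an assumed allowance for KV cache and overhead.
fits_on_gpu() {
  local params_b="$1" bytes="$2" gpu_gb="${3:-80}" headroom_gb="${4:-8}"
  local need
  need="$(weights_gb "$params_b" "$bytes")"
  [ $(( need + headroom_gb )) -le "$gpu_gb" ]
}

for spec in "32 2" "70 2" "70 1"; do
  set -- $spec
  if fits_on_gpu "$1" "$2"; then verdict="fits"; else verdict="does not fit"; fi
  echo "${1}B at ${2} byte(s)/param: $(weights_gb "$1" "$2") GB weights, ${verdict} on 80 GB"
done
```

Long prompts and large batch sizes increase KV cache use, so treat a "fits" result near the limit as a prompt to test on real hardware.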


Option A — Hugging Face Inference Endpoints

Initial setup (one-time, via HF UI)

  1. Go to huggingface.co/inference-endpoints
  2. New endpoint → pick model (e.g. Qwen/Qwen2.5-32B-Instruct)
  3. Hardware: 1× NVIDIA A100 80GB
  4. Framework: Text Generation Inference (TGI)
  5. Scaling: min=0, max=1 (min=0 allows scale-to-zero when idle; pausing is a separate, explicit action)
  6. Region: eu-west-1 (closest to bms-4)
  7. Note the endpoint URL and name

API — start/pause/status

# Base: https://api.endpoints.huggingface.cloud/v2/endpoint/{namespace}/{name}
# Auth: Authorization: Bearer ${HF_API_TOKEN}

# Resume (billing starts, cold start ~3-5 min)
curl -X POST "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports/resume" \
  -H "Authorization: Bearer ${HF_API_TOKEN}"

# Check status
curl -s "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports" \
  -H "Authorization: Bearer ${HF_API_TOKEN}" | jq '.status.state'
# Returns: "scaledToZero" | "pending" | "initializing" | "running" | "paused"

# Pause (billing stops immediately)
curl -X POST "https://api.endpoints.huggingface.cloud/v2/endpoint/radieu/nightly-reports/pause" \
  -H "Authorization: Bearer ${HF_API_TOKEN}"
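
Because cold start time varies, a fixed wait is best combined with a status poll. The sketch below is one way to do that. hf_state and wait_until_running are hypothetical helpers; HF_API_BASE is an assumed variable holding the base URL listed above for this endpoint; and stopping early on a "failed" state is an assumption about the API's state names.

```shell
#!/usr/bin/env bash
# Prints the current endpoint state, e.g. "running".
# Assumes HF_API_BASE (the endpoint's base URL) and HF_API_TOKEN are exported.
hf_state() {
  curl -fsS "${HF_API_BASE}" \
    -H "Authorization: Bearer ${HF_API_TOKEN}" | jq -r '.status.state'
}

# Polls a state command until it prints "running".
# Args: max attempts, seconds between attempts, then the command to run.
# Returns 0 when running, 2 on an assumed terminal "failed" state, 1 on timeout.
wait_until_running() {
  local max_attempts="$1" interval="$2"
  shift 2
  local attempt state
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    state="$("$@")" || state="unknown"
    echo "attempt ${attempt}/${max_attempts}: state=${state}" >&2
    case "$state" in
      running) return 0 ;;
      failed)  return 2 ;;
    esac
    if (( attempt < max_attempts )); then
      sleep "$interval"
    fi
  done
  return 1
}

# Usage (not executed here): 10 attempts, 30 s apart.
# wait_until_running 10 30 hf_state || echo "endpoint did not start" >&2
```

Passing the state command as an argument keeps the loop testable without network access and lets the same loop poll a different provider.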

Inference call (OpenAI-compatible)

curl "https://{endpoint-url}/v1/chat/completions" \
  -H "Authorization: Bearer ${HF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Generate fleet report for..."}],
    "max_tokens": 4096,
    "temperature": 0.3
  }'
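
An OpenAI-compatible response carries the generated text in choices[0].message.content and the stop cause in choices[0].finish_reason. A minimal sketch for pulling out the report and flagging truncation follows; extract_report is a hypothetical helper, and the return code 3 is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Reads a chat-completions JSON response on stdin and prints the report text.
# Returns 3 if generation stopped because max_tokens was reached
# (finish_reason == "length"), so the caller can retry or split the report.
extract_report() {
  local response reason
  response="$(cat)"
  reason="$(jq -r '.choices[0].finish_reason // "unknown"' <<<"$response")"
  jq -r '.choices[0].message.content // empty' <<<"$response"
  if [ "$reason" = "length" ]; then
    echo "warning: output truncated at max_tokens" >&2
    return 3
  fi
}

# Usage (not executed here):
# curl -fsS "https://{endpoint-url}/v1/chat/completions" ... | extract_report > report.md
```

Checking finish_reason at this point catches incomplete reports before they are saved.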

Required secrets (add to /opt/bms4-services/.env and GH Secrets)

Variable              | Value
----------------------|---------------------------------------------
HF_API_TOKEN          | HF token with inference-endpoints:write scope
HF_ENDPOINT_NAMESPACE | radieu
HF_ENDPOINT_NAME      | e.g. nightly-reports
HF_ENDPOINT_URL       | URL from HF dashboard

Option B — RunPod

Why RunPod over HF for batch

  • ~40% cheaper on-demand, ~60% cheaper on spot
  • Serverless option: pay per second of actual compute, no idle billing even while “running”
  • Better for bursty workloads (reports that take varying time)

Setup (one-time)

  1. Create account at runpod.io
  2. Deploy → Serverless → New Endpoint
  3. Select template: vLLM (pre-built, OpenAI-compatible)
  4. Worker: 1× A100 80GB, max workers: 1
  5. Set environment: MODEL_NAME=Qwen/Qwen2.5-32B-Instruct
  6. Note the endpoint ID

API

# Serverless: no start/stop needed — scales to zero automatically
# Just POST jobs, RunPod spins up a worker on demand
 
# Inference
curl "https://api.runpod.ai/v2/{endpoint-id}/runsync" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "messages": [{"role": "user", "content": "Generate report..."}],
      "sampling_params": {"max_tokens": 4096, "temperature": 0.3}
    }
  }'
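
runsync holds the HTTP connection open until the job finishes, which can be fragile when a cold start and a long generation stack up. A sketch of the asynchronous alternative follows, assuming RunPod's /run and /status/{id} routes and their COMPLETED / FAILED / CANCELLED / TIMED_OUT status values. runpod_run_async is a hypothetical helper, and the attempt count and interval are arbitrary defaults.

```shell
#!/usr/bin/env bash
# Submits a job via /run, then polls /status/{id} until it finishes.
# Assumes RUNPOD_API_KEY and RUNPOD_ENDPOINT_ID are exported.
# Args: request JSON, max attempts (default 60), seconds between polls (default 10).
# Prints the job output as compact JSON on success.
runpod_run_async() {
  local body="$1" max_attempts="${2:-60}" interval="${3:-10}"
  local base="https://api.runpod.ai/v2/${RUNPOD_ENDPOINT_ID}"
  local job_id status result attempt

  job_id="$(curl -fsS "${base}/run" \
    -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$body" | jq -r '.id')"
  if [ -z "$job_id" ] || [ "$job_id" = "null" ]; then
    echo "job submission failed" >&2
    return 1
  fi

  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    result="$(curl -fsS "${base}/status/${job_id}" \
      -H "Authorization: Bearer ${RUNPOD_API_KEY}")"
    status="$(jq -r '.status' <<<"$result")"
    case "$status" in
      COMPLETED) jq -c '.output' <<<"$result"; return 0 ;;
      FAILED|CANCELLED|TIMED_OUT)
        echo "job ${job_id} ended: ${status}" >&2; return 2 ;;
    esac
    sleep "$interval"
  done
  echo "job ${job_id} still not finished after ${max_attempts} polls" >&2
  return 1
}
```

The request JSON has the same shape as the runsync body above, so switching between the two calls does not require changing the payload.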

Required secrets

Variable           | Value
-------------------|-----------------------
RUNPOD_API_KEY     | RunPod API key
RUNPOD_ENDPOINT_ID | Serverless endpoint ID

n8n Workflow — Nightly GPU Batch

Workflow file: infra-src/n8n-workflows/nightly-gpu-batch.json

Flow

02:00 cron
  → Scale n8n: stop worker-2, worker-3
  → Resume HF endpoint (or RunPod wakes on first call)
  → Wait 4 min (cold start)
  → Poll status loop (max 10 attempts × 30s)
  → IF status == "running"
      → Run each report config (split in batches)
          → POST /v1/chat/completions
          → save result to Supabase / upload to GDrive
      → Pause HF endpoint
      → Start worker-2, worker-3
      → Discord: ✅ reports done, cost estimate
    ELSE (timeout)
      → Pause endpoint (don't leave it running)
      → Start workers
      → Discord: ❌ GPU endpoint failed to start
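
The flow above pauses the endpoint on both branches, but a crash in the middle of the report loop would skip both. One way to guarantee cleanup outside n8n is a shell wrapper with an EXIT trap. This is a sketch under assumptions: every hook it calls (stop_workers, resume_endpoint, endpoint_ready, run_reports, pause_endpoint, start_workers, notify) is a hypothetical function you would define around the commands in this workbook.

```shell
#!/usr/bin/env bash
# Nightly flow with guaranteed cleanup. The function body is a subshell,
# so the EXIT trap fires on every path out of it, including early exits.
nightly_batch() (
  cleanup() {
    pause_endpoint || notify "pause failed, check billing manually"
    start_workers  || notify "workers not restored"
  }
  trap cleanup EXIT

  stop_workers    || exit 1
  resume_endpoint || exit 1

  if ! endpoint_ready; then
    notify "GPU endpoint failed to start"
    exit 1
  fi

  if ! run_reports; then
    notify "report run failed"
    exit 1
  fi

  notify "reports done"
)
```

The trap runs pause_endpoint before start_workers on purpose: stopping billing matters more than restoring capacity if only one of the two succeeds.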

Report config format

Each report is a row in audit.actions with action_type = 'ai_workbook'. The workflow reads the active actions, generates each report via the GPU endpoint instead of claude-proxy, and writes the result to audit.runs.

Scaling n8n from within n8n

The workflow runs these commands on the host through an SSH node or an Execute Command node:

# Scale down before the batch
cd /opt/bms4-services && docker compose stop n8n-worker-2 n8n-worker-3
 
# Scale up after
cd /opt/bms4-services && docker compose up -d n8n-worker-2 n8n-worker-3
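
If the two commands are called from several places (the workflow, manual recovery), a small wrapper keeps the worker list in one spot. A sketch, where scale_workers is a hypothetical function and BMS4_DIR is an assumed override for the compose directory:

```shell
#!/usr/bin/env bash
# Wraps the two docker compose commands above.
# worker-1 is deliberately absent from the list so the GPS cron keeps running.
scale_workers() {
  local action="$1" dir="${BMS4_DIR:-/opt/bms4-services}"
  local workers=(n8n-worker-2 n8n-worker-3)
  case "$action" in
    down) ( cd "$dir" && docker compose stop "${workers[@]}" ) ;;
    up)   ( cd "$dir" && docker compose up -d "${workers[@]}" ) ;;
    *)    echo "usage: scale_workers down|up" >&2; return 64 ;;
  esac
}

# Usage (not executed here):
# scale_workers down   # before the batch
# scale_workers up     # after the batch, or manually after a crash
```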

Cost Monitoring

Add to Prometheus / cost-exporter:

Metric                             | Description
-----------------------------------|------------------------------
gpu_batch_session_duration_seconds | How long endpoint was running
gpu_batch_reports_total            | Reports processed per session
gpu_batch_estimated_cost_eur       | duration × hourly rate
gpu_batch_last_success_timestamp   | Last successful run

After each session, a script writes a .prom file to the node_exporter textfile collector directory.
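
A possible shape for that script, as a sketch only: write_gpu_batch_metrics is a hypothetical function, the default directory is an assumption (use whatever is passed to node_exporter's --collector.textfile.directory flag), and the hourly rate is assumed to be already expressed in EUR.

```shell
#!/usr/bin/env bash
# Writes the four gpu_batch_* metrics in Prometheus text format.
# Args: session duration in seconds, reports processed, hourly rate in EUR.
write_gpu_batch_metrics() {
  local duration_s="$1" reports="$2" rate_per_hour="$3"
  # Assumed default path; override with TEXTFILE_DIR.
  local dir="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
  local cost tmp
  cost="$(LC_ALL=C awk -v d="$duration_s" -v r="$rate_per_hour" \
    'BEGIN { printf "%.2f", d / 3600 * r }')"
  # Temp name does not end in .prom, so the collector ignores it.
  tmp="$(mktemp "${dir}/gpu_batch.prom.XXXXXX")" || return 1
  cat > "$tmp" <<EOF
# HELP gpu_batch_session_duration_seconds How long the endpoint was running.
# TYPE gpu_batch_session_duration_seconds gauge
gpu_batch_session_duration_seconds ${duration_s}
# HELP gpu_batch_reports_total Reports processed in the last session.
# TYPE gpu_batch_reports_total gauge
gpu_batch_reports_total ${reports}
# HELP gpu_batch_estimated_cost_eur Session duration times hourly rate.
# TYPE gpu_batch_estimated_cost_eur gauge
gpu_batch_estimated_cost_eur ${cost}
# HELP gpu_batch_last_success_timestamp Unix time of the last successful run.
# TYPE gpu_batch_last_success_timestamp gauge
gpu_batch_last_success_timestamp $(date +%s)
EOF
  chmod 644 "$tmp"
  # Rename within the same directory so a scrape never sees a partial file.
  mv "$tmp" "${dir}/gpu_batch.prom"
}

# Usage (not executed here): 4 h session, 12 reports, 2.00 EUR/hour.
# write_gpu_batch_metrics 14400 12 2.00
```

Call it only on the success path, since it always refreshes gpu_batch_last_success_timestamp.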


Troubleshooting

Symptom                         | Cause                             | Fix
--------------------------------|-----------------------------------|--------------------------------------------------
Endpoint stuck in “initializing” | Model too large for hardware tier | Upgrade hardware or use smaller model
503 from inference endpoint     | Cold start not complete           | Increase wait time in n8n (try 6 min)
Reports incomplete              | Token limit hit                   | Increase max_tokens or split report into sections
n8n workers not restored        | Workflow crashed mid-run          | Run testing-window-end.sh manually
HF billing continued            | Pause call failed                 | Check HF dashboard, pause manually

Password Rotation

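Credential     | Variable       | Rotation
---------------|----------------|----------------------------------------------------------------------------------
HF API Token   | HF_API_TOKEN   | Regenerate at huggingface.co/settings/tokens. Update .env on bms-4 + GH Secret.
RunPod API Key | RUNPOD_API_KEY | Regenerate at runpod.io/console/user/settings. Same procedure.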