Self-assessment & reference
AI Infrastructure Reference Architecture
A 20-page reference architecture for teams putting ML and LLM workloads into production. It documents the stack we deploy in client engagements: written down, opinionated, and free.
What's inside: GPU cluster topology decisions, model-serving runtime selection, MLOps platform components, LLM observability, RAG infrastructure patterns, a cost-discipline playbook, security and compliance controls, and on-call patterns specific to AI workloads.
Contents
-
Chapter 01
GPU cluster topology
Node group design, autoscaling envelope, Spot vs. on-demand mix, multi-region resilience patterns. Sample Terraform modules for EKS / GKE / AKS.
-
Chapter 02
Model serving runtime selection
When to pick vLLM, TGI, Triton, or a hosted API. Throughput vs. latency trade-offs, batching strategy, KV-cache management, prompt caching.
-
Chapter 03
MLOps platform layer
Workflow orchestration (Argo Workflows, Kubeflow), model registry (MLflow, Weights & Biases), feature store, experiment tracking. Decision matrix per team size.
-
Chapter 04
LLM observability stack
Tracing (Langfuse, Phoenix, Helicone), evaluation (Ragas, LangSmith), cost attribution, hallucination tracking. How it all integrates with your existing observability stack.
-
Chapter 05
RAG infrastructure
Vector DB selection (Pinecone, pgvector, Weaviate, Qdrant), embedding pipelines, retrieval and re-ranking, freshness strategies, eval harnesses.
-
Chapter 06
Cost discipline
Per-model and per-tenant cost attribution, Spot orchestration, dynamic batching, speculative decoding, prompt caching. Real numbers from production engagements.
-
Chapter 07
Security and compliance
PHI / PCI / confidential data flows, prompt injection defence, model artefact provenance, audit logging on every model call.
-
Chapter 08
On-call and incidents
SLOs that work for AI workloads, runbook patterns for the failure modes that bite, model rollback as a deploy operation.
Want help applying this to your own AI stack?
A 30-minute call covers the highest-leverage things to do first.
Book a 30-min call →