AI Infrastructure Reference Architecture

A 20-page reference architecture for teams putting ML and LLM workloads into production. The stack we deploy in client engagements — written down, opinionated, and free.

What's inside: GPU cluster topology decisions, model-serving runtime selection, MLOps platform components, LLM observability, RAG infrastructure patterns, a cost-discipline playbook, security and compliance controls, and on-call patterns specific to AI workloads.

Chapter 01

GPU cluster topology

Node group design, autoscaling envelope, Spot vs. on-demand mix, multi-region resilience patterns. Sample Terraform modules for EKS / GKE / AKS.

Chapter 02

Model serving runtime selection

When to pick vLLM, TGI, Triton, or a hosted API. Throughput vs. latency trade-offs, batching strategy, KV-cache management, prompt caching.
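
As a rough illustration of why KV-cache management drives the throughput vs. latency trade-off, the sketch below estimates cache size per token and how many sequences fit in a fixed memory budget. The model shape, the 20 GiB budget, and the helper names are assumptions chosen for the example, not figures from the reference architecture; real runtimes add their own overheads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Attention geometry needed to size the KV cache (illustrative values only)."""
    num_layers: int
    num_kv_heads: int          # smaller than the attention head count under GQA/MQA
    head_dim: int
    bytes_per_value: int = 2   # fp16 / bf16


def kv_cache_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.bytes_per_value


def max_concurrent_sequences(shape: ModelShape, cache_budget_bytes: int,
                             tokens_per_sequence: int) -> int:
    # How many full-length sequences fit if each one reserves its whole context.
    per_sequence = kv_cache_bytes_per_token(shape) * tokens_per_sequence
    return cache_budget_bytes // per_sequence


# Hypothetical 32-layer model with 8 KV heads of dimension 128, stored in fp16.
example = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
per_token = kv_cache_bytes_per_token(example)                      # 131072 bytes (128 KiB)
fits = max_concurrent_sequences(example, 20 * 1024**3, 4096)       # 40 sequences
print(f"{per_token} bytes per token, {fits} concurrent 4096-token sequences")
```

Under these assumptions the cache, not compute, caps the batch size, which is why paged or shared-prefix cache schemes tend to matter when choosing a runtime.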

Chapter 03

MLOps platform layer

Workflow orchestration (Argo Workflows, Kubeflow), model registry (MLflow, Weights & Biases), feature store, experiment tracking. Decision matrix per team size.

Chapter 04

LLM observability stack

Tracing (Langfuse, Phoenix, Helicone), eval (Ragas, LangSmith), cost attribution, hallucination tracking. How it integrates with your existing observability.

Chapter 05

RAG infrastructure

Vector DB selection (Pinecone, pgvector, Weaviate, Qdrant), embedding pipelines, retrieval and re-ranking, freshness strategies, eval harnesses.

Chapter 06

Cost discipline

Per-model and per-tenant cost attribution, Spot orchestration, dynamic batching, speculative decoding, prompt caching. Real numbers from production engagements.
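
One way per-model and per-tenant cost attribution could look in practice is a roll-up of token usage records keyed by tenant and model. This is a minimal sketch: the model names, prices, and the `CallRecord` shape are hypothetical placeholders, not numbers or schemas from the reference architecture.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical USD prices per 1,000 tokens as (input, output); substitute real rates.
PRICE_PER_1K_TOKENS = {
    "model-small": (Decimal("0.50"), Decimal("1.50")),
    "model-large": (Decimal("3.00"), Decimal("15.00")),
}


@dataclass(frozen=True)
class CallRecord:
    """One model call as it might appear in an audit or tracing log (assumed shape)."""
    tenant: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(record: CallRecord) -> Decimal:
    input_price, output_price = PRICE_PER_1K_TOKENS[record.model]
    return (record.input_tokens * input_price + record.output_tokens * output_price) / 1000


def attribute_costs(records):
    # Sum the cost of every call under its (tenant, model) key.
    totals = defaultdict(Decimal)
    for record in records:
        totals[(record.tenant, record.model)] += call_cost(record)
    return dict(totals)


calls = [
    CallRecord("acme", "model-small", 2000, 1000),
    CallRecord("acme", "model-large", 1000, 500),
    CallRecord("globex", "model-small", 4000, 0),
    CallRecord("acme", "model-small", 2000, 1000),
]
costs = attribute_costs(calls)
for (tenant, model), usd in sorted(costs.items()):
    print(f"{tenant:8s} {model:12s} ${usd}")
```

`Decimal` is used here so that summed charges do not pick up binary floating-point rounding; the same grouping key can be extended with a feature or environment label if finer attribution is needed.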

Chapter 07

Security and compliance

PHI / PCI / confidential data flows, prompt injection defence, model artefact provenance, audit logging on every model call.

Chapter 08

On-call and incidents

SLOs that work for AI workloads, runbook patterns for the failure modes that bite, model rollback as a deploy operation.
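
To make the SLO idea concrete, here is a small sketch of error-budget burn-rate alerting, one common way to turn an SLO into a paging rule. What counts as a "failed" request (an error, a latency breach, a failed quality check) is left as an assumption, and the 14.4 threshold is a widely used multi-window default rather than a value taken from this document.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target          # e.g. 0.005 for a 99.5% SLO
    return (failed / total) / error_budget


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    # Require both windows to exceed the threshold so brief blips do not page.
    return short_window_rate >= threshold and long_window_rate >= threshold


# Hypothetical counts for a model endpoint with a 99.5% SLO.
short = burn_rate(failed=800, total=10_000, slo_target=0.995)     # about 16x budget
long_ = burn_rate(failed=9_000, total=120_000, slo_target=0.995)  # about 15x budget
print(f"short={short:.1f}x long={long_:.1f}x page={should_page(short, long_)}")
```

A page from a rule like this is a natural trigger for the rollback runbook: if the burn started with a model or prompt deploy, reverting that deploy is the first action to try.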

Want help applying this to your own AI stack?

A 30-minute call covers the highest-leverage things to do first.

Book a 30-min call →
