CloudWizz

Self-assessment & reference

AI Infrastructure Reference Architecture

A 20-page reference architecture for teams putting ML and LLM workloads into production. It is the stack we deploy in client engagements: written down, opinionated, and free.

What's inside: GPU cluster topology decisions, model-serving runtime selection, MLOps platform components, LLM observability, RAG infrastructure patterns, a cost-discipline playbook, security and compliance, and on-call patterns specific to AI workloads.

Contents

  • Chapter 01

    GPU cluster topology

    Node group design, autoscaling envelope, Spot vs. on-demand mix, multi-region resilience patterns. Sample Terraform modules for EKS / GKE / AKS.

  • Chapter 02

    Model-serving runtime selection

    When to pick vLLM, TGI, Triton, or a hosted API. Throughput vs. latency trade-offs, batching strategy, KV-cache management, prompt caching.

  • Chapter 03

    MLOps platform layer

    Workflow orchestration (Argo Workflows, Kubeflow), model registry (MLflow, Weights & Biases), feature store, experiment tracking. Decision matrix per team size.

  • Chapter 04

    LLM observability stack

    Tracing (Langfuse, Phoenix, Helicone), eval (Ragas, LangSmith), cost attribution, hallucination tracking. How it integrates with your existing observability.

  • Chapter 05

    RAG infrastructure

    Vector DB selection (Pinecone, pgvector, Weaviate, Qdrant), embedding pipelines, retrieval and re-ranking, freshness strategies, eval harnesses.

  • Chapter 06

    Cost discipline

    Per-model and per-tenant cost attribution, Spot orchestration, dynamic batching, speculative decoding, prompt caching. Real numbers from production engagements.

  • Chapter 07

    Security and compliance

    PHI / PCI / confidential data flows, prompt injection defence, model artefact provenance, audit logging on every model call.

  • Chapter 08

    On-call and incidents

    SLOs that work for AI workloads, runbook patterns for the failure modes that bite, model rollback as a deploy operation.

Want help applying this to your own AI stack?

A 30-minute call covers the highest-leverage things to do first.
