CloudWizz

Blog · May 5, 2026 · 5 min

GPU node pools on Kubernetes — five sharp edges to know

The non-obvious failure modes that bite teams running their first GPU workload on EKS, GKE, or AKS.

By HarmanJyot Kaur

Most “Kubernetes for AI” guides cover the happy path. This is the list of things that have surprised us — or our clients — when GPU workloads first hit production.

1. Quota is per region, per account, per instance family

You can terraform apply a Karpenter NodePool that requests p4d.24xlarge and watch your pods sit Pending indefinitely, with no obvious error, because your account has zero quota for that instance family in that region. Check before the deploy: Service Quotas on AWS (aws service-quotas in the CLI), the compute.googleapis.com quotas on GCP, or the quota page in the Azure portal. Requesting an increase mid-incident is slower than you’d expect.
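A quota can also be non-zero and still too small. AWS counts EC2 On-Demand quota in vCPUs per instance family group, and a single p4d.24xlarge is 96 vCPUs. Here is a minimal Python sketch of the headroom arithmetic; the quota and usage numbers are hypothetical inputs you would read from Service Quotas yourself.

```python
# p4d.24xlarge is published as 96 vCPUs (8x A100).
P4D_24XLARGE_VCPUS = 96


def nodes_that_fit(quota_vcpus: int, in_use_vcpus: int, vcpus_per_node: int) -> int:
    """How many more nodes of this size the remaining vCPU quota allows."""
    headroom = max(quota_vcpus - in_use_vcpus, 0)
    return headroom // vcpus_per_node


def quota_needed(node_count: int, vcpus_per_node: int, in_use_vcpus: int = 0) -> int:
    """Minimum vCPU quota required to run node_count nodes on top of current usage."""
    return in_use_vcpus + node_count * vcpus_per_node


# Hypothetical account: 64 vCPUs of quota, nothing running yet.
print(nodes_that_fit(64, 0, P4D_24XLARGE_VCPUS))   # 0: not even one node fits
print(quota_needed(4, P4D_24XLARGE_VCPUS))         # 384 vCPUs for a four-node pool
```

Running this for the largest instance type in the NodePool before the first apply turns a silent Pending into a quota request you can file ahead of time.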

2. The NVIDIA device plugin is not optional

Without it, Kubernetes doesn’t know the GPUs exist: the nodes never advertise the nvidia.com/gpu resource, so pods that request it stay Pending with an “Insufficient nvidia.com/gpu” scheduling event, even on a healthy GPU node. Symptom: nvidia.com/gpu doesn’t appear under Capacity or Allocatable in kubectl describe node. Fix: install the NVIDIA device plugin DaemonSet, or use the GPU Operator if you also need driver management.
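The symptom check is easy to automate. The sketch below is Python and assumes you feed it the JSON from kubectl get nodes -o json, filtered to your GPU pool with whatever label you use. The function names and sample data are illustrative, not part of any tool.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def advertised_gpus(node: dict) -> int:
    """GPUs the node advertises as allocatable; 0 if the resource is absent."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(GPU_RESOURCE, "0"))


def nodes_missing_gpus(node_list: dict) -> list:
    """Names of nodes in a `kubectl get nodes -o json` list that advertise no GPUs."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if advertised_gpus(node) == 0
    ]


# Made-up sample in the shape kubectl returns: one healthy node, one without the plugin.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-a"},
   "status": {"allocatable": {"cpu": "95", "nvidia.com/gpu": "8"}}},
  {"metadata": {"name": "gpu-node-b"},
   "status": {"allocatable": {"cpu": "95"}}}
]}
""")

print(nodes_missing_gpus(SAMPLE))  # ['gpu-node-b']
```

Any node name this returns for a GPU pool is a node where the device plugin is missing or not yet ready.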

3. Image pull is your cold-start tax

A vLLM image with a 7B model pre-baked is 15+ GB. Pulling that onto a fresh node adds 60–120 seconds to scale-up latency. Three mitigations, in order of effort:

  1. Pre-pull common images on node bootstrap (cloud-init or DaemonSet).
  2. Use a node group with a custom AMI that has the image cached.
  3. Don’t bake the model into the image — load from S3/GCS via init container and a shared emptyDir.
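The third option can be sketched as a Pod manifest, built here as a Python dict and printed as JSON (kubectl apply -f accepts JSON). Everything specific is an assumption for illustration: the bucket path, the amazon/aws-cli and vllm/vllm-openai images, and the /models mount path. The init container also needs credentials to read the bucket, which this sketch leaves out.

```python
import json


def inference_pod(model_uri: str, model_dir: str = "/models") -> dict:
    """Pod whose init container copies the model into an emptyDir shared with the server."""
    volume = {"name": "model-cache", "emptyDir": {}}
    mount = {"name": "model-cache", "mountPath": model_dir}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "vllm-example"},  # hypothetical name
        "spec": {
            "volumes": [volume],
            "initContainers": [{
                "name": "fetch-model",
                "image": "amazon/aws-cli",  # assumption: any image with an S3 client works
                # The image's entrypoint is `aws`, so these are its arguments.
                "args": ["s3", "sync", model_uri, model_dir],
                "volumeMounts": [mount],
            }],
            "containers": [{
                "name": "server",
                "image": "vllm/vllm-openai",  # assumption: slim image without baked weights
                "args": ["--model", model_dir],
                "resources": {"limits": {"nvidia.com/gpu": 1}},
                "volumeMounts": [mount],
            }],
        },
    }


manifest = inference_pod("s3://example-bucket/models/example-7b")  # hypothetical bucket
print(json.dumps(manifest, indent=2))
```

The trade-off is that the download moves from image pull to pod start, so this mostly helps when the image is shared across models or the object store is faster than your registry.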

4. Spot interruptions hit harder than you expect

GPU Spot pricing is great — until your inference replica gets pulled mid-stream and the user sees a hung connection. Two things to do:

  • Run a Spot-aware ingress that retries on a different replica when one disappears (Envoy or NGINX with proper retry config).
  • For latency-critical traffic, route to on-demand replicas with Spot as overflow capacity. Pure-Spot is fine for batch.
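Both bullets come down to an ordering rule plus a retry loop. The Python sketch below shows the logic only; in practice it lives in Envoy or NGINX config, not application code. ReplicaGone, the send callback, and the replica names are hypothetical stand-ins.

```python
from typing import Callable, Sequence


class ReplicaGone(Exception):
    """Hypothetical error raised by `send` when a replica drops the connection."""


def order_replicas(on_demand: Sequence[str], spot: Sequence[str],
                   latency_critical: bool) -> list:
    """On-demand first with Spot as overflow for critical traffic; Spot only for batch."""
    if latency_critical:
        return list(on_demand) + list(spot)
    return list(spot)


def send_with_failover(request, replicas: Sequence[str],
                       send: Callable[[str, object], str],
                       max_attempts: int = 3) -> str:
    """Try replicas in order, moving to the next one when a replica disappears."""
    last_error = None
    for replica in list(replicas)[:max_attempts]:
        try:
            return send(replica, request)
        except ReplicaGone as err:
            last_error = err
    raise RuntimeError("all attempted replicas failed") from last_error
```

One caveat: a proxy can only retry transparently before it has started streaming the response to the client. Once tokens are flowing, an interruption has to be handled by the client re-issuing the request.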

5. KEDA-style scaling beats HPA-on-CPU

GPU utilization isn’t visible to the standard HPA, which out of the box only sees CPU and memory. CPU utilization on an inference pod is a poor proxy: it can sit at 5% while the GPU is saturated. Use a custom metric:

  • Request queue depth (vLLM exposes this).
  • In-flight requests.
  • First-token latency.
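To make queue-depth scaling concrete, here is the calculation HPA performs for an average-value target: total metric divided by the per-replica target, rounded up, then clamped to the replica bounds. The target of 8 queued requests per replica and the bounds are hypothetical. Tune them against your own latency numbers.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replica count that brings the average queue depth per replica to the target."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(0, 8))     # 1: never below the floor
print(desired_replicas(41, 8))    # 6: 41 / 8 rounds up
print(desired_replicas(1000, 8))  # 10: capped at the ceiling
```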

KEDA with a Prometheus exporter is the easiest path. Set the scale-down delay to 5–10 minutes: warming a model onto a fresh replica is expensive, so you don’t want to shed capacity on a brief lull.
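A rough sketch of what that looks like as a KEDA ScaledObject, again built as a Python dict and printed as JSON. The Deployment name, Prometheus address, and replica bounds are hypothetical. The query assumes vLLM exports its waiting-queue gauge as vllm:num_requests_waiting, so check the name against your version’s /metrics output. The scale-down delay is expressed here as the HPA stabilization window, which KEDA passes through.

```python
import json


def vllm_scaled_object(deployment: str, prometheus_url: str,
                       queue_target: int = 8,
                       scale_down_window_s: int = 600) -> dict:
    """KEDA ScaledObject that scales a Deployment on request queue depth."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-queue-depth"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 1,   # hypothetical bounds
            "maxReplicaCount": 10,
            "advanced": {
                "horizontalPodAutoscalerConfig": {
                    "behavior": {
                        # Wait this long before acting on a lower replica recommendation.
                        "scaleDown": {"stabilizationWindowSeconds": scale_down_window_s}
                    }
                }
            },
            "triggers": [{
                "type": "prometheus",
                "metadata": {
                    "serverAddress": prometheus_url,
                    # Assumed metric name; verify before relying on it.
                    "query": "sum(vllm:num_requests_waiting)",
                    "threshold": str(queue_target),
                },
            }],
        },
    }


scaled_object = vllm_scaled_object("vllm", "http://prometheus.monitoring:9090")
print(json.dumps(scaled_object, indent=2))
```

The default of 600 seconds sits at the top of the 5–10 minute range. Lower it only once you have measured how long a cold replica takes to start serving.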

What to do with this

Treat GPU workloads as their own platform domain. The failure modes don’t look like CPU workloads, the cost economics don’t look like CPU workloads, and the on-call playbook needs to reflect that. We’ve packaged the patterns that survive into our AI Infrastructure service.

Tags: ai, kubernetes, gpu, eks, gke