CloudWizz

Blog · May 8, 2026 · 6 min

Running vLLM on EKS — a production checklist

The non-obvious settings that decide whether your LLM serving is reliable at p99, not just at p50.

By HarmanJyot Kaur

Most teams get vLLM working in a single GKE or EKS pod in an afternoon. Running it in production — with predictable p99 latency, sane autoscaling, and survivable failure modes — takes longer. This is the checklist we walk new clients through.

1. Pick the right GPU node group

For 7B–13B models, a g5.xlarge (A10G, 24 GB) is usually enough for a single replica with continuous batching, though a 13B model in fp16 carries roughly 26 GB of weights and needs quantization to fit. For 70B, you’re in p4d / p5 (A100 / H100) territory. The traps:

  • Spot vs. on-demand. Inference workloads can tolerate Spot for non-prime-time tiers; user-facing prod usually shouldn’t. Mix capacity types per traffic class, don’t average them.
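  • Quota. EKS will happily create a node group whose instance type your account has zero quota for; the instances just never launch. Confirm GPU quota in Service Quotas before declaring “deployable.”
  • NVIDIA device plugin. It must be installed on the cluster, and the serving container must request nvidia.com/gpu: 1 under resources.limits in the pod spec. A missing plugin is the #1 reason a GPU pod sits Pending; a missing request lets the pod schedule onto a node where it never sees a GPU.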
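The GPU request is easy to lint before anything reaches the cluster. A minimal sketch, assuming the manifest is built as a plain dictionary and dumped to JSON (which kubectl accepts); the pod name, image, and model ID in the usage note are placeholders, not values from this checklist, and the image is assumed to take --model as an argument.

```python
import json


def vllm_pod_manifest(name, image, model, gpus=1):
    """Build a bare Pod manifest whose container asks for `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "vllm",
                    "image": image,
                    # Assumption: the image's entrypoint accepts --model.
                    "args": ["--model", model],
                    # Extended resources such as GPUs are declared under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ]
        },
    }


def requested_gpus(manifest):
    """Sum the nvidia.com/gpu limits across all containers in a Pod manifest."""
    total = 0
    for container in manifest["spec"]["containers"]:
        limits = container.get("resources", {}).get("limits", {})
        total += int(limits.get("nvidia.com/gpu", 0))
    return total


def to_json(manifest):
    """Serialize the manifest so it can be piped to `kubectl apply -f -`."""
    return json.dumps(manifest, indent=2)
```

A CI step could call requested_gpus(vllm_pod_manifest("vllm-7b", "example/vllm-image:tag", "example-org/example-7b")) and fail the build when the result is 0.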

2. Tune the serving runtime

vLLM’s defaults are good, but two flags change everything:

  • --max-model-len — set this to the largest context you actually serve, not the model’s max. Bigger context means bigger KV cache means fewer concurrent requests.
  • --gpu-memory-utilization — default 0.9 is fine in isolation. If you’re co-tenanting with monitoring sidecars or running inference + a small embedding model in the same pod, drop to 0.85.
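The trade-off between context length and concurrency is plain arithmetic. A rough sketch, assuming a Llama-2-7B-style architecture (32 layers, 32 KV heads, head dimension 128, fp16) and about 13 GiB of weights; those figures are illustrative assumptions, and the result is an upper bound because it ignores activation memory and other runtime overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_full_length_sequences(gpu_mem_gib, gpu_memory_utilization,
                              weights_gib, max_model_len, bytes_per_token):
    """Upper bound on concurrent sequences that each fill max_model_len tokens."""
    # Memory vLLM may use, minus the model weights, is what the KV cache gets.
    budget_gib = gpu_mem_gib * gpu_memory_utilization - weights_gib
    if budget_gib <= 0:
        return 0
    cache_tokens = int(budget_gib * 1024**3 // bytes_per_token)
    return cache_tokens // max_model_len


# Assumed 7B-class shape on a 24 GiB A10G at the default 0.9 utilization.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
for context in (2048, 4096, 8192):
    n = max_full_length_sequences(24, 0.9, 13.0, context, per_token)
    print(f"max-model-len={context}: at most {n} full-length sequences")
```

Under these assumptions each token costs 0.5 MiB of cache, so halving max-model-len roughly doubles the number of worst-case requests that fit at once.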

--enable-prefix-caching is almost always worth it for chat workloads where the system prompt is fixed.
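Keeping these flags in one reviewed place beats scattering them across manifests. A small sketch that assembles the launch arguments, assuming the server is started with vllm serve; the function name and the model ID in the usage note are hypothetical.

```python
def vllm_serve_args(model, max_model_len, gpu_memory_utilization=0.9,
                    prefix_caching=True):
    """Return the argv list for a vLLM server with the flags discussed above."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    args = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if prefix_caching:
        args.append("--enable-prefix-caching")
    return args
```

For a co-tenanted pod, vllm_serve_args("example-org/example-7b", 4096, gpu_memory_utilization=0.85) yields the 0.85 setting suggested above.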

3. SLOs that mean something

Three SLIs we put on every dashboard:

  • First-token latency p99. Target ≤ 500 ms for chat UIs. This is what users feel.
  • Tokens-per-second p50. Target depends on the model; track the trend, not the absolute.
  • Error rate / OOM rate. Anything > 0.1% usually means your max-model-len is too generous.
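These three SLIs can be computed offline from request logs before a dashboard exists. A minimal sketch using a nearest-rank percentile; the 500 ms and 0.1% thresholds come from the list above, while the function names and input shapes are assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(ttft_ms, tokens_per_s, failed, total):
    """Summarize the three SLIs and flag the two that have fixed targets."""
    ttft_p99 = percentile(ttft_ms, 99)
    error_rate = failed / total
    return {
        "ttft_p99_ms": ttft_p99,
        "tokens_per_s_p50": percentile(tokens_per_s, 50),
        "error_rate": error_rate,
        "ttft_ok": ttft_p99 <= 500,        # chat-UI target from the checklist
        "errors_ok": error_rate <= 0.001,  # 0.1% ceiling from the checklist
    }
```

Tokens-per-second has no pass/fail flag here on purpose: the checklist treats it as a trend to watch rather than an absolute target.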

4. Autoscaling that doesn’t lie

HPA on CPU is meaningless for GPU workloads. Use a custom metric — request queue depth, in-flight requests, or first-token latency. KEDA + Prometheus is the path of least resistance. Scale-down delay needs to be generous (5–10 minutes) because cold-start pulls a multi-GB model image.
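To see how a custom metric and a generous scale-down delay interact, it helps to model the arithmetic. A simplified sketch: the replica formula is the standard HPA ratio (current replicas times observed metric over target, rounded up), and the stabilizer keeps the highest recommendation seen within a window so that scale-up is immediate and scale-down waits. The per-pod target of 8 in-flight requests and the replica bounds are assumptions for illustration.

```python
import math
from collections import deque


def desired_replicas(current_replicas, in_flight_per_pod, target_per_pod,
                     min_replicas=1, max_replicas=8):
    """HPA-style ratio: scale so the per-pod metric returns to its target."""
    raw = math.ceil(current_replicas * in_flight_per_pod / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


class ScaleDownStabilizer:
    """Hold the highest recent recommendation for `window_seconds`."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommendation) pairs

    def recommend(self, now, desired):
        self._history.append((now, desired))
        # Drop recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        return max(d for _, d in self._history)
```

With a 600-second window, a burst that pushes the recommendation to 4 replicas keeps 4 replicas running for ten minutes after load drops, which avoids paying the cold-start cost again if traffic returns.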

5. Failure modes to rehearse

Three to drill before launch:

  1. GPU pulled from under you. Spot interruption mid-stream. The client should retry on a different replica without the user noticing.
  2. OOM on a long-context request. Should return a clean 413, not crash the pod.
  3. Image pull throttling on scale-up. Pre-pull on node bootstrap, or use a node group with a warm image cache.
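The first two drills translate into small pieces of client and gateway logic. A sketch under stated assumptions: send is a hypothetical callable that raises ConnectionError when a replica disappears, and the admission check mirrors the 413 suggested above (the status your gateway or vLLM itself returns may differ).

```python
def send_with_failover(replicas, send, max_attempts=3):
    """Try each replica in order until one answers or attempts run out."""
    last_error = None
    for replica in replicas[:max_attempts]:
        try:
            return send(replica)
        except ConnectionError as exc:
            # The replica vanished (for example a Spot reclaim); try the next.
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error


def admission_status(prompt_tokens, max_new_tokens, max_model_len):
    """Reject requests that cannot fit in the context window before they run."""
    if prompt_tokens + max_new_tokens > max_model_len:
        return 413
    return 200
```

Retrying is only invisible to the user if no tokens have been streamed yet; once output has started, the client has to either restart the response or resume from what it already showed, which this sketch does not attempt.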

What we’d do next

Run synthetic load against the cluster with vLLM’s benchmark tooling (vllm bench serve in recent releases) at 2× projected peak for 30 minutes. The interesting numbers don’t show up at 5-minute spot checks.
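If a dedicated benchmark tool is not available in your environment, the shape of such a test is simple to reproduce. A minimal open-loop load sketch with asyncio: requests start at a fixed rate regardless of how fast earlier ones finish, which is what exposes queueing at 2× peak. The send coroutine is a hypothetical stand-in for one real request to the endpoint.

```python
import asyncio
import time


async def run_load(send, rate_per_s, duration_s):
    """Fire requests at a fixed rate and collect latencies and error count."""
    latencies = []
    errors = 0

    async def one_request():
        nonlocal errors
        start = time.perf_counter()
        try:
            await send()
            latencies.append(time.perf_counter() - start)
        except Exception:
            errors += 1

    total = round(rate_per_s * duration_s)
    interval = 1.0 / rate_per_s
    tasks = []
    for _ in range(total):
        # Open loop: do not wait for the previous request to complete.
        tasks.append(asyncio.create_task(one_request()))
        await asyncio.sleep(interval)
    await asyncio.gather(*tasks)
    return latencies, errors
```

For the run described above, duration_s would be 1800 and rate_per_s twice the projected peak request rate; the latencies list can then be summarized as p50 and p99.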

If you want a second set of eyes on a vLLM deployment before it goes to customers, book a 30-min call. We’ve done this enough times that the failure modes look familiar.

Tags

ai · kubernetes · vllm · llm
