Blog · May 8, 2026 · 6 min
Running vLLM on EKS — a production checklist
The non-obvious settings that decide whether your LLM serving is reliable at p99, not just at p50.
By HarmanJyot Kaur
Most teams get vLLM working in a single GKE or EKS pod in an afternoon. Running it in production — with predictable p99 latency, sane autoscaling, and survivable failure modes — takes longer. This is the checklist we walk new clients through.
1. Pick the right GPU node group
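For 7B–13B models, a g5.xlarge (A10G, 24 GB) is usually enough for a single replica with continuous batching, though a 13B model at 16-bit precision needs about 26 GB for weights alone, so it has to be quantized to fit. For 70B, you’re in p4d / p5 (A100 / H100) territory. The traps: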
- Spot vs. on-demand. Inference workloads can tolerate Spot for non-prime-time tiers; user-facing prod usually shouldn’t. Mix capacity types per traffic class, don’t average them.
- Quota. EKS will happily provision a node group whose instance type your account has zero quota for. Confirm GPU quota in service-quotas before declaring “deployable.”
- NVIDIA device plugin must be installed, and the serving container must request nvidia.com/gpu: 1 in the pod spec. Forgetting this is the #1 reason a pod sits Pending.
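The GPU request is easy to verify before anything reaches the cluster. The sketch below is a hypothetical pre-deploy check, not part of any EKS or vLLM tooling: it treats a pod spec as a plain dictionary and looks for nvidia.com/gpu under each container's resource limits, which is where Kubernetes expects extended resources to be declared. The container name and image are placeholders.

```python
# Hypothetical pre-deploy check: does any container in the pod spec ask for a GPU?
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_limit(container: dict) -> int:
    """Return how many GPUs a container asks for (0 if the key is missing)."""
    limits = container.get("resources", {}).get("limits", {})
    return int(limits.get(GPU_RESOURCE, 0))


def pod_requests_gpu(pod_spec: dict) -> bool:
    """True if at least one container in the pod spec requests a GPU."""
    return any(gpu_limit(c) > 0 for c in pod_spec.get("containers", []))


# Illustrative pod spec fragment; name and image are placeholders.
vllm_pod_spec = {
    "containers": [
        {
            "name": "vllm",
            "image": "example.invalid/vllm:placeholder",
            "resources": {"limits": {GPU_RESOURCE: 1}},
        }
    ]
}

if not pod_requests_gpu(vllm_pod_spec):
    raise SystemExit("pod spec does not request nvidia.com/gpu")
```

A check like this fits naturally in CI, run against the rendered manifest rather than the template.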
2. Tune the serving runtime
vLLM’s defaults are good, but two flags change everything:
- --max-model-len: set this to the largest context you actually serve, not the model’s max. Bigger context means bigger KV cache means fewer concurrent requests.
- --gpu-memory-utilization: default 0.9 is fine in isolation. If you’re co-tenanting with monitoring sidecars or running inference + a small embedding model in the same pod, drop to 0.85.
--enable-prefix-caching is almost always worth it for chat workloads where the system prompt is fixed.
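The link between context length and concurrency is plain arithmetic. The sketch below uses the standard per-token KV cache size for a decoder-only transformer (one key and one value vector per layer per KV head) and counts how many worst-case, full-length sequences fit in a given cache budget. The model shape and the 8 GiB budget are assumptions chosen for illustration, roughly a 7B-class model with full multi-head attention at 16-bit precision; substitute your model's config values.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes for one token. The factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_full_length_sequences(kv_budget_bytes: int, max_model_len: int,
                              bytes_per_token: int) -> int:
    """How many sequences fit if every one grows to max_model_len tokens."""
    return kv_budget_bytes // (max_model_len * bytes_per_token)


# Assumed shape: 32 layers, 32 KV heads, head dimension 128, 2-byte values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Assumed budget: GPU memory left for the cache after weights and overhead.
kv_budget = 8 * GIB

for max_len in (4096, 16384):
    fits = max_full_length_sequences(kv_budget, max_len, per_token)
    print(f"max_model_len={max_len}: {fits} full-length sequences")
```

With these assumed numbers, quadrupling the context from 4096 to 16384 cuts the worst-case concurrency from 4 sequences to 1. Real traffic does better than the worst case because most requests never reach the maximum length, but the ratio is the point.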
3. SLOs that mean something
Three SLIs we put on every dashboard:
- First-token latency p99. Target ≤ 500 ms for chat UIs. This is what users feel.
- Tokens-per-second p50. Target depends on the model; track the trend, not the absolute.
- Error rate / OOM rate. Anything > 0.1% means your max-model-len is too generous.
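As a rough illustration of how the first and third SLIs turn into pass/fail checks, the sketch below computes a nearest-rank p99 over first-token latency samples and compares the error rate with the 0.1% threshold. The function names and the report layout are made up for this example; in practice these numbers usually come from your metrics backend rather than raw samples in memory.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile: the value at rank ceil(p% of n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed (0.0 when there was no traffic)."""
    return failed / total if total else 0.0


def slo_report(ttft_ms, failed: int, total: int,
               ttft_target_ms: float = 500.0, error_budget: float = 0.001) -> dict:
    """Check first-token latency p99 and error rate against their targets."""
    p99 = percentile(ttft_ms, 99)
    rate = error_rate(failed, total)
    return {
        "ttft_p99_ms": p99,
        "ttft_ok": p99 <= ttft_target_ms,
        "error_rate": rate,
        "errors_ok": rate <= error_budget,
    }


# Made-up samples: 98 fast requests, one slow one, one very slow one.
samples = [100] * 98 + [450, 900]
print(slo_report(samples, failed=1, total=2000))
```

Note that the single 900 ms outlier does not break the p99 target here, which is exactly why the checklist pairs the latency SLI with an error-rate SLI.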
4. Autoscaling that doesn’t lie
HPA on CPU is meaningless for GPU workloads. Use a custom metric — request queue depth, in-flight requests, or first-token latency. KEDA + Prometheus is the path of least resistance. Scale-down delay needs to be generous (5–10 minutes) because cold-start pulls a multi-GB model image.
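To make the scaling behavior concrete, here is a small simulation of the replica calculation the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current × metric / target)), plus a scale-down stabilization window that holds the highest recent recommendation. KEDA feeds external metrics into an HPA, so the same arithmetic applies. The metric is assumed to be a per-replica average such as in-flight requests; the class name, bounds, and the 600-second window are illustrative, not defaults of any tool.

```python
import math
from collections import deque


def desired_replicas(current_replicas: int, metric_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """HPA-style recommendation, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * metric_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


class ScaleDownStabilizer:
    """Return the highest recommendation seen within the last window_seconds.

    Scale-up takes effect immediately (a higher value becomes the new max);
    scale-down waits until the older, higher recommendations age out.
    """

    def __init__(self, window_seconds: float = 600.0):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommendation) pairs

    def recommend(self, now: float, recommendation: int) -> int:
        self._history.append((now, recommendation))
        # Drop recommendations that are older than the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        return max(r for _, r in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=600)
# 4 replicas averaging 15 in-flight requests each against a target of 10.
print(stabilizer.recommend(now=0, recommendation=desired_replicas(4, 15, 10)))
# Load drops five minutes later, but the window still holds the higher value.
print(stabilizer.recommend(now=300, recommendation=desired_replicas(6, 3, 10)))
```

The window is what keeps a brief lull from discarding replicas that took minutes to warm up.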
5. Failure modes to rehearse
Three to drill before launch:
- GPU pulled from under you. Spot interruption mid-stream. The client should retry on a different replica without the user noticing.
- OOM on a long-context request. Should return a clean 413, not crash the pod.
- Image pull throttling on scale-up. Pre-pull on node bootstrap, or use a node group with a warm image cache.
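The first two drills can be prototyped without a cluster. The sketch below is a simplified illustration under stated assumptions: a gateway-side admission check that answers 413 when the prompt plus the requested output cannot fit in the context window, and a failover helper that tries the next replica when one disappears. ReplicaUnavailable, admit, and call_with_failover are hypothetical names, and the replicas are plain callables standing in for HTTP clients. It also ignores streaming: resuming a half-delivered stream on another replica takes more bookkeeping than is shown here.

```python
class ReplicaUnavailable(Exception):
    """Raised by a (hypothetical) replica client when its backend goes away."""


def admit(prompt_tokens: int, max_new_tokens: int, max_model_len: int) -> int:
    """Return 413 if the request cannot fit in the context window, else 200."""
    if prompt_tokens + max_new_tokens > max_model_len:
        return 413
    return 200


def call_with_failover(replicas, request):
    """Try each replica once, in order, and return the first successful response."""
    last_error = None
    for send in replicas:
        try:
            return send(request)
        except ReplicaUnavailable as exc:
            last_error = exc  # remember the failure and move to the next replica
    raise ReplicaUnavailable("all replicas failed") from last_error


def interrupted_replica(request):
    # Stand-in for a replica whose Spot node was just reclaimed.
    raise ReplicaUnavailable("node reclaimed")


def healthy_replica(request):
    # Stand-in for a replica that answers normally.
    return f"ok:{request}"


print(admit(prompt_tokens=7000, max_new_tokens=2000, max_model_len=8192))
print(call_with_failover([interrupted_replica, healthy_replica], "hello"))
```

Rejecting oversized requests before they reach the engine is one way to keep a long-context request from becoming a pod-level failure.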
What we’d do next
Run synthetic load against the cluster with vllm-bench at 2× projected peak for 30 minutes. The interesting numbers don’t show up at 5-minute spot checks.
If you want a second set of eyes on a vLLM deployment before it goes to customers, book a 30-min call. We’ve done this enough times that the failure modes look familiar.