Free, no signup
GPU Cost Calculator for LLM Inference
A back-of-the-envelope estimator for the monthly cost of running an LLM inference workload on cloud GPUs. Punch in your traffic, pick a GPU, see what it would cost on-demand vs. Spot. Useful for sanity-checking a vendor quote or scoping a migration.
Numbers are approximate: cloud GPU prices and throughput shift constantly. Don't quote this to a CFO, but do quote it to your engineering team to start a real conversation.
How this is calculated
We multiply your peak request rate (requests per second) by your average output tokens per request to get tokens-per-second demand. We then divide that demand by the GPU's usable capacity: its tokens-per-second capacity (a vLLM-style throughput figure for the model class the GPU is sized for) multiplied by your target utilisation. Replicas needed = demand / (capacity × utilisation), rounded up. Effective hourly cost depends on the capacity model: on-demand uses the list price, Spot uses roughly 35–40% of the list price, and mixed blends the two.
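For readers who want to check the arithmetic by hand, here is a minimal Python sketch of the steps described above. It is not the calculator's actual code. The function names, the 730 hours-per-month figure, the 37.5% Spot fraction (the midpoint of the 35–40% range), the 50/50 split for the mixed model, and every number in the usage example are assumptions made for illustration.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average hours in a calendar month


def replicas_needed(peak_rps, avg_output_tokens, gpu_tokens_per_sec, utilisation):
    """Replicas = demand / (capacity x utilisation), rounded up."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    demand = peak_rps * avg_output_tokens  # tokens per second at peak
    return math.ceil(demand / (gpu_tokens_per_sec * utilisation))


def effective_hourly_price(list_price, capacity_model,
                           spot_fraction=0.375, spot_share=0.5):
    """Hourly price per GPU under each capacity model.

    spot_fraction: assumed Spot price as a fraction of list (midpoint of 35-40%).
    spot_share: assumed share of replicas on Spot in the mixed model.
    """
    spot_price = list_price * spot_fraction
    if capacity_model == "on_demand":
        return list_price
    if capacity_model == "spot":
        return spot_price
    if capacity_model == "mixed":
        return spot_share * spot_price + (1 - spot_share) * list_price
    raise ValueError(f"unknown capacity model: {capacity_model}")


def monthly_cost(replicas, hourly_price):
    """Monthly cost of running all replicas around the clock."""
    return replicas * hourly_price * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical inputs: 20 RPS at peak, 300 output tokens per request,
    # a GPU rated at 2,500 tokens/s, 70% target utilisation, $4.00/hr list.
    n = replicas_needed(20, 300, 2500, 0.7)
    for model in ("on_demand", "spot", "mixed"):
        price = effective_hourly_price(4.0, model)
        print(model, n, round(monthly_cost(n, price), 2))
```

With these made-up inputs, demand is 6,000 tokens per second against 1,750 usable tokens per second per GPU, so the sketch rounds up to 4 replicas. Sizing on peak rather than average traffic means the estimate assumes the fleet stays at peak size all month.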
What this doesn't account for: storage, networking, observability, ingress, KV-cache memory pressure, prompt-caching wins, batching effects, multi-region replication, or cold-start cost. For a real production estimate, talk to a human (us, or someone else).
Want a real estimate, with the caveats this tool can't capture?
A 30-minute call. We've sized GPU footprints for inference workloads from 1 to 1,000 RPS.
Book a 30-min call →