Free, no signup

GPU Cost Calculator for LLM Inference

A back-of-the-envelope estimator for the monthly cost of running an LLM inference workload on cloud GPUs. Punch in your traffic, pick a GPU, see what it would cost on-demand vs. Spot. Useful for sanity-checking a vendor quote or scoping a migration.

Numbers are approximate (cloud GPU prices and throughput shift constantly). Don't quote this to a CFO — but do quote it to your engineering team to start a real conversation.

How this is calculated

We multiply your peak request rate (requests per second) by your average output tokens per request to get demand in tokens per second. Capacity is the GPU's tokens-per-second throughput, a vLLM-style figure for the model class that GPU is sized for. Replicas needed = demand / (capacity × utilisation), rounded up to a whole replica. The effective hourly cost per replica depends on the capacity model: on-demand uses the list price, Spot is priced at roughly 35–40% of list, and mixed blends the two.
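The arithmetic above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the calculator's actual source. It assumes replicas run 24/7 at peak sizing, a month is ~730 hours, Spot costs 37.5% of list (the midpoint of the 35–40% range), and the mixed model is a hypothetical 50/50 Spot/on-demand split. All function names and example figures are illustrative only.

```python
import math

# Assumption: ~24 h × 365 days / 12 months, with replicas running around the clock.
HOURS_PER_MONTH = 730


def replicas_needed(peak_rps, avg_output_tokens, gpu_tokens_per_sec, utilisation):
    """Demand (tokens/s) divided by usable capacity per replica, rounded up."""
    demand = peak_rps * avg_output_tokens  # tokens per second at peak
    usable_capacity = gpu_tokens_per_sec * utilisation
    return math.ceil(demand / usable_capacity)


def effective_hourly(list_price, capacity_model, spot_fraction=0.375, spot_share=0.5):
    """Hourly price per replica.

    spot_fraction: Spot price as a fraction of list (assumed midpoint of 35-40%).
    spot_share: hypothetical share of Spot replicas in the "mixed" model.
    """
    spot_price = list_price * spot_fraction
    if capacity_model == "on-demand":
        return list_price
    if capacity_model == "spot":
        return spot_price
    if capacity_model == "mixed":
        # Weighted blend of Spot and on-demand rates.
        return spot_share * spot_price + (1 - spot_share) * list_price
    raise ValueError(f"unknown capacity model: {capacity_model!r}")


def monthly_cost(replicas, hourly_price):
    """Replicas × hourly price × hours in a month."""
    return replicas * hourly_price * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical workload: 20 RPS peak, 300 output tokens per request,
    # a GPU rated at 2,500 tokens/s, 70% target utilisation, $4.00/h list price.
    n = replicas_needed(20, 300, 2500, 0.7)  # 6,000 / 1,750 ≈ 3.43, rounded up to 4
    for model in ("on-demand", "spot", "mixed"):
        print(model, n, monthly_cost(n, effective_hourly(4.00, model)))
```

With these made-up inputs the sketch prints 4 replicas at $11,680/month on-demand, $4,380 on Spot and $8,030 mixed. Rounding up happens on replicas rather than on cost, because you cannot rent a fraction of a GPU.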

What this doesn't account for: storage, networking, observability, ingress, KV cache memory pressure, prompt caching wins, batching effects, multi-region replication, or cold-start cost. For a real production estimate, talk to a human (us or anyone else).

Want a real estimate, with the caveats this tool can't capture?

A 30-minute call. We've sized GPU footprints for inference workloads from 1 to 1,000 RPS.

Book a 30-min call →