you are not paying for compute. you are paying for idle.

At 10% utilization, self-hosted inference costs 6x more than the API. The math only works above 90%.

March 28, 2026

Most teams think their GPU bill is a compute bill.

It isn't. It's an idle bill. The compute is almost incidental.

Here's the number that broke my brain last year -- at 10% GPU utilization, self-hosted inference on an H100 costs $0.13 per thousand tokens. The same output from a managed API costs $0.02 per thousand tokens. You built infrastructure to be six times more expensive than just calling the API. Congratulations on the infra.
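You can reproduce that headline number with a back-of-envelope sketch. The inputs below are assumptions I'm choosing to match the figure -- roughly one $3.50/hr H100 pushing ~75 tokens/second at full load, a lightly batched single-GPU deployment -- not measurements from a real cluster:

```python
# Effective cost per 1K tokens as a function of utilization.
# Assumed inputs, not measurements: one H100 at $3.50/hr serving
# ~75 tok/s at full load (a lightly batched single-GPU deployment).
HOURLY_RATE = 3.50      # USD per GPU-hour
THEORETICAL_TPS = 75    # tokens/second at 100% utilization
API_PRICE = 0.02        # USD per 1K tokens from a managed API

def cost_per_1k(utilization: float) -> float:
    delivered_per_hour = THEORETICAL_TPS * utilization * 3_600
    return HOURLY_RATE / delivered_per_hour * 1_000

for u in (0.10, 0.40, 0.90):
    c = cost_per_1k(u)
    print(f"{u:.0%} utilization: ${c:.3f}/1K tokens ({c / API_PRICE:.1f}x the API)")
```

Hardware-only, the curve crosses the API price well below 100%; the 90% bar below is where you stay ahead once the engineers and the on-call are priced in too.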

The math works in one direction only: above 90% utilization on sustained, predictable load. That's it. That's the whole constraint. Every team that self-hosts and sits at 40% utilization is paying more than they would have paid OpenAI and also has someone on salary to operate the thing.

I want to run the actual numbers because most people are working from vibes.

H100 on CoreWeave right now: $3.50/hr on-demand. Eight of them in a serving cluster for Llama 3 70B: $28/hr. At 2,500 tokens/second throughput with continuous batching -- which is a real number, not a theoretical one -- you are producing 9 million tokens per hour. Your cost per million tokens is about $3.10.

Together AI charges $3.50/M for the same model. You are barely cheaper. And you paid for the engineers. And you built the deployment pipeline. And you own the on-call rotation. And when vLLM releases an update that breaks something at 3am, that is your problem, not theirs.

"But at scale the economics flip--" They do. Above 10 billion tokens a month at 90%+ utilization, self-hosting becomes genuinely cheaper. Most teams reading this are not at 10 billion tokens a month. Most teams reading this are running a cool AI product that does 300 million tokens a month and paying $1,800 for their own GPU cluster when the API would have cost $1,050.

The API wins below 10B monthly tokens. Not slightly. Decisively.

The second mistake is buying H100s for inference.

This is the one that actually surprises people. The H100 is a training GPU that got drafted into inference. It has 989 TFLOPS of dense BF16 compute, NVLink at 900 GB/s, FP8 support -- features sized for training that most inference workloads heavily underutilize. You are paying for capability you cannot use, because inference -- the decode phase especially -- is memory-bandwidth-bound, not compute-bound.

An L40S costs $1.49/hr on Hyperbolic. An H100 costs $3.20/hr. The L40S delivers comparable cost-per-token for 7B-30B model inference because the binding constraint is memory bandwidth, and the L40S's GDDR6 has enough of it for that workload range. You are paying 2x for hardware whose differentiating features do not matter for the thing you are doing.

This is not universally true. 70B+ models need the H100's memory capacity. Very high batch sizes need the compute. But the team running Llama 3 8B for a production use case on $3.50/hr H100s when $1.49/hr L40Ss would serve the same throughput is just... leaving money on the table. Quietly. Every hour.

The formula that matters is not hourly rate. It is:

Cost Per Token = Hourly Rate ÷ (System Throughput in tokens/second × 3,600)

An H200 at $2.50/hr with 5,000 tokens/second is cheaper per token than an H100 at $2.00/hr with 3,000 tokens/second. The more expensive GPU is the cheaper GPU because you are buying throughput, not time.
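The same comparison as a few lines of Python, using the illustrative prices and throughputs from this paragraph:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    """The formula above, scaled to dollars per million tokens."""
    return hourly_rate / (tokens_per_sec * 3_600) * 1_000_000

# Illustrative numbers from the paragraph, not quotes.
print(f"H200 @ $2.50/hr, 5,000 tok/s: ${cost_per_million_tokens(2.50, 5_000):.2f}/M")
print(f"H100 @ $2.00/hr, 3,000 tok/s: ${cost_per_million_tokens(2.00, 3_000):.2f}/M")
# -> H200 ~$0.14/M vs H100 ~$0.19/M: the pricier GPU is cheaper per token.
```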

The hyperscaler premium is real and most people pay it out of habit.

AWS H100 instances: $12.30/hr. CoreWeave: $3.50/hr. Lambda: $2.99/hr. Hyperbolic: $3.20/hr. Same GPU. The hyperscaler charges 3-4x and adds egress fees on top -- typically $0.08-$0.12/GB -- which on a high-traffic inference endpoint adds 10-20% to the monthly bill before you notice it.

The hyperscaler has an actual value proposition: ecosystem integration, SLA guarantees, compliance tooling, SageMaker, Vertex, Azure ML. If you are a regulated enterprise that needs those things, pay for them. If you are a startup running vLLM on Kubernetes and calling it done, you are paying $12.30/hr for $3.50/hr of actual GPU and $8.80/hr of infrastructure you reimplemented yourself.

There's also the virtualization overhead nobody mentions. Hyperscaler GPU VMs add hypervisor overhead that reduces memory bandwidth utilization by roughly 10-15%. Your effective hourly rate is not $4/hr. It is $4/hr ÷ 0.85 ≈ $4.70/hr. Bare metal instances don't have this. You get the rated performance. That gap is pure margin on high-throughput serving workloads.
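A sketch of the same adjustment -- the 0.85 is the paragraph's ~15% bandwidth loss, and treating lost bandwidth as lost throughput roughly 1:1 is the right first-order model for a bandwidth-bound serving workload:

```python
# Effective hourly rate once hypervisor overhead is priced in.
# 0.85 = the paragraph's ~15% bandwidth loss; on a bandwidth-bound
# workload, lost bandwidth is lost throughput, roughly 1:1.
def effective_rate(sticker_rate: float, bandwidth_efficiency: float) -> float:
    return sticker_rate / bandwidth_efficiency

print(f"hyperscaler VM: ${effective_rate(4.00, 0.85):.2f}/hr")  # ~$4.71/hr
print(f"bare metal:     ${effective_rate(4.00, 1.00):.2f}/hr")  # rated performance
```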

The Jevons Paradox is eating everyone's inference budget and nobody is talking about it by name.

GPT-4-class inference cost $20 per million tokens in late 2022. It costs $0.40 today -- a 50x reduction in three years. Hold capability constant instead of model tier and the drop is steeper: tokens at the quality of the original ChatGPT are roughly 1,000x cheaper than they were at launch. The Jevons Paradox says that when a resource becomes more efficient, total consumption of it increases, because efficiency enables new use cases. Per-token cost collapsed. Total inference spend grew 320%. The efficiency gains made AI economically viable for use cases that couldn't exist before, which created demand that didn't exist before, which consumed the savings and then some.

This is not a problem. It's how technology diffuses. But it means you cannot cost-reduce your way out of an inference budget by just finding cheaper hardware. If you cut per-token cost by 50%, you will likely serve 2x the traffic within a year. The bill stays flat or grows. The optimization you actually need is utilization -- filling the GPUs you have before renting more GPUs.

The number I track every week for the inference workloads I run: effective GPU utilization.

Not the number in the dashboard that says "GPU 87%" because the GPU is technically doing something. The number that answers: what fraction of my theoretical token throughput am I actually delivering? If I have 2,500 tokens/second of theoretical capacity and I am serving 800 tokens/second average across the day, I am at 32% utilization. I am paying for 2,500 and using 800. The other 1,700 tokens/second of capacity are money sitting idle on a rack.
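As a calculation it's trivial -- the work is in measuring delivered throughput honestly rather than reading the dashboard. A sketch with the numbers from the paragraph above:

```python
# Effective utilization: delivered throughput over theoretical capacity.
# Numbers from the paragraph above.
theoretical_tps = 2_500   # tokens/second the deployment can serve, fully batched
delivered_tps = 800       # average tokens/second actually served

utilization = delivered_tps / theoretical_tps
print(f"effective utilization: {utilization:.0%}")                # 32%
print(f"idle capacity: {theoretical_tps - delivered_tps:,} tok/s")  # paid for, unused
```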

Continuous batching helps. Dynamic batching helps. Autoscaling down during off-peak hours helps. Prefill-decode disaggregation helps because it means your decode capacity doesn't sit idle waiting for prefill to finish. All of these optimizations are about the same thing: filling the GPU before paying for the next one.

The teams spending the least per token are not the teams with the best hardware.

They are the teams with the highest utilization on whatever hardware they have.

h100 at 10% utilization: $0.13 per thousand tokens.

managed api: $0.02 per thousand tokens.

six times more expensive. plus the engineer. plus the on-call.

the question is never which gpu. the question is always how full it is.

i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.

no spam. no sequence. just the note, when it exists.