GPU Infrastructure Economics: Cost Optimization Strategies

The Real Cost of GPU Compute

An H100 SXM5 instance from a major cloud provider costs roughly $3.50–$5.00 per GPU-hour on demand. A training run for a mid-sized language model can consume thousands of GPU-hours. At that scale, a 20% improvement in utilization means the same work needs about 17% fewer GPU-hours, which is often worth more than any other single optimization you can make. Most teams, however, focus on model architecture and data pipelines and treat infrastructure costs as a fixed line item, which is where significant money gets left on the table.

The first step in controlling GPU costs is measurement. Raw spend is not the right metric; cost per useful unit of work is. For training, that means cost per training step or cost per token processed. For inference, it means cost per 1,000 tokens generated at your latency target. Without these numbers, it's impossible to know whether a change to batch size, quantization level, or instance type actually improved economics or just shifted cost elsewhere.
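The unit-cost metrics described here reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the hourly rate and throughput you pass in are assumed to come from your own billing data and benchmarks.

```python
def cost_per_1k_tokens(gpu_hourly_rate: float, num_gpus: int,
                       tokens_per_second: float) -> float:
    """Dollar cost per 1,000 tokens at a sustained aggregate throughput.

    tokens_per_second should be measured at your latency target,
    not at peak offline throughput.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    cost_per_second = gpu_hourly_rate * num_gpus / 3600.0
    return cost_per_second / tokens_per_second * 1000.0


def cost_per_training_step(gpu_hourly_rate: float, num_gpus: int,
                           seconds_per_step: float) -> float:
    """Dollar cost of one optimizer step across all GPUs in the job."""
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    return gpu_hourly_rate * num_gpus / 3600.0 * seconds_per_step
```

Comparing these numbers before and after a change (a new batch size, a different instance type) shows whether the change improved unit economics, regardless of what happened to total spend.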

Matching GPU Class to Workload Requirements

Not every workload needs an H100. The H100 is optimized for FP8 and FP16 training with transformer models and NVLink-connected multi-GPU scaling. For inference of smaller models — 7B to 13B parameters — an A10G or L4 often delivers better cost-per-token because those GPUs are cheaper per hour and the workload doesn't saturate the higher-end memory bandwidth or tensor core throughput of an H100.

Quantization changes this calculus further. A 70B model quantized to INT4 needs approximately 35 GB for its weights (0.5 bytes per parameter), versus roughly 140 GB in FP16, so it fits on a single 80 GB GPU with room left for the KV cache, and it can run inference at speeds comparable to the same model in FP16 on twice as many GPUs. Quantization methods like GPTQ and AWQ make INT4 quantization straightforward for most transformer architectures, with acceptable quality degradation for many production use cases. Running quantized inference on a single A100 80GB instead of an H100 pair can cut inference costs by 40–60% for latency-tolerant applications.
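The memory arithmetic behind the 35 GB figure can be sketched as follows. The 10% headroom for KV cache and activations is an assumption chosen for illustration; real headroom depends on context length, batch size, and the serving stack.

```python
import math


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus(num_params: float, bits_per_param: float,
             gpu_memory_gb: float, headroom: float = 0.10) -> int:
    """Smallest GPU count whose combined memory holds weights plus headroom.

    headroom is a hypothetical allowance for KV cache and activations,
    expressed as a fraction of the weight memory.
    """
    needed = weight_memory_gb(num_params, bits_per_param) * (1 + headroom)
    return math.ceil(needed / gpu_memory_gb)


int4_gb = weight_memory_gb(70e9, 4)    # INT4: 0.5 bytes per parameter
fp16_gb = weight_memory_gb(70e9, 16)   # FP16: 2 bytes per parameter
```

Under these assumptions a 70B model needs one 80 GB GPU in INT4 and two in FP16, which is where the cost difference comes from.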

Spot and Preemptible Instances for Training

Spot and preemptible GPU instances run on the same physical hardware as on-demand instances and are offered at discounts of 50–80% when cloud providers have excess capacity. The trade-off is that the instance can be reclaimed on short notice (typically 30 seconds to 2 minutes) when demand for on-demand capacity rises. For long training runs this sounds prohibitive, but most modern training frameworks handle it well with checkpointing.
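To put the trade-off in rough numbers, the sketch below estimates the discount that remains after accounting for work lost to interruptions. It is a simplified model, not a provider formula: it assumes each interruption loses half a checkpoint interval of work on average plus a fixed restart time, and that all of that time is billed. All parameter names and example values are hypothetical.

```python
def effective_spot_discount(on_demand_rate: float, spot_rate: float,
                            interruptions_per_hour: float,
                            checkpoint_interval_min: float,
                            restart_min: float) -> float:
    """Discount vs. on-demand after paying for lost and restart time."""
    # Average minutes of paid but wasted time per interruption (assumption).
    lost_min = checkpoint_interval_min / 2 + restart_min
    wasted_fraction = interruptions_per_hour * lost_min / 60.0
    if wasted_fraction >= 1:
        raise ValueError("interruptions consume all paid time")
    # Price per hour of useful work, not per hour billed.
    effective_rate = spot_rate / (1 - wasted_fraction)
    return 1 - effective_rate / on_demand_rate


# Example: 70% list discount, one interruption every 4 hours,
# 10-minute checkpoint interval, 5-minute restart.
discount = effective_spot_discount(4.00, 1.20, 0.25, 10.0, 5.0)
```

In this example the lost time trims the 70% list discount only slightly, which is why short checkpoint intervals matter more than avoiding interruptions entirely.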

The pattern is straightforward: checkpoint model state to durable storage (S3, GCS, or similar) every N steps, where N is small enough that losing one checkpoint interval is acceptable. When the spot instance is reclaimed, the job restarts from the last checkpoint on a new instance. With automatic restarts handled by the orchestration layer, spot interruptions become a minor nuisance rather than a catastrophe. Teams running large training jobs on spot instances routinely achieve effective discounts of 60–70% vs. on-demand pricing with checkpoint intervals of 5–15 minutes.
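A minimal sketch of the checkpoint-and-resume pattern follows, using only the standard library. It is an illustration under stated assumptions, not a production implementation: a local directory stands in for durable storage such as S3 or GCS, a counter stands in for model and optimizer state, and the `Preempted` exception and `preempt_at` parameter are hypothetical devices for simulating a reclaim.

```python
import json
import os
import tempfile
from pathlib import Path


class Preempted(Exception):
    """Simulates the instance being reclaimed mid-run."""


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> None:
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    tmp = ckpt_dir / "latest.json.tmp"
    tmp.write_text(json.dumps({"step": step, "state": state}))
    # Rename is atomic, so a reclaim mid-write never corrupts the checkpoint.
    os.replace(tmp, ckpt_dir / "latest.json")


def load_checkpoint(ckpt_dir: Path):
    path = ckpt_dir / "latest.json"
    if not path.exists():
        return 0, {"updates": 0}  # fresh start
    data = json.loads(path.read_text())
    return data["step"], data["state"]


def train(ckpt_dir: Path, total_steps: int, checkpoint_every: int,
          preempt_at=None):
    step, state = load_checkpoint(ckpt_dir)  # resume if a checkpoint exists
    while step < total_steps:
        if preempt_at is not None and step == preempt_at:
            raise Preempted(f"reclaimed at step {step}")
        state["updates"] += 1  # stand-in for one real training step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_dir = Path(tmp) / "ckpt"
        try:
            train(ckpt_dir, total_steps=20, checkpoint_every=5, preempt_at=7)
        except Preempted:
            pass  # the orchestration layer would reschedule the job here
        resumed_from, _ = load_checkpoint(ckpt_dir)
        final_step, state = train(ckpt_dir, total_steps=20, checkpoint_every=5)
        return resumed_from, final_step, state["updates"]
```

In `demo()`, the first run is interrupted at step 7 and the second run resumes from the checkpoint written at step 5, so only two steps of work are repeated.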

Improving Utilization with Continuous Batching

GPU utilization on inference workloads is frequently poor. A model server waiting for requests sits idle; a server processing requests one at a time leaves most of the GPU's parallel capacity unused on every decoding step. Static batching (collecting requests into fixed-size batches before processing) improves throughput, but it adds latency for requests that have to wait for a batch window to fill, and it leaves batch slots idle once the shorter sequences in a batch finish.

Continuous batching, implemented in vLLM and TGI (Text Generation Inference), solves this by scheduling at the level of a single forward pass: each decoding iteration is an opportunity to add new requests to the in-flight batch. As sequences complete and free up memory, new requests are inserted immediately. This keeps the GPU busy whenever requests are waiting and reduces average latency compared to static batching under moderate-to-high load. In production, continuous batching typically achieves 2–4x higher throughput on the same hardware compared to naive per-request serving, which translates directly into fewer GPU instances needed to handle the same request volume.
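The scheduling difference can be illustrated with a toy simulation that counts forward passes. This is not how vLLM or TGI are implemented: it assumes all requests are queued at the start, one token per sequence per pass, and it ignores prefill and KV-cache memory limits. Function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Forward passes when each fixed batch runs until its longest sequence ends."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Forward passes when freed slots are refilled before every pass."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight sequence
    steps = 0
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one forward pass generates one token per active sequence
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Mixed short and long outputs, two batch slots.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
static_steps = static_batching_steps(lengths, batch_size=2)          # 32
continuous_steps = continuous_batching_steps(lengths, batch_size=2)  # 22
```

With mixed output lengths, static batching holds a slot idle while the long sequence in each batch finishes; refilling slots between passes recovers most of that idle capacity.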

Chargeback and Cost Attribution Across Teams

When multiple teams share a GPU cluster, allocating costs accurately is essential both for budgeting and for creating the right incentives. Kubernetes resource quotas and namespace-level resource limits enforce boundaries, but they don't produce the per-team cost reports that finance teams need. The NVIDIA DCGM Exporter combined with Prometheus labels on job and namespace can produce GPU-hour consumption metrics per team, which can be mapped to dollar costs using your provider's hourly rates.
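Once GPU-hour consumption is available per namespace, mapping it to dollars is a small aggregation. The sketch below assumes usage has already been exported from the metrics system into simple records; the `UsageRecord` type, the namespace names, and the hourly rates are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    namespace: str      # Kubernetes namespace, used here as the team key
    gpu_model: str      # key into the hourly rate table
    gpu_seconds: float  # allocated GPU time for this record


def chargeback(records, hourly_rates):
    """Return total dollar cost per namespace."""
    totals = defaultdict(float)
    for r in records:
        gpu_hours = r.gpu_seconds / 3600.0
        totals[r.namespace] += gpu_hours * hourly_rates[r.gpu_model]
    return dict(totals)


# Illustrative rates and usage, not real prices.
rates = {"H100": 4.00, "A10G": 1.00}
records = [
    UsageRecord("research", "H100", 7200.0),
    UsageRecord("research", "A10G", 3600.0),
    UsageRecord("search", "A10G", 1800.0),
]
costs = chargeback(records, rates)
```

Charging for allocated rather than utilized GPU time is a deliberate choice in this sketch: it makes an idle but reserved GPU visible on the team's bill.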

Accurate chargeback changes team behavior. When teams see the cost of leaving a development GPU allocated overnight versus releasing it, utilization improves without top-down mandates. When a team optimizes their batch size and sees a corresponding drop in their monthly GPU bill, the incentive to keep optimizing is self-reinforcing. Infrastructure cost visibility is as important as the technical optimizations themselves for sustaining long-term efficiency improvements.