Building AI Infrastructure with GPU Kubernetes and Open Source Services

Why AI Workloads Demand Specialized Infrastructure

Training a large language model or running inference at scale isn't like hosting a web application. GPU memory is usually the binding constraint: an 80 GB A100 has a fixed pool of HBM2e that must be divided among model weights, KV cache, and activation tensors. Latency requirements for real-time inference differ sharply from throughput requirements for batch processing. The infrastructure layer has to understand these distinctions rather than treat GPUs as expensive CPUs.
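As a rough illustration of that memory budgeting, the sketch below estimates how much room is left for KV cache once the weights are loaded. The model dimensions (80 layers, 8 KV heads, head dimension 128) and the 10% reserve for activations and runtime overhead are assumptions chosen to resemble a 70B-class model with grouped-query attention; they are not figures from this article.

```python
GB = 1e9  # decimal gigabytes, matching how GPU memory is usually quoted


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights; 2 bytes per parameter for float16/bfloat16."""
    return n_params * bytes_per_param


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(total_gpu_bytes: float, weights: float,
                      per_token: int, reserve_fraction: float = 0.1) -> int:
    """Tokens of KV cache that fit after the weights and a reserve for
    activations and runtime overhead (the default reserve is a guess)."""
    usable = total_gpu_bytes * (1 - reserve_fraction) - weights
    return max(0, int(usable // per_token))


# Assumed 70B-class dimensions; substitute the values for your own model.
weights = weight_bytes(70e9)
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
tokens = max_cached_tokens(2 * 80 * GB, weights, per_token)
print(f"weights: {weights / GB:.0f} GB, "
      f"KV cache: {per_token / 1024:.0f} KiB per token, "
      f"room for about {tokens} cached tokens on two 80 GB GPUs")
```

Under these assumptions the weights dominate and the KV cache gets only what is left, which is why the number of concurrent sequences is often limited by memory before compute.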

Network topology matters too. Multi-GPU training jobs that use tensor or pipeline parallelism need high-bandwidth, low-latency interconnects. NVLink within a node and InfiniBand between nodes are standard in serious training clusters; tensor parallelism is usually kept inside a node's NVLink domain, while pipeline and data parallelism span nodes. When scheduling training jobs across a Kubernetes cluster, the scheduler needs to be aware of NUMA topology, NVLink connectivity, and where nodes sit in the network fabric. Generic bin-packing can scatter a job's GPUs across racks and destroy throughput.

Kubernetes GPU Scheduling: The Basics

Kubernetes gains GPU awareness through the NVIDIA device plugin, which is most commonly deployed by the NVIDIA GPU Operator. The operator installs and manages the drivers, the NVIDIA Container Toolkit, the device plugin, and monitoring components from a single Helm chart. Once it is installed, pods can request GPUs by setting nvidia.com/gpu: 1 under resources.limits in the container spec. The scheduler places the pod only on nodes where that resource is available, and the container runtime maps the physical GPU into the container.
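To make the resource request concrete, here is a minimal sketch that builds a Pod manifest asking for one GPU and prints it as JSON, which kubectl accepts as well as YAML. The pod name and image are hypothetical placeholders, and the nvidia-smi command assumes the image ships that binary.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests whole GPUs via the extended
    resource advertised by the NVIDIA device plugin."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "main",
                "image": image,
                "command": ["nvidia-smi"],  # assumes the image includes it
                # Extended resources such as GPUs are set under limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


# Hypothetical name and image, for illustration only.
manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-base:latest")
print(json.dumps(manifest, indent=2))
```

Piping the output into kubectl apply -f - should schedule the pod on a node with a free GPU, or leave it Pending if no node advertises nvidia.com/gpu.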

For more granular allocation, NVIDIA Multi-Instance GPU (MIG) lets you partition an A100 or H100 into isolated slices, each with its own memory, cache, and compute engines. An 80 GB H100 can be split into as many as seven 10 GB MIG instances, each schedulable independently. This is useful for inference services that don't saturate a full GPU: instead of leaving 60% of an H100 idle while running a smaller model, you pack multiple inference replicas onto one card and raise the utilization of hardware you are already paying for.
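A small sketch of how one might pick a MIG slice for a given model footprint follows. The profile table lists a subset of commonly cited profiles for an 80 GB H100 and should be treated as an assumption to verify against NVIDIA's MIG documentation for your driver version; usable memory per slice is also somewhat below the nominal figure. The nvidia.com/mig-<profile> resource name is what the device plugin advertises under its "mixed" MIG strategy.

```python
# (profile, nominal memory in GB, max instances per GPU).
# Assumed subset for an 80 GB H100; check NVIDIA's MIG docs before relying on it.
H100_80GB_PROFILES = [
    ("1g.10gb", 10, 7),
    ("2g.20gb", 20, 3),
    ("3g.40gb", 40, 2),
    ("7g.80gb", 80, 1),
]


def smallest_fitting_profile(required_gb: float,
                             profiles=H100_80GB_PROFILES) -> tuple:
    """Return (profile, replicas_per_gpu) for the smallest slice whose
    nominal memory covers the requested footprint."""
    for name, memory_gb, max_instances in profiles:
        if required_gb <= memory_gb:
            return name, max_instances
    raise ValueError(f"{required_gb} GB does not fit on a single GPU")


def mig_resource_name(profile: str) -> str:
    """Kubernetes resource name under the device plugin's mixed strategy."""
    return f"nvidia.com/mig-{profile}"


profile, replicas = smallest_fitting_profile(8)
print(f"request {mig_resource_name(profile)}: 1, "
      f"up to {replicas} replicas per card")
```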

Inference Serving with vLLM and Triton

vLLM has become a widely used open-source inference engine for transformer-based language models. Its PagedAttention algorithm stores the KV cache in fixed-size blocks, much as an operating system manages virtual memory pages, which nearly eliminates fragmentation. Combined with continuous batching, this lets incoming requests join the running batch as soon as space becomes available rather than waiting for a fixed window. The vLLM authors reported up to 24x higher throughput than Hugging Face Transformers in their launch benchmarks; actual gains depend on the model, hardware, and request mix.
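The paged bookkeeping can be illustrated with a toy allocator. This is a simplified sketch of the idea, not vLLM's implementation: the class and method names are hypothetical, and it tracks only block IDs rather than real tensors.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks
    (its block table) instead of one contiguous region."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve space for one more token. Returns False when no block
        is free, which a scheduler would treat as a signal to wait."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # current block full, or none yet
            if not self.free_blocks:
                return False
            block = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lens[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):           # 17 tokens need two 16-token blocks
    cache.append_token("a")
print(cache.block_tables["a"], "free blocks:", len(cache.free_blocks))
```

Because a sequence wastes at most one partially filled block, freed blocks can be handed to a waiting request right away, which is what makes admitting new requests mid-batch practical.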

NVIDIA Triton Inference Server complements vLLM for heterogeneous model serving: it supports TensorRT, ONNX, PyTorch, and TensorFlow backends under a single gRPC/HTTP API. Triton's ensemble models let you chain preprocessing, model execution, and postprocessing into a single pipeline, passing intermediate tensors between stages inside the server instead of round-tripping through the client. For production deployments that serve several model types at once, running them in one Triton instance means one deployment, one API, and one metrics endpoint to operate rather than a separate service per model.

Storage Architecture for Model Weights

Model weights are large: a 70B-parameter model in float16 occupies about 140 GB. Loading from network storage on every pod start is impractical; even at full line rate, a 10 Gbps NFS link needs about 112 seconds to move 140 GB, and real transfers are slower once protocol overhead and contention are included. The standard pattern is to pre-populate a node-local cache using a Kubernetes DaemonSet or init container, then mount the local cache as a hostPath or local PersistentVolume for inference pods.
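The transfer-time estimate is simple arithmetic, sketched below. The efficiency parameter is a hypothetical knob for protocol overhead and contention; the 0.7 used in the second call is an illustrative guess, not a measurement.

```python
def load_time_seconds(size_gb: float, link_gbps: float,
                      efficiency: float = 1.0) -> float:
    """Time to move size_gb gigabytes over a link rated in gigabits per
    second. efficiency < 1.0 models overhead and contention."""
    size_gigabits = size_gb * 8  # bytes to bits
    return size_gigabits / (link_gbps * efficiency)


ideal = load_time_seconds(140, 10)          # full line rate
realistic = load_time_seconds(140, 10, 0.7)  # assumed 70% efficiency
print(f"ideal: {ideal:.0f} s, at 70% efficiency: {realistic:.0f} s")
```

Multiplying that delay by every pod restart and every replica is what makes a node-local cache worthwhile.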

For training, distributed storage that supports parallel reads from many workers at once, such as GPFS, Lustre, or S3-compatible object storage with aggressive prefetching, keeps the storage layer from becoming a bottleneck. Frameworks like PyTorch's DataLoader with multiple workers and prefetch queues help, but they can only hide so much latency. Placing training datasets on SSDs local to the training nodes, rather than on HDD-backed NAS, can yield a 3–5x improvement in data loading throughput for vision workloads with large image files.
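The prefetch-queue idea can be shown without any framework. The sketch below is a standard-library stand-in for what a data loader does internally, not PyTorch's implementation: a background thread reads ahead into a bounded queue while the consumer works on the current item. The prefetch function and its depth parameter are hypothetical names.

```python
import queue
import threading


def prefetch(iterable, depth: int = 4):
    """Yield items from iterable while a background thread reads ahead.
    At most `depth` items are buffered, which bounds memory use.
    Note: this sketch does not propagate producer exceptions."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks when the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


# Stand-in for slow storage reads: any iterable of samples works here.
for sample in prefetch(range(5), depth=2):
    print("training on sample", sample)
```

Read-ahead overlaps I/O with compute, but once the storage cannot sustain the consumer's average rate the buffer drains and the trainer stalls, which is the limit the paragraph above describes.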

Open Source Tools for the Full Stack

Kubeflow provides a platform for managing the end-to-end ML lifecycle on Kubernetes: Kubeflow Pipelines for workflow orchestration and Katib for hyperparameter tuning, alongside KServe (originally KFServing) for model serving with autoscaling. In its serverless mode, KServe builds on Knative to scale inference deployments to zero when idle and back up when requests arrive. Cold starts take seconds for small models but considerably longer when large weights must be loaded, so scale-to-zero pays off most when serving many models with uneven traffic.
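As a sketch of what scale-to-zero looks like in configuration, the snippet below builds a KServe InferenceService manifest with minReplicas set to 0 and prints it as JSON. The service name, storage URI, and replica bounds are hypothetical, and the model format should match a serving runtime actually installed in your cluster.

```python
import json


def inference_service(name: str, storage_uri: str,
                      model_format: str = "pytorch",
                      min_replicas: int = 0,
                      max_replicas: int = 3) -> dict:
    """Build a KServe v1beta1 InferenceService manifest.
    min_replicas=0 allows the predictor to scale to zero when idle."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                },
            }
        },
    }


# Hypothetical service name and bucket, for illustration only.
svc = inference_service("demo-model", "s3://example-bucket/models/demo")
print(json.dumps(svc, indent=2))
```

Setting min_replicas to 1 instead trades idle GPU cost for the removal of cold-start latency, which is often the right choice for large models.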

Prometheus and Grafana remain the standard for GPU observability. The NVIDIA DCGM Exporter exposes per-GPU metrics — utilization, memory used, temperature, power draw, NVLink bandwidth — in Prometheus format. Dashboards built on these metrics let operators spot underutilized GPUs, detect thermal throttling before it impacts performance, and track per-job GPU utilization for accurate cost attribution across teams.
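To show how those metrics can be used, the sketch below parses a few lines in Prometheus text format and flags GPUs below a utilization threshold. The sample values, hostname, and the 10% threshold are made up for illustration; DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP are DCGM field names, but a real exporter emits more labels than shown here, and in production this query would normally be written in PromQL rather than Python.

```python
import re

# Made-up sample in Prometheus exposition format.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 93
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 4
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 71
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 38
'''

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> list:
    """Return (metric, labels, value) tuples; comment lines are skipped
    because they start with '#' and do not match the pattern."""
    samples = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels)), float(value)))
    return samples


def underutilized(text: str, threshold: float = 10.0) -> list:
    """List (hostname, gpu index) pairs whose utilization is below threshold."""
    return [
        (labels.get("Hostname"), labels.get("gpu"))
        for name, labels, value in parse_samples(text)
        if name == "DCGM_FI_DEV_GPU_UTIL" and value < threshold
    ]


print(underutilized(SAMPLE))
```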