the scheduler gave me eight GPUs. they were the wrong eight GPUs.
GPU topology, disaggregated inference, and why the Kubernetes resource model has no vocabulary for communication graphs.
August 28, 2025

I have been thinking about a problem for about eight months and I think I finally understand what the problem actually is.
It is not the GPUs. It is not the scheduler. It is the abstraction.
Here is the thing I kept running into. You have a cluster. You need eight GPUs for a disaggregated inference deployment. You submit the job. Kubernetes finds eight available GPUs. It allocates them. The pods start. The job is slow. Not catastrophically slow. Inexplicably slow, in a way that takes a week to trace and does not obviously correlate with utilization metrics.
Then you run nvidia-smi topo -m and look at what you actually got.
Two GPUs on socket 0, connected to each other via NVLink. Three GPUs on socket 1, connected to each other via NVLink. Three more GPUs on a different node entirely, connected via PCIe to that node's fabric and to yours via InfiniBand. Kubernetes gave you eight GPUs. Just a different eight than the eight that would have made this job fast.
The scheduler requested a count. The hardware delivered a count. The topology was completely wrong.
This is the abstraction failure. The scheduler lives in a world where nvidia.com/gpu: 8 is a resource request. The physics of the hardware lives in a world where eight GPUs connected via NVLink is a completely different compute primitive from eight GPUs scattered across two NUMA domains and a network boundary. NVLink delivers 900 GB/s of bidirectional bandwidth between GPUs on the same node. PCIe Gen4 delivers about 64 GB/s. InfiniBand NDR delivers 400 Gbps, which is about 50 GB/s, with real-world effective throughput lower than that. You requested eight GPUs. You got eight GPUs. The communication paths between them are ten to eighteen times slower than what your job expected.
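A quick sanity check of those ratios, assuming the headline bandwidth figures above and a hypothetical 16 GB transfer. These are best-case floors; real effective throughput is lower on every link.

```python
# Headline bandwidth figures quoted above, in GB/s.
BANDWIDTH_GBPS = {
    "nvlink": 900.0,          # H100 NVLink, bidirectional aggregate
    "pcie_gen4_x16": 64.0,    # PCIe Gen4 x16, bidirectional aggregate
    "infiniband_ndr": 50.0,   # 400 Gbps line rate
}

def transfer_ms(size_gb: float, link: str) -> float:
    """Idealized transfer time in milliseconds, ignoring protocol overhead."""
    return size_gb / BANDWIDTH_GBPS[link] * 1000

# A hypothetical 16 GB transfer over each path:
for link, bw in BANDWIDTH_GBPS.items():
    print(f"{link:>15}: {transfer_ms(16, link):6.1f} ms  ({900.0 / bw:.0f}x vs NVLink)")
```

Same byte count, same "eight GPUs", and the worst path is 18x the best one.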
And NUMA makes it worse in a way that is invisible until you instrument it. Each socket has its own memory controller. CPU threads on socket 0 accessing memory attached to socket 1 go through the inter-socket interconnect (UPI on current Intel platforms, Infinity Fabric on AMD). DMA transfers from a GPU on socket 0 to memory pinned to socket 1 do the same thing. These are not errors. They do not produce exceptions. They produce variance. p50 latency looks fine. p99 latency starts looking wrong. You add monitoring. You see the variance. You do not see why.
"But topology-aware scheduling handles—" For training workloads, mostly yes. There are label-based placement rules, node affinity policies, the NUMA topology manager in Kubernetes, custom scheduler plugins that score nodes based on NVLink domain membership. Those exist. They help.
For disaggregated inference, the problem is structurally different. And this is the part I have not seen stated clearly enough.
Disaggregated inference splits a single user request across two fundamentally different compute phases running on two different pools of hardware. The prefill phase processes the input prompt in parallel. Compute-bound. Needs tensor core throughput. H100 SXM with 989 TFLOPS of BF16. The decode phase generates tokens autoregressively. Memory-bandwidth-bound. Needs fast HBM access. Different optimization target. Different hardware preference.
These two phases are not independent. When the prefill phase finishes computing the key-value cache for a request, it has to transfer that cache to the decode worker that will generate the response. That transfer happens over whatever connects them. If they are on the same node, NVLink. If they are on different nodes, InfiniBand. The latency of that transfer directly determines time-to-first-token for the user.
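To put a number on that, here is the KV cache for one long-prompt request, using hypothetical 70B-class model dimensions. The specific layer counts and head sizes below are illustrative assumptions, not figures from this post.

```python
# Hypothetical model dimensions (illustrative, 70B-class, grouped-query attention):
layers, kv_heads, head_dim = 80, 8, 128
dtype_bytes = 2                 # bf16
prompt_tokens = 8192

# K and V tensors, per layer, per token:
bytes_per_token = 2 * kv_heads * head_dim * dtype_bytes * layers
cache_gb = prompt_tokens * bytes_per_token / 1e9
print(f"KV cache for one request: {cache_gb:.2f} GB")

# That cache crosses the prefill-to-decode path once per request,
# and the crossing time lands directly in time-to-first-token.
print(f"over NVLink (900 GB/s): {cache_gb / 900 * 1e3:.1f} ms")
print(f"over IB NDR  (50 GB/s): {cache_gb / 50 * 1e3:.1f} ms")
```

Roughly 2.7 GB per request under these assumptions: a few milliseconds added to TTFT on NVLink, tens of milliseconds on InfiniBand, before any queuing.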
A scheduler that allocates these two pools separately, one after the other, through standard pod placement can put the prefill workers and decode workers anywhere in the cluster. They might end up with fast interconnects. They might end up with slow ones. The scheduler does not know the difference because no one told it to optimize for KV cache transfer latency between the two pools. The transfer path is not a resource in the Kubernetes resource model.
So you get a situation where the prefill cluster is fast and the decode cluster is fast and the path between them is slow and the whole system underperforms for reasons that are not visible in either cluster's health metrics.
This is the gap I have been staring at for eight months.
The insight I keep coming back to: the atomic unit of allocation in a disaggregated inference deployment is not a GPU. It is a serving topology.
A serving topology for a large-model disaggregated deployment is: a prefill pool of N compute-optimized GPUs, all within the same NVLink domain, with enough tensor core throughput to process the expected prompt distribution within the TTFT SLO. Plus a decode pool of M bandwidth-optimized GPUs, also NVLink-connected within their pool, with enough HBM bandwidth to generate tokens within the ITL SLO. Plus a transfer path between the two pools with enough bandwidth to move KV cache tensors without becoming the bottleneck. Plus a router that is aware of the KV cache state in the decode pool so it can route requests to workers that already hold relevant cached context.
That entire structure needs to be instantiated as a unit. Not as four separate resource requests that the scheduler resolves independently. As one atomic allocation that the scheduler either places correctly or defers until it can.
This is gang scheduling extended to topology-aware serving graphs. Not just "launch all the pods together" but "launch all the pods together with a placement that satisfies the communication constraints of the graph they form."
NVIDIA Dynamo is building toward this. The Planner component monitors KV cache pressure and prefill queue depth in real time and shifts GPU resources between pools proactively before SLOs are violated. Run:ai's gang scheduler treats the entire serving deployment as an atomic unit. These are real steps in the right direction.
But the scheduler still does not have native vocabulary for "I need a prefill-to-decode transfer path of at least 400 GB/s." That constraint lives outside the resource model. It gets encoded as node affinity rules and topology labels, which are workarounds for an abstraction that does not yet exist.
The abstraction that should exist: a resource type that represents a topology-compliant serving pipeline. Not a set of GPU counts but a specification of the communication graph: prefill pool bandwidth, decode pool bandwidth, inter-pool transfer capacity, router placement relative to both. You request the graph, not the hardware. The scheduler figures out which physical configuration satisfies it.
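One possible shape for that resource type, sketched as plain dataclasses. Every field name here is an invention for the sake of the sketch, not an existing Kubernetes or vendor API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PoolSpec:
    gpus: int                 # pool size
    min_intra_gbps: float     # every intra-pool GPU pair at least this fast
    profile: str              # e.g. "compute" (prefill) or "bandwidth" (decode)

@dataclass(frozen=True)
class ServingTopology:
    """What you request: the communication graph, not the hardware."""
    prefill: PoolSpec
    decode: PoolSpec
    min_transfer_gbps: float  # the prefill-to-decode KV cache path
    router_colocated: bool    # router placed within the decode pool's domain

# This post's running example, roughly, as one atomic request:
req = ServingTopology(
    prefill=PoolSpec(gpus=4, min_intra_gbps=900.0, profile="compute"),
    decode=PoolSpec(gpus=4, min_intra_gbps=900.0, profile="bandwidth"),
    min_transfer_gbps=400.0,
    router_colocated=True,
)
```

Nothing in this object names a node, a socket, or a rack. The scheduler's job is to find a physical configuration that satisfies it, or to hold the whole request until one exists.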
Until that exists, GPU orchestration for disaggregated inference is a manual process of translating communication requirements into placement hints and hoping the scheduler respects them. It mostly works. It wastes twenty to thirty percent of cluster capacity on placements that look valid and run slow. It produces p99 variance that takes weeks to diagnose.
I am working on what the type system for this looks like. I do not have it fully yet. I know what it needs to express. The question is what the API surface looks like that makes these constraints schedulable without requiring operators to encode the entire network topology of their cluster in YAML affinity rules.
If you have been thinking about this from a different angle I would genuinely like to compare notes.
the scheduler gave me eight gpus.
they were the wrong eight gpus.
not wrong in a way it knew. not wrong in a way anyone's dashboard caught. wrong in a way that only showed up in the p99 of the inter-pool KV cache transfer, which is not a metric anyone had configured because it was not a resource anyone had named.
the problem is not the hardware. the problem is that we have not built a type system for what the hardware needs to express.
you cannot schedule a communication graph if communication is not in the resource model.