nobody is talking about the NIC hop.
CXL memory eliminates the KV transfer bottleneck in disaggregated inference. 9.8x TTFT improvement. The plumbing paper nobody read.
April 10, 2026

I've been deep in a rabbit hole of papers for three days and I want to tell you about one specific problem that I think is the most underappreciated bottleneck in disaggregated inference right now -- because solving it changes the economics of long-context serving in a way that matters.
You already know the setup. Disaggregated inference separates prefill and decode onto different hardware pools. Prefill is compute-bound, decode is memory-bandwidth-bound, they want different hardware, so you split them. NVIDIA Dynamo, vLLM's disaggregated mode, every serious inference team is either running this or planning to run it. The throughput numbers are real. The architecture is correct.
Here is the part that breaks it at long context, and it is embarrassingly simple once you see it.
After prefill finishes, you have a KV cache. All the key-value tensors for the input tokens -- the compressed representation of everything the model processed -- sitting in the prefill worker's GPU VRAM. The decode workers need it. They cannot generate a single token without it.
So how do you move it?
Over the network. RDMA. A NIC hop. The KV tensors go from GPU VRAM, through the PCIe bus to the host CPU's DRAM, out through the NIC, over the InfiniBand fabric, through the destination NIC, back through PCIe into the decode worker's DRAM, and finally across PCIe once more into the decode GPU's VRAM.
At short contexts this is fast enough. You don't notice it. The compute dominates.
At long contexts -- 15K tokens, 32K tokens, the context lengths that actually matter for the use cases driving Anthropic's $30B revenue number -- the KV transfer dominates total TTFT. Not contributes. Dominates. The time users are waiting for the first token is mostly spent moving KV tensors across a network fabric that was not designed for this.
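The arithmetic behind "dominates" is worth doing once. A minimal sketch, assuming a Llama-70B-class model shape (80 layers, 8 KV heads with GQA, head dim 128, fp16) -- the model dimensions and the ~11 GB/s effective RDMA figure are illustrative assumptions, not numbers from the papers:

```python
# Back-of-envelope: KV cache footprint and RDMA transfer time vs. context length.
# Model shape is an assumption (Llama-70B-like with GQA):
# 80 layers, 8 KV heads, head_dim 128, fp16 (2 bytes per element).
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM = 80, 8, 128, 2

def kv_cache_bytes(tokens: int) -> int:
    # 2x for keys and values
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * tokens

RDMA_BYTES_PER_SEC = 11e9  # ~10-12 GB/s effective; midpoint assumption

for tokens in (2_000, 15_000, 32_000):
    size = kv_cache_bytes(tokens)
    ms = size / RDMA_BYTES_PER_SEC * 1e3
    print(f"{tokens:>6} tokens: {size / 1e9:5.2f} GB of KV, ~{ms:4.0f} ms over RDMA")
```

Under these assumptions a 32K-token prompt carries roughly 10 GB of KV state, and the wire time alone approaches a second -- before queuing, before contention. That is first-token latency spent on a memcpy.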
The papers I've been reading are attacking this from a direction I didn't expect.
TraCT (December 2025, built on NVIDIA Dynamo and vLLM) and Beluga (Alibaba, November 2025) both make the same bet: eliminate the NIC hop entirely by putting a shared memory pool on the rack that both prefill and decode workers can access directly.
The technology is CXL -- Compute Express Link. An open interconnect standard built on the PCIe physical layer that allows CPUs, GPUs, and accelerators to access a shared memory pool with load/store semantics. Not a copy across a network. A direct memory access, the same way a GPU accesses its own VRAM, but pointed at a rack-scale pool of attached memory.
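What "load/store semantics" means in practice: the consumer reads the bytes in place instead of receiving a copy. A sketch using an anonymous mmap as a stand-in -- on a real CXL box the mapping would target the pool itself (commonly exposed on Linux as a dax device or a CPU-less NUMA node; no specific device path is implied here):

```python
import mmap

# Anonymous mapping as a stand-in for the rack-scale CXL pool.
pool = mmap.mmap(-1, 4096)

# "Prefill worker" stores KV bytes into the shared pool...
pool[0:11] = b"kv-tensors!"

# ...and the "decode worker" loads the same bytes directly:
# no NIC, no receive buffer, no copy across a fabric.
view = memoryview(pool)
assert bytes(view[0:11]) == b"kv-tensors!"
```

The whole point is that the second worker's read is an ordinary memory access against the same physical pool, not a network receive.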
The numbers from TraCT: 9.8x reduction in average TTFT compared to RDMA transfer. 6.2x reduction in P99 latency. The improvement is largest at 6,000-token inputs -- exactly the regime where long-context serving costs the most and matters the most.
Beluga's numbers on the same problem: 89.6% reduction in TTFT. 3.41x to 9.47x higher QPS on cache-hit runs compared to the RDMA-based MoonCake baseline.
These are not marginal improvements. These are the kind of numbers that show up when you were constrained by the wrong bottleneck the whole time and you finally eliminated it.
The part that took me a while to understand: CXL memory is not GPU VRAM. It's not as fast. Access latency is around 640 nanoseconds in typical CXL 2.0 deployments -- about 4-6x slower than local HBM. But it is dramatically cheaper (4-5x lower cost per GB than HBM), dramatically higher capacity (100+ terabyte pools in production now), and 200-500x lower latency than NVMe SSD.
What CXL actually creates is a new memory tier -- between GPU VRAM and CPU DRAM in latency, between CPU DRAM and NVMe in capacity. And for the specific use case of storing KV caches in disaggregated serving, the latency is fine. The KV tensors don't need HBM-speed access. They need to be there when the decode worker asks for them without the overhead of a full network round trip.
"But the PCIe bandwidth to access CXL is lower than--" It is. And it still beats RDMA for KV transfer because RDMA over 100Gbps InfiniBand delivers roughly 10-12 GB/s effective throughput to the receiving GPU, and CXL on PCIe 5.0 x16 delivers around 60 GB/s. The bandwidth advantage compounds with the eliminated NIC queuing overhead. The NIC is the bottleneck -- not because of raw bandwidth, but because of queuing, contention, and the variability that creates p99 spikes.
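Using the bandwidth figures above, the raw wire-time gap for a fixed payload is easy to check -- the 10 GB payload here is an illustrative assumption:

```python
# Same KV payload, two transfer paths (bandwidth figures from the text).
payload_gb = 10.0   # assumed KV payload for a long-context request
rdma_gbps = 11.0    # ~10-12 GB/s effective over 100Gbps InfiniBand
cxl_gbps = 60.0     # CXL DMA over PCIe 5.0 x16

rdma_ms = payload_gb / rdma_gbps * 1e3
cxl_ms = payload_gb / cxl_gbps * 1e3
print(f"RDMA: {rdma_ms:.0f} ms, CXL DMA: {cxl_ms:.0f} ms, "
      f"raw speedup {rdma_ms / cxl_ms:.1f}x")
```

Bandwidth alone accounts for roughly a 5.5x speedup; the gap between that and the measured 9.8x average-TTFT reduction is the queuing and contention overhead the text describes -- which is also why the p99 improvement is so pronounced.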
TraCT measures this directly: even without prefix reuse, just swapping the KV transfer path from RDMA to GPU-CXL DMA reduces TTFT and makes the latency distribution tighter. Tighter p99 is sometimes worth more than lower average in production, where SLO violations compound.
CXL 4.0 dropped in November 2025 -- the spec, not the hardware. It doubles the bandwidth to 128 GT/s via PCIe 7.0, introduces bundled ports that aggregate multiple connections into a single 1.5 TB/s logical link, and explicitly targets multi-rack memory pools. The production timeline the CXL Consortium is advertising: CXL 2.0 switches available now (XConn has the XC50256, which Alibaba used for Beluga), CXL 3.x deployments late 2026, CXL 4.0 multi-rack systems 2027+.
NVIDIA Blackwell supports CXL on the Grace CPU in Grace Hopper systems. AMD MI300X includes it through the CPU chiplet. The hardware integration is happening.
The thing I find genuinely interesting about all of this: the inference serving community spent 2024 and most of 2025 working on disaggregation -- how to split prefill and decode for better utilization. All of that work is correct and useful. And it created a new bottleneck that nobody was fully accounting for in the original architecture: the inter-worker KV transfer.
CXL addresses that bottleneck at the hardware level, without changing the inference framework architecture. TraCT integrates with Dynamo's disaggregated pipeline with a few lines of code change to vLLM's KV connector layer. That is a real property. The hardware does the work that the network was doing, faster and with lower variance.
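The shape of that "few lines of code change" claim is worth making concrete. A hypothetical sketch -- the interface and class names below are illustrative, not vLLM's or TraCT's actual API -- of why swapping the transport is cheap when KV movement is already behind a connector abstraction:

```python
from abc import ABC, abstractmethod

class KVConnector(ABC):
    """Hypothetical connector interface: the framework asks a connector
    to move KV blocks and never sees the transport underneath."""
    @abstractmethod
    def send_kv(self, request_id: str, blocks: bytes) -> None: ...
    @abstractmethod
    def recv_kv(self, request_id: str) -> bytes: ...

class RDMAConnector(KVConnector):
    """Baseline path: the blocks are copied across a fabric (the NIC hop)."""
    def __init__(self):
        self._fabric = {}  # stand-in for the RDMA fabric
    def send_kv(self, request_id, blocks):
        self._fabric[request_id] = bytes(blocks)  # the copy
    def recv_kv(self, request_id):
        return self._fabric.pop(request_id)

class CXLConnector(KVConnector):
    """CXL path: write once into the shared pool, publish (offset, length);
    the decode side reads in place -- a load, not a network receive."""
    def __init__(self, pool: bytearray):
        self._pool = pool      # stand-in for the mapped CXL region
        self._cursor = 0
        self._index = {}       # request_id -> (offset, length)
    def send_kv(self, request_id, blocks):
        off, n = self._cursor, len(blocks)
        self._pool[off:off + n] = blocks
        self._index[request_id] = (off, n)
        self._cursor += n
    def recv_kv(self, request_id):
        off, n = self._index.pop(request_id)
        return bytes(self._pool[off:off + n])

# Swapping the transport is one constructor change at the call site:
conn = CXLConnector(bytearray(1 << 20))
conn.send_kv("req-1", b"\x01" * 1024)
assert conn.recv_kv("req-1") == b"\x01" * 1024
```

Everything above the connector -- scheduling, batching, the disaggregated pipeline itself -- stays untouched, which is the architectural point.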
The reason I am writing about this now is that most people who follow inference engineering closely are tracking the model-level stuff -- new architectures, quantization, speculative decoding. The memory interconnect papers don't get the same attention. They are hard to read, they assume familiarity with systems research, and the results are impressive but require enough context to interpret that most engineers skip them.
The skip is a mistake in this case. The NIC hop bottleneck in disaggregated serving is real and it gets worse as context windows grow -- which is the direction everything is going. The fix is coming in the hardware and it is already measurable in the research.
If you are planning inference infrastructure purchases for 2026 or 2027, CXL compatibility is worth putting on the evaluation checklist alongside GPU specs. The clusters that don't have CXL-capable rack architecture are going to look different from the ones that do, and the difference shows up in the p99 TTFT numbers for long-context workloads.
Which is where the users are.
the nic hop.
that's the bottleneck nobody in serving infrastructure is talking about.
9.8x average TTFT. 6.2x p99. from swapping one data transfer path for another.
it's always the bottleneck that looked like plumbing.
the hard part of this job is that the interesting breakthroughs are in papers nobody reads. this is one of them.
P.S. The CCCL paper (February 2026, Carnegie Mellon) goes further -- using CXL shared memory to replace RDMA for GPU collective operations (all-reduce, all-gather) entirely, not just KV transfer. Node-spanning GPU collectives without traditional networking. That is a different paper and a different rabbit hole but if you found this one interesting, go find that one.
i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.
no spam. no sequence. just the note, when it exists.