# Kimi Cross-Datacenter Prefill/Decode Disaggregation
Kimi.ai presents a production implementation of Prefill/Decode (PD) disaggregation across datacenters with heterogeneous hardware, addressing the historical bottleneck of KV cache transfer overhead. The solution leverages Kimi Linear, a hybrid model architecture that significantly reduces KV cache size, making cross-datacenter PD disaggregation economically viable. Benchmarks on a 20× scaled-up model demonstrate a 1.54× throughput improvement and a 64% reduction in P90 Time-to-First-Token (TTFT), directly improving token cost economics.
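The economics hinge on KV cache size versus cross-datacenter bandwidth. As a rough back-of-envelope sketch (all model and link figures below are illustrative assumptions, not Kimi's published numbers), a hybrid architecture that keeps full attention in only a fraction of layers shrinks the cache, and hence the transfer time, proportionally:

```python
# Back-of-envelope KV cache transfer estimate.
# All model/link parameters below are illustrative assumptions,
# not figures published by Kimi.ai.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Size of the KV cache: 2 tensors (K and V) per full-attention layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

def transfer_seconds(nbytes, gbits_per_sec):
    """Time to ship a cache over a cross-datacenter link."""
    return nbytes * 8 / (gbits_per_sec * 1e9)

SEQ_LEN = 128_000       # long-context request (assumed)
FULL_LAYERS = 64        # assumed layer count for a dense-attention baseline
HYBRID_FULL_ATTN = 16   # hybrid model: full attention in 1 of every 4 layers;
                        # linear-attention layers keep O(1) state, ignored here

dense = kv_cache_bytes(FULL_LAYERS, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
hybrid = kv_cache_bytes(HYBRID_FULL_ATTN, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"dense  cache: {dense / 1e9:.1f} GB, "
      f"transfer @ 10 Gb/s: {transfer_seconds(dense, 10):.1f} s")
print(f"hybrid cache: {hybrid / 1e9:.1f} GB, "
      f"transfer @ 10 Gb/s: {transfer_seconds(hybrid, 10):.1f} s")
```

Under these assumptions the hybrid cache is 4× smaller, turning a tens-of-seconds cross-DC transfer into single digits; this is the lever that makes disaggregation across datacenters viable.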
## Integration Strategy
### When to Use This?
Ideal for:
- High-volume inference providers seeking infrastructure cost reduction
- Global AI applications with users distributed across regions
- Multi-tenant LLM services that can share prefill infrastructure
- Long-context applications where KV cache efficiency directly impacts cost
Not suitable for:
- Single-region, ultra-latency-sensitive applications where any cross-datacenter hop is unacceptable
- Organizations without existing multi-datacenter infrastructure
- Low-volume deployments where disaggregation complexity exceeds cost savings
### How to Integrate?
Current Status: Research/publication stage (arXiv:2604.15039v1). Production availability not confirmed.
Likely Integration Path:
- API-level abstraction for "Prefill-as-a-Service"
- SDK for specifying datacenter preferences
- SDK for hybrid local/cloud deployment
Note: Specific SDK availability, API specifications, and migration paths from existing architectures are not documented in the source material.
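Since no SDK or API specification is documented, the following is a purely hypothetical sketch of what a "Prefill-as-a-Service" client with datacenter preferences might look like; every class and method name here is invented for illustration:

```python
# Hypothetical Prefill-as-a-Service client sketch.
# Every name below (PrefillClient, DatacenterPreference, place_prefill) is
# invented for illustration; no such SDK is documented in the source material.
from dataclasses import dataclass, field

@dataclass
class DatacenterPreference:
    preferred: list             # ordered datacenter IDs to try for prefill
    max_transfer_ms: int = 500  # reject placements whose KV transfer exceeds this

@dataclass
class PrefillClient:
    # Measured (or configured) KV transfer latency to each datacenter, in ms.
    link_latency_ms: dict = field(default_factory=dict)

    def place_prefill(self, pref: DatacenterPreference):
        """Pick the first preferred datacenter within the transfer budget."""
        for dc in pref.preferred:
            if self.link_latency_ms.get(dc, float("inf")) <= pref.max_transfer_ms:
                return dc
        return None  # fall back to local (non-disaggregated) prefill

client = PrefillClient(link_latency_ms={"us-east": 900, "eu-west": 120})
pref = DatacenterPreference(preferred=["us-east", "eu-west"], max_transfer_ms=500)
print(client.place_prefill(pref))  # "us-east" exceeds the budget, so "eu-west"
```

The design point this sketch illustrates is that placement must be latency-budget-aware: a preferred datacenter is only usable if the KV transfer fits within the request's TTFT budget, otherwise the system should degrade to local prefill.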
## Compatibility
Inferred Support Matrix:
| Component | Status |
|---|---|
| PyTorch | Likely (industry standard) |
| CUDA versions | Not disclosed |
| Kubernetes orchestration | Expected for production scale |
| Existing vLLM/TensorRT-LLM pipelines | Unknown—may require adaptation |
Source: @Kimi_Moonshot
Reference: Prefill-as-a-Service, arXiv:2604.15039v1
DevRadar Analysis Date: 2026-04-20