Kimi AI Open-Sources FlashKDA: CUTLASS-Based Delta Attention Delivers 1.72×–2.22× Prefill Speedup on NVIDIA H20
Kimi.ai open-sources FlashKDA, a CUTLASS-based implementation of Kimi Delta Attention kernels designed for high-performance LLM inference. The implementation delivers 1.72×–2.22× prefill speedup on NVIDIA H20 hardware compared to the flash-linear-attention baseline. Key technical aspects include full drop-in backend compatibility with the flash-linear-attention library, enabling straightforward integration for developers currently using FLA. This represents a concrete optimization path for accelerating transformer inference using Delta Attention mechanisms.
Moonshot AI (Kimi) has open-sourced FlashKDA, a CUDA kernel implementation of its Kimi Delta Attention mechanism built on NVIDIA's CUTLASS library. The kernels achieve 1.72×–2.22× faster prefill than flash-linear-attention (FLA) on H20 hardware while maintaining full drop-in backend compatibility, so projects already built on FLA can adopt FlashKDA with minimal code changes.
Integration Strategy
When to Use This?
High-Value Use Cases:
- Long-context LLM inference (16K+ tokens) where prefill time dominates latency
- Batch inference workloads where throughput improvements compound across requests
- Applications already dependent on flash-linear-attention looking for a low-effort speedup without architectural changes
- Deployment on H20-equipped infrastructure (common in Chinese datacenter deployments)
Lower Priority Scenarios:
- Short-context applications where prefill is negligible
- Situations where memory footprint is the primary constraint (Delta Attention requires additional state storage)
- Environments limited to older GPU architectures without Hopper support
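The long- vs. short-context distinction above follows directly from Amdahl's law: a prefill-only speedup is bounded by prefill's share of total request latency. A quick model (the fractions below are illustrative, not measured):

```python
def end_to_end_speedup(prefill_frac: float, prefill_speedup: float) -> float:
    """Amdahl's-law estimate: overall request speedup when only the
    prefill portion of latency is accelerated."""
    return 1.0 / ((1.0 - prefill_frac) + prefill_frac / prefill_speedup)

# Long-context: prefill dominates latency (say 80%), 2x kernel speedup
print(round(end_to_end_speedup(0.8, 2.0), 2))  # 1.67
# Short-context: prefill is only 10% of latency; the same kernel barely helps
print(round(end_to_end_speedup(0.1, 2.0), 2))  # 1.05
```

This is why the 1.72×–2.22× kernel-level numbers translate into meaningful wins mainly for long-context and batch prefill workloads.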
How to Integrate?
Migration Path from flash-linear-attention:
```python
# Before (using FLA defaults); module and function names are illustrative
from flash_linear_attention import flash_attn

# After (swapping to the FlashKDA backend)
import flashkda  # registers the CUTLASS kernels; may instead be a config flag in FLA
from flash_linear_attention import flash_attn  # same API, now backed by FlashKDA
```
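Because the swap is drop-in, it can also be gated behind a runtime flag so a deployment can fall back to stock FLA without redeploying code. A minimal sketch (the flag name and backend identifiers are hypothetical, not FlashKDA's actual configuration API):

```python
import os

def select_attention_backend() -> str:
    """Choose the attention backend from an environment flag so the
    FlashKDA swap can be rolled back without a code change.
    Flag name and backend identifiers are hypothetical."""
    backend = os.environ.get("ATTN_BACKEND", "fla")
    if backend not in ("fla", "flashkda"):
        raise ValueError(f"unknown attention backend: {backend}")
    return backend

print(select_attention_backend())  # "fla" unless ATTN_BACKEND is set
```

Gating the backend this way also makes A/B latency comparisons against the FLA baseline straightforward.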
Installation (inferred from typical CUDA library patterns):

```shell
pip install flashkda  # package name inferred; confirm in the repository
# Requires: CUDA 11.8+ or 12.x, PyTorch 2.0+, NVIDIA GPU with compute capability 8.0+
```
The actual installation commands and specific requirements should be confirmed from the GitHub repository as exact build dependencies may vary.
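The inferred minimums can be encoded as a preflight check before attempting the install. A sketch under those assumptions (thresholds are the inferred ones above, not confirmed requirements):

```python
def meets_inferred_requirements(cuda_version: tuple, compute_capability: tuple) -> bool:
    """Check the inferred minimums: CUDA 11.8+ and compute capability
    8.0+ (Ampere or newer; Hopper H100/H20 reports 9.0). Confirm the
    real thresholds against the FlashKDA repository before relying on this."""
    return cuda_version >= (11, 8) and compute_capability >= (8, 0)

print(meets_inferred_requirements((12, 4), (9, 0)))  # Hopper (H20/H100): True
print(meets_inferred_requirements((11, 4), (7, 0)))  # Volta (V100): False
```

In practice the compute-capability tuple would come from `torch.cuda.get_device_capability()` on the target GPU.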
Compatibility
| Component | Expected Compatibility | Status |
|---|---|---|
| PyTorch | 2.0+ | Inferred |
| CUDA | 11.8 / 12.x | Inferred |
| NVIDIA Architectures | Hopper (H100/H20), Ampere (A100) | Inferred from CUTLASS support |
| FLA Versions | Recent releases | Confirmed via "drop-in" claim |
| Triton Backend | Not applicable (CUTLASS-based) | N/A |
Conclusion
FlashKDA represents a credible engineering contribution to the LLM inference optimization landscape. The combination of CUTLASS-backed kernels, FLA compatibility, and measured performance gains on H20 hardware addresses a real need for production deployments. The open-source release enables community scrutiny while the drop-in architecture lowers adoption barriers.
Verdict: Worth evaluating for projects already invested in flash-linear-attention and running on compatible hardware. Others should monitor for broader hardware support and independent benchmark validation.
Source: @Kimi_Moonshot
Repository: MoonshotAI/FlashKDA
Published: November 2024 (inferred from tweet ID)
DevRadar Analysis Date: 2026-04-21