Kimi AI Open-Sources FlashKDA: CUTLASS-Based Delta Attention Delivers 1.72×–2.22× Prefill Speedup on NVIDIA H20
Kimi.ai open-sources FlashKDA, a CUTLASS-based implementation of Kimi Delta Attention kernels designed for high-performance LLM inference. The implementation delivers 1.72×–2.22× prefill speedup on NVIDIA H20 hardware compared to the flash-linear-attention baseline. Key technical aspects include full drop-in backend compatibility with the flash-linear-attention library, enabling straightforward integration for developers currently using FLA. This represents a concrete optimization path for accelerating transformer inference using Delta Attention mechanisms.
Moonshot AI (Kimi) has open-sourced FlashKDA, a CUDA kernel implementation of its Kimi Delta Attention mechanism built on NVIDIA's CUTLASS library. The kernels achieve 1.72×–2.22× faster prefill than flash-linear-attention (FLA) on H20 hardware while maintaining full drop-in backend compatibility, so projects already built on FLA can adopt FlashKDA with minimal code changes.
Integration Strategy
When to Use This?
High-Value Use Cases:
- Long-context LLM inference (16K+ tokens) where prefill time dominates latency
- Batch inference workloads where throughput improvements compound across requests
- Applications already dependent on flash-linear-attention looking for a low-effort speedup without architectural changes
- Deployment on H20-equipped infrastructure (common in Chinese datacenter deployments)
Lower Priority Scenarios:
- Short-context applications where prefill is negligible
- Situations where memory footprint is the primary constraint (Delta Attention requires additional state storage)
- Environments limited to older GPU architectures without Hopper support
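The long- vs. short-context distinction above follows directly from Amdahl's law: a prefill-only speedup is bounded by prefill's share of total request latency. A quick model (the fractions below are illustrative, not measured):

```python
def end_to_end_speedup(prefill_frac: float, prefill_speedup: float) -> float:
    """Amdahl's-law estimate: overall request speedup when only the
    prefill portion of latency is accelerated."""
    return 1.0 / ((1.0 - prefill_frac) + prefill_frac / prefill_speedup)

# Long-context: prefill dominates latency (say 80%), 2x kernel speedup
print(round(end_to_end_speedup(0.8, 2.0), 2))  # 1.67
# Short-context: prefill is only 10% of latency; the same kernel barely helps
print(round(end_to_end_speedup(0.1, 2.0), 2))  # 1.05
```

This is why the 1.72×–2.22× kernel-level numbers translate into meaningful wins mainly for long-context and batch prefill workloads.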
How to Integrate?
Migration Path from flash-linear-attention:
```python
# Before (using FLA defaults); module and function names are illustrative
from flash_linear_attention import flash_attn

# After (swapping to the FlashKDA backend)
import flashkda  # registers the CUTLASS kernels; may instead be a config flag in FLA
from flash_linear_attention import flash_attn  # same API, now backed by FlashKDA
```
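Because the swap is drop-in, it can also be gated behind a runtime flag so a deployment can fall back to stock FLA without redeploying code. A minimal sketch (the flag name and backend identifiers are hypothetical, not FlashKDA's actual configuration API):

```python
import os

def select_attention_backend() -> str:
    """Choose the attention backend from an environment flag so the
    FlashKDA swap can be rolled back without a code change.
    Flag name and backend identifiers are hypothetical."""
    backend = os.environ.get("ATTN_BACKEND", "fla")
    if backend not in ("fla", "flashkda"):
        raise ValueError(f"unknown attention backend: {backend}")
    return backend

print(select_attention_backend())  # "fla" unless ATTN_BACKEND is set
```

Gating the backend this way also makes A/B latency comparisons against the FLA baseline straightforward.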
Installation (inferred from typical CUDA library patterns):

```shell
pip install flashkda  # package name inferred; confirm in the repository
# Requires: CUDA 11.8+ or 12.x, PyTorch 2.0+, NVIDIA GPU with compute capability 8.0+
```
The actual installation commands and specific requirements should be confirmed from the GitHub repository as exact build dependencies may vary.
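The inferred minimums can be encoded as a preflight check before attempting the install. A sketch under those assumptions (thresholds are the inferred ones above, not confirmed requirements):

```python
def meets_inferred_requirements(cuda_version: tuple, compute_capability: tuple) -> bool:
    """Check the inferred minimums: CUDA 11.8+ and compute capability
    8.0+ (Ampere or newer; Hopper H100/H20 reports 9.0). Confirm the
    real thresholds against the FlashKDA repository before relying on this."""
    return cuda_version >= (11, 8) and compute_capability >= (8, 0)

print(meets_inferred_requirements((12, 4), (9, 0)))  # Hopper (H20/H100): True
print(meets_inferred_requirements((11, 4), (7, 0)))  # Volta (V100): False
```

In practice the compute-capability tuple would come from `torch.cuda.get_device_capability()` on the target GPU.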
Compatibility
| Component | Expected Compatibility | Status |
|---|---|---|
| PyTorch | 2.0+ | Inferred |
| CUDA | 11.8 / 12.x | Inferred |
| NVIDIA Architectures | Hopper (H100/H20), Ampere (A100) | Inferred from CUTLASS support |
| FLA Versions | Recent releases | Confirmed via "drop-in" claim |
| Triton Backend | Not applicable (CUTLASS-based) | N/A |
Conclusion
FlashKDA represents a credible engineering contribution to the LLM inference optimization landscape. The combination of CUTLASS-backed kernels, FLA compatibility, and measured performance gains on H20 hardware addresses a real need for production deployments. The open-source release enables community scrutiny while the drop-in architecture lowers adoption barriers.
Verdict: Worth evaluating for projects already invested in flash-linear-attention and running on compatible hardware. Others should monitor for broader hardware support and independent benchmark validation.
Source: @Kimi_Moonshot
Repository: MoonshotAI/FlashKDA
Published: November 2024 (inferred from tweet ID)
DevRadar Analysis Date: 2026-04-21