DevRadar
🌐 Kimi Moonshot · Significant

Kimi AI Open-Sources FlashKDA: CUTLASS-Based Delta Attention Delivers 1.72×–2.22× Prefill Speedup on NVIDIA H20

Kimi.ai has open-sourced FlashKDA, a CUTLASS-based implementation of the Kimi Delta Attention (KDA) kernels designed for high-performance LLM inference. The kernels deliver a 1.72×–2.22× prefill speedup on NVIDIA H20 hardware over the flash-linear-attention (FLA) baseline while remaining a full drop-in backend for the FLA library, so developers currently using FLA can integrate them with minimal changes. This represents a concrete optimization path for accelerating inference of models built on Delta Attention.

Kimi.ai · Tuesday, April 21, 2026 · Original source


Summary

Moonshot AI (Kimi) has open-sourced FlashKDA, a CUDA kernel implementation of its Kimi Delta Attention mechanism built on NVIDIA's CUTLASS library. The implementation achieves 1.72×–2.22× faster prefill than flash-linear-attention on NVIDIA H20 hardware, while maintaining full drop-in backend compatibility with existing FLA-based projects. Developers already using flash-linear-attention can integrate FlashKDA with minimal code changes.
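For orientation, Kimi Delta Attention belongs to the delta-rule family of linear attention. The sketch below is a minimal pure-Python rendering of the classic (ungated) delta-rule recurrence that KDA builds on; it illustrates the per-token state update that kernels like FlashKDA accelerate, and is not the actual KDA algorithm or its API (KDA additionally applies fine-grained gating/decay to the state):

```python
# Classic delta-rule linear attention (illustrative only; KDA layers
# gating/decay on top of this recurrence).
# State S is a (d_v x d_k) matrix updated once per token:
#   S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
#   o_t = S_t q_t

def delta_rule_attention(qs, ks, vs, betas):
    """qs, ks: lists of length-d_k vectors; vs: lists of length-d_v vectors;
    betas: per-token write strengths in [0, 1]."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_k for _ in range(d_v)]  # running associative state
    outputs = []
    for q, k, v, beta in zip(qs, ks, vs, betas):
        # Current prediction for key k: S @ k
        pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
        # Delta-rule update: move S toward mapping k -> v at rate beta
        for i in range(d_v):
            err = beta * (v[i] - pred[i])
            for j in range(d_k):
                S[i][j] += err * k[j]
        # Read out with the query: o = S @ q
        outputs.append([sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outputs
```

With beta = 1 and a unit-norm key, the update exactly overwrites the stored k → v association instead of accumulating onto it, which is what distinguishes the delta rule from plain additive linear attention.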

Integration Strategy

When to Use This?

High-Value Use Cases:

  • Long-context LLM inference (16K+ tokens) where prefill time dominates latency
  • Batch inference workloads where throughput improvements compound across requests
  • Applications already dependent on flash-linear-attention seeking low-effort performance gains
  • Deployment on H20-equipped infrastructure (common in Chinese datacenter deployments)

Lower Priority Scenarios:

  • Short-context applications where prefill is negligible
  • Situations where memory footprint is the primary constraint (Delta Attention requires additional state storage)
  • Environments limited to older GPU architectures without Hopper support
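A rough way to decide which bucket a workload falls into is to estimate the fraction of end-to-end latency spent in prefill. The sketch below uses a crude linear cost model (parallel prefill is cheap per prompt token, serial decode is expensive per output token); the default constants are illustrative placeholders, not measurements of any backend:

```python
def prefill_fraction(prompt_tokens, output_tokens,
                     t_prefill_per_token=0.05e-3, t_decode_per_token=20e-3):
    """Crude latency model: prefill processes the prompt in parallel
    (cheap per token), decode emits output tokens serially (expensive
    per token). Constants are placeholders; measure your own deployment."""
    prefill = prompt_tokens * t_prefill_per_token
    decode = output_tokens * t_decode_per_token
    return prefill / (prefill + decode)
```

Under these placeholder constants, a 32K-token prompt with a 200-token completion spends roughly 30% of its latency in prefill, while a 1K-token prompt with the same completion spends about 1%; only the former materially benefits from a ~2× prefill kernel.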

How to Integrate?

Migration Path from flash-linear-attention:

# Before (using the FLA default backend)
from flash_linear_attention import flash_attn  # module/function names illustrative

# After (swapping in the FlashKDA backend)
import flashkda  # or a backend configuration flag in FLA
from flash_linear_attention import flash_attn  # same API, now served by FlashKDA kernels

Installation (inferred from typical CUDA library patterns):

pip install flashkda
# Requires: CUDA 11.8+ or 12.x, PyTorch 2.0+, NVIDIA GPU with compute capability 8.0+

The actual installation commands and specific requirements should be confirmed from the GitHub repository as exact build dependencies may vary.

Compatibility

Component              Expected Compatibility             Status
PyTorch                2.0+                               Inferred
CUDA                   11.8 / 12.x                        Inferred
NVIDIA Architectures   Hopper (H100/H20), Ampere (A100)   Inferred from CUTLASS support
FLA Versions           Recent releases                    Confirmed via "drop-in" claim
Triton Backend         Not applicable (CUTLASS-based)     N/A

Conclusion

FlashKDA represents a credible engineering contribution to the LLM inference optimization landscape. The combination of CUTLASS-backed kernels, FLA compatibility, and measured performance gains on H20 hardware addresses a real need for production deployments. The open-source release enables community scrutiny while the drop-in architecture lowers adoption barriers.

Verdict: Worth evaluating for projects already invested in flash-linear-attention and running on compatible hardware. Others should monitor for broader hardware support and independent benchmark validation.
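For independent validation, an A/B harness of the following shape is usually enough: time the prefill path of both backends on identical inputs and report the ratio. The two callables here are stand-ins; in a real run they would wrap FLA's and FlashKDA's forward passes, with CUDA synchronization around each timed region:

```python
import time

def measure_speedup(baseline_fn, candidate_fn, warmup=3, iters=10):
    """Return baseline_time / candidate_time (>1 means the candidate is faster).
    In a real GPU benchmark, call torch.cuda.synchronize() before each
    timestamp so asynchronous kernel launches are fully timed."""
    def timed(fn):
        for _ in range(warmup):   # warm up caches / JIT / autotuners
            fn()
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters
    return timed(baseline_fn) / timed(candidate_fn)
```

Running such a harness across a sweep of sequence lengths and batch sizes, rather than a single configuration, is what would confirm or qualify the reported 1.72×–2.22× range.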


Source: @Kimi_Moonshot
Repository: MoonshotAI/FlashKDA
Published: November 2024 (inferred from tweet ID)
DevRadar Analysis Date: 2026-04-21