Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks

The team behind Kimi.ai (Moonshot AI) has made a significant contribution to the open-source AI infrastructure space with the release of FlashKDA (Flash Kimi Delta Attention), a high-performance CUTLASS-based kernel implementation of the Kimi Delta Attention (KDA) mechanism. Available on GitHub under an MIT license, FlashKDA delivers prefill speedups of 1.72× to 2.22× over the flash-linear-attention baseline on NVIDIA H20 GPUs and works as a drop-in backend for the popular flash-linear-attention library.

What Is Kimi Delta Attention, and Why Does It Matter?

To understand FlashKDA, it helps to first understand where it sits in the LLM attention landscape.

Standard softmax attention has quadratic complexity with respect to sequence length — meaning that as you feed longer context into a model, compute costs grow extremely fast. This has driven a wave of research into linear attention mechanisms, which approximate or replace the softmax operation to achieve linear scaling. Kimi Delta Attention (KDA) is Moonshot AI’s contribution to this space: a linear attention mechanism that refines the Gated DeltaNet with a finer-grained, channel-wise gating mechanism, enabling more effective use of limited finite-state RNN memory.
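
To make the gating idea concrete, here is a minimal, unoptimized PyTorch sketch of a channel-wise gated delta rule of the kind KDA builds on. It is a sketch for intuition only, with illustrative tensor names rather than the exact KDA parameterization (the real gate is produced from g, A_log, dt_bias, and lower_bound, as described below):

import torch

def gated_delta_rule_sketch(q, k, v, alpha, beta, scale):
    # q, k: [T, K]; v: [T, V]; alpha: [T, K] per-channel decay in (0, 1);
    # beta: [T] write strength in (0, 1). Single head, for clarity.
    T, K = k.shape
    S = torch.zeros(K, v.shape[-1])               # finite-size recurrent state
    o = torch.empty(T, v.shape[-1])
    for t in range(T):
        S = alpha[t].unsqueeze(-1) * S            # channel-wise decay: KDA's refinement
                                                  # over Gated DeltaNet's scalar gate
        err = v[t] - k[t] @ S                     # delta-rule prediction error
        S = S + beta[t] * torch.outer(k[t], err)  # rank-1 correction of the state
        o[t] = scale * (q[t] @ S)                 # read out with the query
    return o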

KDA is not just a research prototype. It is the core attention mechanism in Kimi Linear, Moonshot AI’s open-source hybrid model with 48B total parameters and 3B activated parameters. Kimi Linear uses a 3:1 KDA-to-MLA (Multi-Head Latent Attention) ratio — three KDA layers for every one global attention layer — which reduces KV cache usage by up to 75% during long-sequence generation while achieving up to 6× higher decoding throughput at 1 million context length compared to full attention. FlashKDA is the production-grade CUDA kernel that makes that architecture fast during prefill.

Concretely, the KDA forward pass takes in queries (q), keys (k), values (v), a gate before activation (g), and beta logits (beta), along with a scale factor, an output tensor (out), and gate parameters: A_log (log-gate parameter per head), dt_bias (gate bias), and lower_bound (gate lower bound, ranging from -5.0 to 0). The sigmoid activation on beta is applied internally by the kernel. The mechanism also supports optional initial and final recurrent states — useful for multi-turn inference where you want to carry state across requests.

The recurrent formulation means the model can efficiently process long sequences during generation. But efficient prefill of these architectures still requires highly optimized GPU kernels — which is exactly what FlashKDA delivers.

Under the Hood: CUTLASS on Hopper

FlashKDA is built on CUTLASS, NVIDIA’s open-source library of CUDA C++ template abstractions for high-performance linear algebra and custom kernel development. CUTLASS allows developers to write kernels that take full advantage of NVIDIA’s Tensor Core architecture, and it’s the same foundation used by libraries like FlashAttention-3.

The library targets SM90 and above — meaning NVIDIA’s Hopper architecture (H100, H20) and newer. The minimum requirements are CUDA 12.9 and PyTorch 2.4. The codebase is predominantly CUDA (56.4%), with Python (36.2%) bindings and C++ (6.7%) glue code.
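
Before building, a quick environment check can save a failed compile. A minimal sanity check using standard PyTorch calls (the thresholds are the ones stated above):

import torch

major, minor = torch.cuda.get_device_capability()      # want (9, 0) or higher, i.e. SM90+
print(f"sm_{major}{minor}, PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
assert (major, minor) >= (9, 0), "FlashKDA targets SM90 (Hopper) and newer"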

The core API is flash_kda.fwd, which takes the following inputs:

q, k, v, g: all in bf16 with shape [B, T, H, K] or [B, T, H, V] (where g is the gate before activation)

beta: bf16 beta logits in shape [B, T, H] (sigmoid applied internally)

scale: fp32 scalar scaling factor

out: bf16 output tensor in shape [B, T, H, V]

A_log, dt_bias, lower_bound: fp32 gate parameters

initial_state, final_state: optional bf16 or fp32 recurrent states

cu_seqlens: optional int64 cumulative sequence lengths for variable-length batching

One current constraint: the kernel requires head dimensions K = V = 128.
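
Assembled from the parameter list above, a call looks roughly like the following. Treat it as a sketch: the tensors match the documented shapes and dtypes, but keyword names, argument order, and the exact shapes of the fp32 gate parameters (assumed per-head here) should be checked against the repository:

import torch
import flash_kda

B, T, H, D = 1, 8192, 96, 128                      # K = V = 128 is currently required
bf16, dev = torch.bfloat16, "cuda"
q, k, v, g = (torch.randn(B, T, H, D, dtype=bf16, device=dev) for _ in range(4))
beta = torch.randn(B, T, H, dtype=bf16, device=dev)        # logits; sigmoid applied in-kernel
out = torch.empty(B, T, H, D, dtype=bf16, device=dev)
A_log = torch.randn(H, dtype=torch.float32, device=dev)    # log-gate parameter per head
dt_bias = torch.randn(H, dtype=torch.float32, device=dev)  # gate bias
lower_bound = torch.full((H,), -5.0, dtype=torch.float32, device=dev)  # in [-5.0, 0]

flash_kda.fwd(q, k, v, g, beta, D ** -0.5, out, A_log, dt_bias, lower_bound)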

The variable-length batching support via cu_seqlens is particularly notable for production use. In real inference serving, requests in a batch rarely share the same sequence length. Being able to pack multiple sequences of different lengths into a single kernel call is a key requirement for high-throughput serving systems.
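
The convention is the familiar FlashAttention-style one, assuming FlashKDA follows it the same way: sequences are concatenated along the token dimension with batch size 1, and cu_seqlens holds the int64 prefix sums that mark each sequence's boundaries. For the varlen benchmark case below:

import torch

seq_lens = torch.tensor([1300, 547, 2048, 963, 271, 3063])
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int64), seq_lens.cumsum(0)]).cuda()
# -> [0, 1300, 1847, 3895, 4858, 5129, 8192]; the packed tensors hold T = 8192
#    tokens total, and sequence i occupies rows cu_seqlens[i]:cu_seqlens[i+1].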

Benchmark Results: 1.72× to 2.22× on H20

The benchmark results (as of April 20, 2026) compare flash_kda against fla_chunk_kda (the existing flash-linear-attention implementation) across a sequence length of T=8192, head dimension D=128, and two head count configurations: H=96 and H=64. Each benchmark ran with 30 warmup iterations, 200 measurement iterations, and 5 repeats.

For H=96:

| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed | 2.6219 | 4.5052 | 1.72× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 2.3420 | 4.5717 | 1.95× |
| Varlen, seq_lens=1024 × 8 | 2.0100 | 4.4668 | 2.22× |

For H=64:

| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed | 1.6199 | 2.9587 | 1.83× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 1.7027 | 3.0595 | 1.80× |
| Varlen, seq_lens=1024 × 8 | 1.3930 | 3.0412 | 2.18× |

The peak speedup of 2.22× appears in the uniform variable-length case (seq_lens=1024 × 8, eight sequences of length 1024 summing to T=8192). The fixed-length case delivers the floor of the range at 1.72×. Across both head configurations and all three sequence scenarios, FlashKDA consistently outperforms the flash-linear-attention baseline by a significant margin.
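
The protocol behind these numbers is a conventional CUDA timing loop. A minimal sketch of the stated 30-warmup / 200-iteration / 5-repeat recipe (how the repeats are aggregated into the published figure is not stated; the minimum is taken here):

import torch

def bench_ms(fn, warmup=30, iters=200, repeats=5):
    results = []
    for _ in range(repeats):
        for _ in range(warmup):
            fn()                                   # warm caches, JIT, clocks
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()                   # wait for all queued kernels
        results.append(start.elapsed_time(end) / iters)   # ms per call
    return min(results)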

Integration with flash-linear-attention

One of the most practical aspects of FlashKDA is its integration story. Once installed, FlashKDA is auto-dispatched from flash-linear-attention’s chunk_kda — which means existing codebases using flash-linear-attention don’t need manual wiring to take advantage of the faster kernel. The integration is tracked in flash-linear-attention PR #852.
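
In other words, call sites stay exactly as they are. A sketch of such a call site (the import path and the op's exact signature and return convention are assumptions for recent fla versions; see PR #852 for the authoritative wiring):

from fla.ops.kda import chunk_kda   # assumed import path; check your fla version

# Unchanged call site: with FlashKDA installed, chunk_kda dispatches to the
# CUTLASS kernel automatically on supported hardware. Some fla ops also return
# a final recurrent state; the return convention may vary by version.
o = chunk_kda(q, k, v, g, beta, scale=128 ** -0.5)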

Installation is straightforward:

git clone https://github.com/MoonshotAI/FlashKDA.git flash-kda
cd flash-kda
git submodule update --init --recursive
pip install -v .

The correctness test suite (tests/test_fwd.py) runs exact-match verification against a PyTorch reference implementation and cross-validates against flash-linear-attention. This gives AI devs a reliable baseline for auditing kernel behavior before deploying in production.
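
Running the suite is a standard pytest invocation (assuming pytest is installed in the same environment):

pip install pytest
pytest tests/test_fwd.py -v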

Key Takeaways

FlashKDA is Moonshot AI’s open-source CUTLASS-based CUDA kernel for Kimi Delta Attention (KDA), delivering 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on NVIDIA H20 GPUs.

KDA extends Gated DeltaNet with fine-grained, channel-wise gating — it’s the core attention mechanism behind Kimi Linear, a 48B-total / 3B-active-parameter hybrid model that reduces KV cache usage by up to 75% and achieves up to 6× higher decoding throughput at 1M context length.

The kernel targets SM90+ hardware (NVIDIA Hopper — H100, H20 and above), requires CUDA 12.9+ and PyTorch 2.4+, and currently supports a fixed head dimension of K = V = 128.

Variable-length batching is natively supported via the cu_seqlens parameter, allowing multiple sequences of different lengths to be packed into a single kernel call — a critical feature for high-throughput inference serving.

Once installed, FlashKDA is auto-dispatched from flash-linear-attention's chunk_kda, making it a drop-in performance upgrade for any existing codebase already using the flash-linear-attention library — no architecture changes required.

Check out the GitHub Repo.
