The team behind Kimi.ai (Moonshot AI) has made a significant contribution to open-source AI infrastructure with the release of FlashKDA (Flash Kimi Delta Attention), a high-performance CUTLASS-based kernel implementation of the Kimi Delta Attention (KDA) mechanism. The FlashKDA library is available on GitHub under the MIT License. It delivers prefill speedups of 1.72× to 2.22× over the flash-linear-attention baseline on NVIDIA H20 GPUs and acts as a drop-in backend for the popular flash-linear-attention library.
What is Kimi Delta Attention, and why does it matter?
To understand FlashKDA, it helps to first understand where it sits in the landscape of attention mechanisms for LLMs.
Standard softmax attention has quadratic complexity with respect to sequence length, which means that feeding a longer context into the model makes compute costs grow very quickly. This has prompted a wave of research into linear attention mechanisms that approximate or replace softmax attention to achieve linear scaling. Kimi Delta Attention (KDA) is Moonshot AI's contribution to this space: a linear attention mechanism that improves on Gated DeltaNet with a fine-grained, channel-wise gating mechanism, allowing more efficient use of the finite-state memory of RNN-style models.
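To make the channel-wise gating idea concrete, here is a minimal PyTorch sketch of a gated delta-rule recurrence in the spirit of Gated DeltaNet. It is a naive reference loop for illustration only, not the actual KDA formulation or the chunked algorithm the kernel implements; the tensor names and the exact gating form are assumptions.

```python
import torch

def gated_delta_rule_reference(q, k, v, alpha, beta):
    """Naive per-timestep gated delta-rule recurrence (illustrative only).

    q, k:   [T, K]  queries and keys
    v:      [T, V]  values
    alpha:  [T, K]  channel-wise decay gate in (0, 1)
    beta:   [T]     per-step update strength in (0, 1)
    Returns o: [T, V] outputs.
    """
    T, K = k.shape
    V = v.shape[-1]
    S = torch.zeros(K, V)          # fixed-size recurrent state
    o = torch.empty(T, V)
    for t in range(T):
        # Channel-wise gate: each key channel of the state decays at its own rate.
        S = alpha[t].unsqueeze(-1) * S
        # Delta rule: write only the residual between the new value and what the
        # state already predicts for this key, scaled by beta.
        pred = k[t] @ S                                  # [V]
        S = S + beta[t] * torch.outer(k[t], v[t] - pred)
        # Read out with the query.
        o[t] = q[t] @ S
    return o

# Tiny usage example with random data.
T, K, V = 16, 8, 8
q, k = torch.randn(T, K), torch.randn(T, K)
v = torch.randn(T, V)
alpha = torch.sigmoid(torch.randn(T, K))   # channel-wise decay
beta = torch.sigmoid(torch.randn(T))       # update strength
o = gated_delta_rule_reference(q, k, v, alpha, beta)
```

The point of the channel-wise gate is that each channel of the fixed-size state can forget at its own rate, rather than the whole state sharing a single scalar decay, which is what lets the finite-state memory be used more selectively.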

KDA is not just a research prototype. It is the core attention mechanism in Kimi Linear, Moonshot AI's open-source hybrid model with 48B total parameters and 3B active parameters. Kimi Linear uses a 3:1 KDA-to-MLA (Multi-Head Latent Attention) ratio – three KDA layers for every global attention layer – reducing KV cache usage by up to 75% during long-sequence generation while achieving up to 6× higher decoding throughput at a 1M context length compared to full attention. FlashKDA is the production-grade CUDA kernel that makes this architecture fast during prefill.
Concretely, the KDA forward pass receives queries (q), keys (k), values (v), a pre-activation gate (g), and beta logits (beta), along with a scale factor, an output tensor (out), and gate parameters: A_log (a per-head log gate parameter), dt_bias (a gate bias), and lower_bound (a gate lower bound ranging from -5.0 to 0). The sigmoid activation on beta is applied inside the kernel. The mechanism also supports optional initial and final recurrent states – useful for multi-turn inference where you want to carry state across requests.
The recurrent form means the model can handle long sequences efficiently during generation. But efficient prefill for these architectures still requires a highly optimized GPU kernel – which is exactly what FlashKDA provides.
Under the hood: CUTLASS on Hopper
FlashKDA is built on top of CUTLASS, NVIDIA's open-source library of CUDA C++ template abstractions for high-performance linear algebra and custom kernel development. CUTLASS lets developers write kernels that take full advantage of NVIDIA's Tensor Core architecture – the same foundation used by libraries like FlashAttention-3.
The library targets SM90 and above – that is, the NVIDIA Hopper architecture (H100, H20) and newer. The minimum requirements are CUDA 12.9 and PyTorch 2.4. The codebase is mostly CUDA (56.4%), with Python bindings (36.2%) and C++ glue code (6.7%).
The core API is flash_kda.fwd, which takes the following inputs:
- q, k, v, g: bf16 tensors of shape [B, T, H, K] or [B, T, H, V] (where g is the pre-activation gate)
- beta: beta logits, bf16, shape [B, T, H] (sigmoid applied internally)
- scale: fp32 numerical scaling factor
- out: output tensor, bf16, shape [B, T, H, V]
- A_log, dt_bias, lower_bound: fp32 gate parameters
- initial_state, final_state: optional recurrent states in bf16 or fp32
- cu_seqlens: optional int64 cumulative sequence lengths for variable-length batches
One current limitation: the kernel requires K = V = 128 for the head dimension.
Support for variable-length batching via cu_seqlens is particularly notable for production use. In a real inference service, requests in a batch rarely share the same sequence length. The ability to pack multiple sequences of different lengths into a single kernel call is a basic requirement for high-throughput serving systems.
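As a concrete illustration, the sketch below allocates the documented inputs and calls the kernel on a packed variable-length batch. The keyword names, argument order, and the per-head shapes of the gate parameters are assumptions inferred from the input list above, not a verified signature – check the FlashKDA repository before relying on it.

```python
import torch
import flash_kda  # assumes the FlashKDA Python bindings are installed

B, T, H, K, V = 1, 8192, 96, 128, 128            # K = V = 128 is currently required
dev = "cuda"

q    = torch.randn(B, T, H, K, dtype=torch.bfloat16, device=dev)
k    = torch.randn(B, T, H, K, dtype=torch.bfloat16, device=dev)
v    = torch.randn(B, T, H, V, dtype=torch.bfloat16, device=dev)
g    = torch.randn(B, T, H, K, dtype=torch.bfloat16, device=dev)  # pre-activation gate
beta = torch.randn(B, T, H,    dtype=torch.bfloat16, device=dev)  # logits; sigmoid applied in-kernel
out  = torch.empty(B, T, H, V, dtype=torch.bfloat16, device=dev)  # written by the kernel

# Gate parameters (per-head shapes are an assumption).
A_log       = torch.zeros(H, dtype=torch.float32, device=dev)
dt_bias     = torch.zeros(H, dtype=torch.float32, device=dev)
lower_bound = torch.full((H,), -5.0, dtype=torch.float32, device=dev)  # in [-5.0, 0]

# Variable-length packing: three sequences (1300, 547, and 6345 tokens) packed
# into a single row of length T, described by cumulative offsets.
cu_seqlens = torch.tensor([0, 1300, 1847, 8192], dtype=torch.int64, device=dev)

# NOTE: the call below is a sketch; argument names and order are assumptions
# based on the documented inputs, not the verified flash_kda.fwd signature.
flash_kda.fwd(q, k, v, g, beta, K ** -0.5, out,
              A_log, dt_bias, lower_bound,
              initial_state=None, final_state=None,
              cu_seqlens=cu_seqlens)
```

For multi-turn inference, initial_state can be fed with the recurrent state produced as final_state by a previous call instead of starting from zero.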
Benchmark results: 1.72× to 2.22× on H20
The benchmarks (as of April 20, 2026) compare flash_kda against fla_chunk_kda (the existing flash-linear-attention implementation) at sequence length T=8192, head dimension D=128, and two head-count configurations: H=96 and H=64. Each benchmark was run with 30 warmup iterations, 200 timed iterations, and 5 repeats.
For H=96:

| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed length | 2.6219 | 4.5052 | 1.72× |
| varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 2.3420 | 4.5717 | 1.95× |
| varlen, seq_lens=1024 × 8 | 2.0100 | 4.4668 | 2.22× |
For H=64:

| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed length | 1.6199 | 2.9587 | 1.83× |
| varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 1.7027 | 3.0595 | 1.80× |
| varlen, seq_lens=1024 × 8 | 1.3930 | 3.0412 | 2.18× |
The peak speedup of 2.22× appears in the uniform variable-length case (seq_lens=1024 × 8: eight sequences of length 1024 summing to T = 8192). The fixed-length case sets the floor of the range at 1.72×. Across both head-count configurations and all three sequencing scenarios, FlashKDA consistently outperforms the flash-linear-attention baseline by a clear margin.
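For readers who want to reproduce this kind of measurement on their own hardware, the sketch below shows a generic CUDA-event timing harness following the same 30-warmup / 200-iteration / 5-repeat pattern. It is not the repository's benchmark script; run_kernel is a placeholder for whichever kernel call you are measuring.

```python
import torch

def time_kernel(run_kernel, warmup=30, iters=200):
    """Return mean milliseconds per iteration, measured with CUDA events."""
    for _ in range(warmup):                     # warm up caches, JIT, GPU clocks
        run_kernel()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        run_kernel()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Repeat the measurement 5 times and keep the best, mirroring the setup above.
# times = [time_kernel(lambda: flash_kda.fwd(...)) for _ in range(5)]
# print(f"{min(times):.4f} ms")
```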
Integration with flash-linear-attention
One of the most practical aspects of FlashKDA is the integration story. Once installed, FlashKDA is automatically dispatched from flash-linear-attention's chunk_kda, which means existing codebases that use flash-linear-attention need no manual wiring to take advantage of the faster kernel. The integration is tracked in flash-linear-attention PR 852.
Installation is straightforward:
```bash
git clone https://github.com/MoonshotAI/FlashKDA.git flash-kda
cd flash-kda
git submodule update --init --recursive
pip install -v .
```
The correctness test suite (tests/test_fwd.py) performs exact-match checking against a PyTorch reference implementation and cross-validation against flash-linear-attention. This gives developers a reliable baseline for verifying kernel behavior before deploying to production.
Key takeaways
- FlashKDA is Moonshot AI's open-source CUDA kernel built on top of CUTLASS for Kimi Delta Attention (KDA), delivering 1.72×-2.22× prefill speedups over the flash-linear-attention baseline on NVIDIA H20 GPUs.
- KDA extends Gated DeltaNet with fine-grained, channel-wise gating. It is the core attention mechanism behind Kimi Linear, a 48B-total / 3B-active-parameter hybrid model that reduces KV cache usage by up to 75% and achieves up to 6× higher decoding throughput at a 1M context length.
- The kernel targets SM90+ devices (NVIDIA Hopper – H100, H20 and newer), requires CUDA 12.9+ and PyTorch 2.4+, and currently supports a fixed head dimension of K = V = 128.
- Variable-length batching is supported natively via the cu_seqlens parameter, allowing multiple sequences of different lengths to be packed into a single kernel call – an important feature for high-throughput inference serving.
- Once installed, FlashKDA is automatically dispatched from flash-linear-attention's chunk_kda, making it a direct performance upgrade for any existing codebase already using the flash-linear-attention library – no code changes required.
Check out the GitHub repo for full details.