Inference efficiency has quietly become one of the most critical bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor expand from developer tools into general-purpose software development infrastructure, the inference engines that serve their requests are under increasing pressure. Researchers at the LightSeek Foundation have launched TokenSpeed, an open-source LLM inference engine released under the MIT license and designed specifically for the requirements of agentic workloads. The engine is currently in preview status.
Why is agentic inference a different problem?
To understand why TokenSpeed's design choices matter, it helps to understand what makes agentic inference hard. Coding agents do not behave like a typical chatbot workload. Contexts routinely exceed 50,000 tokens, and conversations often span dozens of turns. This puts simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users a single GPU can serve, and per-user TPS (tokens per second), which determines whether an individual user perceives the system as responsive. Most common benchmarks do not fully capture this behavior.
TokenSpeed is designed to maximize both: the goal is to increase TPM per GPU while holding a per-user TPS floor, typically 70 TPS, sometimes 200 TPS or higher.
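A back-of-the-envelope sketch makes the tension between these two metrics concrete. The numbers below are hypothetical, chosen only to illustrate why a per-user TPS floor constrains how many sessions can share one GPU:

```python
# Hypothetical illustration: the same per-GPU TPM budget yields very
# different per-user TPS depending on how many sessions share the GPU.
def per_user_tps(gpu_tpm: float, concurrent_users: int) -> float:
    """Average tokens/second each user sees on one GPU."""
    return gpu_tpm / 60.0 / concurrent_users

for users in (8, 16, 32):
    print(f"{users:>2} users -> {per_user_tps(120_000, users):6.1f} TPS/user")
# A GPU sustaining 120,000 TPM serves 16 users at 125 TPS each, but at
# 32 users each session falls to 62.5 TPS, below a 70 TPS/user floor.
```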
Architecture: five interlocking subsystems
The TokenSpeed architecture is built around five design pillars: a compiler-assisted modeling mechanism for parallelism, high-performance scheduling, safe KV cache resource management and reuse, a pluggable multi-backend kernel system supporting heterogeneous accelerators, and SMG integration for a low-overhead request entry point on the CPU side.
The modeling layer uses a native SPMD (single program, multiple data) approach. SPMD is a parallel execution model in which all processes run the same program on different subsets of the data, a common pattern in distributed deep learning. Instead of requiring developers to hand-write inter-process communication logic, TokenSpeed lets them attach input/output layout annotations at module boundaries; a lightweight static compiler then generates the required collective operations automatically at model build time.
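A minimal sketch of the idea, not TokenSpeed's actual API: the module declares how its input and output are laid out across ranks, and a build-time pass would derive from those annotations that an all-reduce is needed at this boundary. Here the collective is written inline to show what the compiler would generate:

```python
# Sketch only: layout annotations at a module boundary imply a collective.
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    # Annotations: input arrives sharded on the feature axis, output must
    # be replicated, so a sum all-reduce is required at this boundary.
    input_layout = "sharded(dim=-1)"
    output_layout = "replicated"

    def __init__(self, in_features_per_rank: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features_per_rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        partial = x @ self.weight.t()  # each rank holds a partial sum
        # In TokenSpeed this collective would be generated by the static
        # compiler from the layout annotations, not written by hand.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```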
The scheduling layer creates a structural split between the control plane and the execution plane. The control plane is a C++ finite state machine that works with the type system to enforce safe resource management, including KV cache state transitions and usage, at compile time rather than at runtime. Request lifecycles, KV cache resources, and batch scheduling are represented by explicit FSM transitions and ownership semantics, so correctness is enforced by a verifiable state machine rather than by convention. Because these constraints are encoded in the type system instead of left to runtime checks, bugs in KV cache management, one of the most error-prone areas of LLM serving, are caught earlier. The execution plane stays in Python to preserve development velocity, allowing faster feature iteration and reducing cognitive load for developers.
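The following is a minimal Python sketch of the typestate pattern this describes (TokenSpeed's actual control plane is C++, where ownership and move semantics make the guarantees genuinely compile-time); all class and method names here are invented for illustration:

```python
# Sketch: each KV cache state is a distinct type, and the only way to
# advance is through transitions that return the next state's type, so a
# static checker rejects use-after-release (KvFreed has no such methods).
from dataclasses import dataclass

@dataclass(frozen=True)
class KvAllocated:
    block_ids: tuple[int, ...]

    def bind(self) -> "KvInUse":
        return KvInUse(self.block_ids)  # transition: allocated -> in-use

@dataclass(frozen=True)
class KvInUse:
    block_ids: tuple[int, ...]

    def release(self) -> "KvFreed":
        return KvFreed()  # transition: in-use -> freed; blocks are gone

@dataclass(frozen=True)
class KvFreed:
    pass  # terminal state: no method re-exposes the freed blocks

kv = KvAllocated(block_ids=(3, 7, 9)).bind().release()
# kv.bind()  # rejected by a type checker: KvFreed has no .bind()
```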
The kernel layer treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registration and selection model, and an extensible plug-in mechanism to support heterogeneous accelerators, meaning it is not restricted to NVIDIA hardware. The team has also built one of the fastest MLA (Multi-head Latent Attention) kernels for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen is folded into num_heads to make full use of Tensor Cores, since num_heads is small in these use cases. The prefill kernel has a finely tuned softmax implementation. Notably, the TokenSpeed MLA kernel has been adopted by vLLM.
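To make the registration-and-selection idea concrete, here is a hypothetical sketch of such a registry; the names (`register_kernel`, `select_kernel`, the op and backend strings) are invented for illustration and are not TokenSpeed's public API:

```python
# Sketch of a pluggable kernel registry with centralized selection.
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register_kernel(op: str, backend: str):
    """Register an implementation of `op` for a hardware backend."""
    def decorator(fn: Callable) -> Callable:
        _KERNELS[(op, backend)] = fn
        return fn
    return decorator

def select_kernel(op: str, backend: str) -> Callable:
    """Central selection point: pick the backend-specific kernel."""
    return _KERNELS[(op, backend)]

@register_kernel("mla_decode", backend="cuda")
def mla_decode_cuda(*args, **kwargs):
    ...  # would dispatch to the CUDA MLA decode kernel

@register_kernel("mla_decode", backend="rocm")
def mla_decode_rocm(*args, **kwargs):
    ...  # a plug-in for a non-NVIDIA accelerator registers here
```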


Finally, TokenSpeed integrates SMG, a native PyTorch component, as a low-CPU-overhead request entry point, reducing the handover cost between CPU-side coordination and GPU execution.
Benchmark results against TensorRT-LLM on NVIDIA B200
It should be noted up front that these benchmarks cover single-instance (non-disaggregated) deployment only. PD (prefill/decode) disaggregation support is still being finalized and may be covered in a dedicated follow-up from the TokenSpeed team.
In collaboration with the EvalScope team, TokenSpeed was evaluated on SWE-smith traces, which closely mirror the traffic of a production coding agent, and was benchmarked against TensorRT-LLM, the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.
For coding agents running at >70 TPS/user, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: ~9% faster in the lowest-latency case (batch size 1) and ~11% higher throughput around 100 TPS/user. TP4 here stands for tensor parallelism across 4 GPUs, a technique that partitions model weights across multiple devices to reduce per-device memory pressure and latency, as sketched below.
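A minimal, self-contained illustration of what TP4 means for a single linear layer (shapes are invented, and a real deployment would run one shard per GPU with an all-gather instead of the in-process concatenation shown here):

```python
# Sketch: split a linear layer's weight across 4 "ranks" so each device
# stores and multiplies only a quarter of the matrix.
import torch

tp_degree = 4
weight = torch.randn(1024, 1024)           # full weight, for reference only
shards = weight.chunk(tp_degree, dim=0)    # one [256, 1024] shard per rank

x = torch.randn(1, 1024)
partials = [x @ w.t() for w in shards]       # each rank computes its slice
full = torch.cat(partials, dim=-1)           # stands in for an all-gather
assert torch.allclose(full, x @ weight.t(), atol=1e-3)
```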
In the MLA kernel, the gains are most pronounced at the decode stage. The decode kernel collapses the query sequence axis into the head axis to better fill BMM1's M tiles and improve Tensor Core utilization. The prefill kernel uses internal NVIDIA knobs to fine-tune the softmax implementation, outperforming TensorRT-LLM's MLA across all five prefill workloads typical of coding agents with a long-prefix KV cache. Combined with other improvements, this roughly halves latency compared to TensorRT-LLM in typical decoding workloads with speculative decoding at batch sizes of 4, 8, and 16 with a long KV prefix cache.
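The axis-collapsing trick can be shown with plain tensor reshapes. This is a simplified sketch with invented shapes, not the kernel itself; it relies on the fact that MLA's latent KV cache is shared across heads, which is what makes folding the query axis into the head axis valid:

```python
# Sketch: during speculative decoding, the few draft-token queries (q_len)
# are folded into the head axis so the score GEMM (BMM1) sees a larger M
# dimension and fills Tensor Core tiles better.
import torch

batch, q_len, num_heads, head_dim, kv_len = 2, 4, 16, 128, 8192
q = torch.randn(batch, q_len, num_heads, head_dim)
k = torch.randn(batch, kv_len, head_dim)  # latent KV, shared across heads

# Fold q_len into the head axis: M grows from num_heads to q_len*num_heads.
q_folded = q.reshape(batch, q_len * num_heads, head_dim)

# One [q_len*num_heads, head_dim] x [head_dim, kv_len] GEMM per batch
# element, instead of q_len small GEMMs with M = num_heads.
scores = torch.einsum("bmd,bnd->bmn", q_folded, k)
print(scores.shape)  # torch.Size([2, 64, 8192])
```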
Key takeaways
- TokenSpeed is a new MIT-licensed, open-source LLM inference engine from the LightSeek Foundation, designed specifically for agentic workloads. (Currently available in preview.)
- Its scheduler uses a C++ finite state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for ease of use.
- On NVIDIA B200, TokenSpeed outperforms TensorRT-LLM by ~9% in minimum latency and ~11% in throughput at 100 TPS/user on Kimi K2.5.
- The TokenSpeed MLA kernel, which has already been adopted by vLLM, nearly halves decoding latency compared to TensorRT-LLM in speculative decoding workloads.
Check out the technical details and the GitHub repo.