Zyphra AI has released ZAYA1-8B, a small Mixture-of-Experts (MoE) language model with 760 million active parameters and 8.4 billion total parameters. The model was trained end to end on AMD hardware, outperforms open-weight models many times its size on mathematics and coding benchmarks, and is available under the Apache 2.0 license on Hugging Face and as a serverless endpoint on Zyphra Cloud.
With fewer than a billion active parameters, ZAYA1-8B is competitive with frontier reasoning models such as DeepSeek-R1-0528, Gemini-2.5-Pro, and Claude 4.5 Sonnet on challenging mathematical reasoning tasks. Using a new test-time compute method called Markovian RSA, it outperforms Claude 4.5 Sonnet and GPT-5-High on HMMT'25 (89.6 vs. 88.3) and approaches frontier open-weight models such as DeepSeek-V3.2 on math benchmarks.
What is a Mixture-of-Experts model, and why do active parameters matter?
The distinction between "active" and "total" parameters is important. In a standard dense model, every parameter is used for every input token. In a Mixture-of-Experts model, only a subset of the network's parameters, the "experts", is activated for each token at inference time. ZAYA1-8B has 8.4 billion total parameters but only 760 million active per forward pass. This dramatically reduces inference compute and memory-bandwidth requirements while retaining much of the representational capacity of a far larger model.
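To make the active-versus-total distinction concrete, here is a minimal top-k routing sketch in PyTorch. The dimensions, expert count, and top-k value are illustrative assumptions, not ZAYA1-8B's actual configuration; the point is only that each token touches the router plus its selected experts, while the remaining expert weights sit idle.

```python
# Minimal sketch of Mixture-of-Experts routing (illustrative only; the
# dimensions, expert count, and top-k below are assumptions, not ZAYA1-8B's).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: [tokens, d_model]
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # only the top-k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

moe = TinyMoE()
expert_total = sum(p.numel() for p in moe.experts.parameters())
expert_active = expert_total * moe.top_k // len(moe.experts)   # only top_k of n_experts fire
print(f"expert parameters: total={expert_total:,}, active per token ~ {expert_active:,}")
print(moe(torch.randn(8, 256)).shape)
```

In this toy configuration only 2 of 16 experts run per token, so the active expert parameter count is roughly one-eighth of the total, which mirrors (at much smaller scale) the 760M-active / 8.4B-total ratio of ZAYA1-8B.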
As a result, ZAYA1-8B can be deployed on-device for local LLM applications, run efficiently under test-time compute scaling, and serve requests with lower latency than dense models of comparable benchmark performance.


Architecture: MoE++ and three key innovations
ZAYA1-8B is built on Zyphra's MoE++ architecture, which introduces three specific changes to standard MoE designs. Together, these components form the efficiency foundation of ZAYA1-8B and reflect Zyphra's stated design goal: maximizing the intelligence extracted per parameter and per FLOP.
- Compressed Convolutional Attention (CCA), a sequence-mixing mechanism developed by Zyphra that operates in a compressed latent space and achieves 8× KV-cache compression versus standard attention. The KV cache is the memory used during inference to store intermediate attention states; an 8× reduction directly cuts inference memory requirements and allows longer active contexts within the same hardware budget (a rough sizing example follows this list).
- An MLP-based router with PID-controller bias balancing. Standard MoE routers typically use a linear projection to decide which experts process a given token. Zyphra replaces that with an MLP-based router and adds PID-controller-style bias balancing to improve routing stability, effectively preventing load imbalance between experts, a known failure mode in MoE training (a minimal sketch of the router appears after this list).
- Learned residual scaling, which controls how the residual stream grows with depth at negligible parameter and FLOP cost. In deep networks, residual-stream magnitudes can grow unstably from layer to layer; the learned scaling addresses this without adding meaningful overhead (it is included in the sketch below).
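The practical effect of the 8× KV-cache compression is easiest to see as back-of-the-envelope arithmetic. The layer count, head sizes, precision, and sequence length below are illustrative assumptions, not ZAYA1-8B's published configuration.

```python
# Back-of-the-envelope KV-cache sizing (all values are illustrative
# assumptions, not ZAYA1-8B's published architecture).
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                    # bf16
seq_len, batch = 32_768, 1

# Standard attention caches keys + values for every layer and position.
kv_cache = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch
print(f"standard KV cache  : {kv_cache / 2**30:.2f} GiB")

# CCA keeps attention state in a compressed latent space (reported 8x smaller).
print(f"with 8x compression: {kv_cache / 8 / 2**30:.2f} GiB")
```

Under these assumptions a 32k-token context drops from roughly 4 GiB of cache to roughly 0.5 GiB, which is what makes longer contexts fit in the same memory envelope.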
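Below is a minimal sketch of the other two ideas: an MLP router whose per-expert routing bias is nudged by a PID-style controller toward uniform expert load, and a learned per-layer scale on the residual branch. The shapes, PID gains, and update rule are assumptions made for illustration; Zyphra's exact formulation is described in their technical report.

```python
# Sketch of an MLP-based router with PID-style bias balancing, plus learned
# residual scaling. Gains and update rules are illustrative assumptions.
import torch
import torch.nn as nn

class MLPRouterWithPID(nn.Module):
    def __init__(self, d_model=256, n_experts=16, top_k=2, kp=0.05, ki=0.01, kd=0.01):
        super().__init__()
        self.top_k = top_k
        # MLP router instead of a single linear projection.
        self.net = nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                                 nn.Linear(d_model, n_experts))
        self.register_buffer("bias", torch.zeros(n_experts))      # routing bias, not trained by SGD
        self.register_buffer("err_sum", torch.zeros(n_experts))   # integral term
        self.register_buffer("err_prev", torch.zeros(n_experts))  # for derivative term
        self.kp, self.ki, self.kd = kp, ki, kd

    def forward(self, x):                                          # x: [tokens, d_model]
        logits = self.net(x) + self.bias                           # bias steers expert selection
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        # PID-style update: push each expert's observed load toward the uniform target.
        with torch.no_grad():
            load = torch.zeros_like(self.bias)
            load.scatter_add_(0, idx.flatten(), torch.ones(idx.numel()))
            err = load / load.sum() - 1.0 / load.numel()           # positive = overloaded expert
            self.err_sum += err
            self.bias -= self.kp * err + self.ki * self.err_sum + self.kd * (err - self.err_prev)
            self.err_prev = err
        return weights, idx

class ScaledResidualBlock(nn.Module):
    """Learned residual scaling: y = x + alpha * f(x), one learned scalar per layer."""
    def __init__(self, d_model=256):
        super().__init__()
        self.f = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.SiLU())
        self.alpha = nn.Parameter(torch.ones(1))                   # negligible extra cost

    def forward(self, x):
        return x + self.alpha * self.f(x)

router = MLPRouterWithPID()
w, i = router(torch.randn(64, 256))
print(w.shape, i.shape, router.bias[:4])
print(ScaledResidualBlock()(torch.randn(64, 256)).shape)
```

The controller only moves a small bias vector, so it adds essentially no compute; the learned residual scale is a single scalar per layer, which is why both techniques are described as near-free.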
Training Infrastructure: Completely built on AMD
ZAYA1-8B was pre-trained, mid-trained, and post-trained as an MoE model on the AMD Instinct MI300 stack. The full training pipeline ran on a cluster of 1,024 AMD Instinct MI300X GPUs connected via AMD Pensando Pollara networking, in a custom training cluster built in collaboration with IBM.
Reasoning-first pre-training and a five-stage post-training pipeline
ZAYA1-8B's performance reflects innovations across the entire stack: Zyphra's MoE++ architecture, reasoning-first pre-training, a cascaded RL methodology, and the new Markovian RSA test-time compute method.
Zyphra's post-training pipeline consists of five consecutive stages:
- The first is a standard SFT stage covering basic chat, instruction following, code, mathematics, and test-time compute (TTC) capabilities.
- The second is an initial reasoning stage combining mathematics, logic, and puzzle-solving tasks, with TTC training so the model learns to aggregate its own candidate solutions.
- The third is a large RLVE-Gym stage with dynamically adjusted puzzle difficulty to train core reasoning skills.
- The fourth is an extensive math and code RL stage to deepen performance in these two core areas.
- Finally, a relatively lightweight RLHF/RLAIF stage improves chat behavior, instruction following, and writing style.
The Zyphra research team observed the largest capability gains in mathematics and coding during the RL stages, with smaller but meaningful gains in multiple-choice knowledge benchmarks (MMLU and GPQA-Diamond) and non-verifiable tasks such as creative writing.
Markovian RSA: A new test-time compute method
The most important technical contribution beyond the model itself is Markovian RSA, a test-time compute (TTC) scheme that combines two prior ideas in a new way.
The first is Recursive Self-Aggregation (RSA), which generates multiple reasoning traces in parallel and aggregates them recursively across iterations. The second is the Markovian Thinker idea, which performs reasoning in fixed-length chunks; only the tail of the previous chunk is passed to the next one, keeping the context window bounded no matter how long the model reasons.
Markovian RSA combines the two as follows: in each round, multiple traces are generated in parallel; fixed-length tails are extracted from each trace; new aggregation prompts are built by subsampling the pool of candidates; and these aggregated prompts seed the next round of parallel generations. The result has favorable inference properties: prompt construction is parallelizable, and the Markovian chunking strategy ensures that reasoning chains never exceed the fixed context-window size.
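The control flow is easier to see in code than in prose. In the sketch below, `generate`, the prompt templates, and all budget numbers are placeholders assumed for illustration; Zyphra's exact prompting, sampling, and aggregation settings are described in the paper.

```python
# Schematic Markovian RSA loop. `generate`, the prompt templates, and the
# budgets are illustrative placeholders, not Zyphra's exact settings.
import random

def generate(prompt: str, max_tokens: int) -> str:
    """Placeholder for a call to the model (e.g. an inference endpoint)."""
    raise NotImplementedError

def markovian_rsa(problem: str, n_traces: int = 8, n_rounds: int = 4,
                  chunk_tokens: int = 4096, tail_chars: int = 2000,
                  agg_size: int = 3) -> list[str]:
    prompts = [problem] * n_traces
    traces: list[str] = []
    for _ in range(n_rounds):
        # 1) Generate multiple reasoning traces in parallel (sequential here for clarity).
        traces = [generate(p, max_tokens=chunk_tokens) for p in prompts]
        # 2) Keep only a fixed-length tail of each trace, so the context stays bounded
        #    no matter how long the overall reasoning runs (the "Markovian" part).
        tails = [t[-tail_chars:] for t in traces]
        # 3) Build aggregation prompts by subsampling the candidate pool (the "RSA" part);
        #    these seed the next round of parallel generations.
        prompts = [
            problem + "\n\nCandidate partial solutions:\n"
            + "\n---\n".join(random.sample(tails, agg_size))
            + "\n\nContinue reasoning from the most promising candidates."
            for _ in range(n_traces)
        ]
    # Final traces feed answer extraction (e.g. a vote over the answers they contain).
    return traces
```

Because each round only ever sees the fixed-size tails rather than full histories, the per-call context stays constant while the total token budget grows with the number of rounds and parallel traces.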
A key finding is that co-designing the post-training methodology and the test-time compute scheme is essential. ZAYA1-8B is trained to understand and respond to Markovian RSA aggregation and chunking prompts starting at SFT and continuing through RL. When Zyphra applied the same method to Qwen3-4B-Thinking-2507 without this co-design, the performance lift was much smaller, suggesting that the TTC scheme and post-training must be developed together to realize the gains.
With Markovian RSA and a very large test-time compute budget of 5.5 million tokens per problem, ZAYA1-8B outperforms DeepSeek-V3.2 and GPT-OSS-High on the challenging APEX Shortlist math benchmark.
Benchmark results
Compared with similarly sized models, ZAYA1-8B scores 89.1 on AIME'26, 71.6 on HMMT Feb '26, 59.3 on IMO-AnswerBench, 32.2 on APEX Shortlist, 65.8 on LiveCodeBench-v6, and 71.0 on GPQA-Diamond, outperforming Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it in every mathematics and coding category.
Against larger open-weight models, ZAYA1-8B, with 760M active parameters, beats Mistral-Small-4-119B (6B active, 119B total) specifically on mathematics and coding benchmarks, scoring 89.1 vs. 86.4 on AIME'26, 71.6 vs. 70.6 on HMMT Feb '26, and 63.8 vs. 57.9 on LiveCodeBench-v6. Mistral-Small-4-119B keeps the edge on GPQA-Diamond (77.2 vs. 71.0) and MMLU-Pro (81.6 vs. 74.2), where breadth of knowledge matters more than depth of mathematical reasoning.




Key takeaways
- ZAYA1-8B delivers frontier-level math and coding performance with only 760 million active parameters, outperforming open-weight models many times its size.
- Its MoE++ architecture introduces three innovations (CCA with 8× KV-cache compression, an MLP-based router with PID-controller bias balancing, and learned residual scaling) to maximize the intelligence extracted per parameter.
- A new test-time compute method called Markovian RSA, combining recursive self-aggregation with Markovian chunking, pushes ZAYA1-8B past DeepSeek-V3.2 and GPT-OSS-High on the APEX Shortlist at a budget of 5.5 million tokens per problem.
- ZAYA1-8B is the first MoE model to be pre-trained, mid-trained, and post-trained end to end on AMD Instinct MI300 hardware, on a 1,024-GPU MI300X cluster built with IBM.
- Released under Apache 2.0, it is available on Hugging Face and Zyphra Cloud.