Cohere releases Command A+: 218B MoE sparse model for proxy workflow running on a few H100 GPUs

Cohere has just released Command A+, as an open source model targeting enterprise proxy workflows. Available under the Apache 2.0 license, Command A+ is a mixture of experts (MoE) model designed for high-performance proxy tasks with minimal computational overhead. The model is optimized for inference, agent workflow, RAG, and multilingual and multimedia document processing. It unifies capabilities from four previous models – Command A, Command A, Command A Vision, and Command A Translation – into one scalable model.

Build

The A+ command is an expert sparse converter for the decoder only with a total of 218B parameters and 25B active parameters. It has 128 experts, 8 of which are active for each token, and one common expert is applied to all tokens. In the MoE model, each code is routed through only a subset of specialized subnetworks rather than the full set of parameters, keeping active computation at a scale of 25B parameters at inference time.

The attention layers overlap the sliding window attention layers with rotational embeddings and the global attention layers without positional embeddings in a ratio of 3:1. The sparse MoE layer is trained in a completely non-falling manner and uses a vector for token selection, with a normalized sigmoid over the top-k expert records for each token.

The input methods are text, image and tool usage. The output methods are text, logic, and tool use. The model supports an input context length of 128 KB and a maximum generation length of 64 KB.

Hardware and quantization requirements

Three quantization variants are available with minimum GPU requirements: BF16 (16-bit) requires 4 x B200 or 8 x H100 GPUs; FP8 (8-bit) requires 2x B200 or 4x H100 GPUs; W4A4 (4-bit) runs on a single B200 or 2×H100 GPU. All three quantities show negligible differences in standard quality. Cohere recommends W4A4 for most deployments.

W4A4 Quantization methodology

Cohere applies NVFP4 W4A4 quantization, 4-bit weights, and activations with two-level scaling, to MOE experts only. The attention trajectory, including Q/K/V/O projections, KV cache, and attention computation, is maintained with full precision.

To fill remaining quality gaps, Cohere uses a post-training quantization distillation (QAD) technique: a student quantile model is trained to fit the distribution of the teacher output at full accuracy, using pseudoquantization factors in the forward pass and direct estimators in the backward pass.

Performance Models vs. Advance Orders a

At τ²-Bench Telecom, results improved from 37% to 85% compared to command A heuristics, and Terminal-Bench static proxy encryption performance reached 25% from 3%.

In North’s internal evaluations, all scores were scored using LLM-as-a-judge techniques, and the accuracy of answering agent questions improved by 20% compared to logical reasoning. Agentic QA measures how well a model answers an organization’s questions using MCP-connected cloud file systems. The quality of spreadsheet analysis improved by 32%, and the quality of memory utilization – measuring how well an agent leverages information from a previous session to answer questions in a later session – scored 54% with the A+ command compared to 39% with the A Reasoning command.

Command A+ is Cohere’s first multi-modal reasoning model. I achieved 63% in MMMU Pro and 75.1% in MMMU, compared to 65.3% in Command A Vision on the latter. MathVista scores improved from 73.5% to 80.6%, and CharXiv Logic improved from 46.9% to 52.7%.

Command A+ expands multilingual coverage from 23 to 48 languages, with gains in machine translation and multilingual thinking.

Command A+ received a score of 37 on the AI Index, beating other leading open models.

Speed and latency

At the same levels of quantization and concurrency, Command A+ provides up to 63% higher output tokens per second (TOPS) and reduces the Time to First Token Appearance (TTFT) by up to 17% compared to Command A inference. W4A4 quantization contributes to an additional 47% increase in speed and a 13% reduction in latency. Speculative decoding, specifically optimized for the MoE architecture, provides an additional 1.5-1.6× inference speedup for both text and multimedia inputs.

Token

Command A+ is the first model to use the latest Cohere token, reducing the number of tokens required to generate the same response. Coding efficiency improved by 20% for Arabic, 16% for Korean, and 18% for Japanese.

Never

The model is supported by vLLM and Transformers. Tool usage is handled by Transformers chat templates using a JSON schema for tool descriptions. When heuristics is enabled, the model creates between-reasoning traces andlt;|START_THINKING|andgt; and andlt;|END_THINKING|andgt; marks before producing the final answer.

W4A4 variant requires vLLM ≥0.21.0 and cohere_melodyandgt;=0.9.0 For accurate response analysis. Cohere recommends the following sampling parameters: temperature=0.9, top_p=0.95and repetition_penalty=1.04.

Key takeaways

The A+ command has 218 billion total / 25 billion active parameters in the Sparse MoE architecture, which was released under Apache 2.0.
W4A4 applies NVFP4 quantification to MOE experts only with subsequent QAD training, running on 2 x H100s.
τ²-Bench Telecom improved from 37% to 85%; Peripheral seat power from 3% to 25% versus logical reasoning.
TOPS increased by up to 63% and TTFT decreased by up to 17% versus A-order heuristics when matching quantization.
Command A+ is Cohere’s first multimodal inference model, expanding language support from 23 to 48 languages.

verify Typical weights and Technical details. Also, feel free to follow us on twitter Don’t forget to join us 150k+ mil SubReddit And subscribe to Our newsletter. I am waiting! Are you on telegram? Now you can join us on Telegram too.

Do you need to partner with us to promote your GitHub Repo page, face hug page, product release, webinar, etc.? Contact us

Michel Sutter is a data science specialist and holds a Master’s degree in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michelle excels at transforming complex data sets into actionable insights.

Cohere releases Command A+: 218B MoE sparse model for proxy workflow running on a few H100 GPUs

Build

Hardware and quantization requirements

W4A4 Quantization methodology

Performance Models vs. Advance Orders a

Speed and latency

Token

Never

Key takeaways

Like this:

Related

Like this:

Like this:

Like this:

Leave a ReplyCancel reply

ZAILLUSION

News

Legal

Build

Hardware and quantization requirements

W4A4 Quantization methodology

Performance Models vs. Advance Orders a

Speed ​​and latency

Token

Never

Key takeaways

Share this:

Like this:

Related

Related Posts

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Leave a ReplyCancel reply

Speed and latency