IBM has released two new open speech recognition models, Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR, and they make a compelling case for what a 2B-parameter speech model can do. Both are available on Hugging Face under the Apache 2.0 license.
The duo targets a specific problem that enterprise AI teams know well: most production-level automatic speech recognition (ASR) systems either require massive compute or sacrifice accuracy to stay within budget. IBM’s bet is that careful architectural decisions can let you have it both ways.
What these models actually do
Granite Speech 4.1 2B is a compact but capable speech language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) covering English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive counterpart, Granite Speech 4.1 2B-NAR, focuses exclusively on ASR, specifically targeting latency-sensitive deployments, and supports English, French, German, Spanish, and Portuguese, but not Japanese. This is a useful distinction: teams that need Japanese transcription or any speech translation capability should reach for the standard autoregressive model.

IBM has also quietly released a third variant alongside these two: Granite Speech 4.1 2B-Plus, which adds speaker-attributed ASR and word-level timestamps for applications that need to know exactly who said what, and when.
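For teams that want to kick the tires, here is a minimal transcription sketch. The model ID, the <|audio|> prompt placeholder, and the AutoModelForSpeechSeq2Seq class are assumptions carried over from earlier Granite Speech model cards, so check the 4.1 cards for the exact usage.

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm-granite/granite-speech-4.1-2b"  # hypothetical ID, check the model card
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

# Granite Speech expects 16 kHz mono audio.
wav, sr = torchaudio.load("meeting.wav")
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0, keepdim=True)

# Earlier Granite Speech cards interleave an <|audio|> placeholder with the instruction.
chat = [{"role": "user", "content": "<|audio|>can you transcribe the speech into a written format?"}]
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, wav, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=256)
new_tokens = outputs[0, inputs["input_ids"].shape[-1]:]  # strip the prompt tokens
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```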
Word Error Rate (WER) is the primary metric for transcription quality; lower is better. A WER of 5% means roughly 5 out of every 100 words are wrong. On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B posts an average WER of 5.33. Digging into the benchmark details: on LibriSpeech Clean the model achieves a WER of 1.33, and 2.5 on LibriSpeech Other.
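To make the metric concrete, here is a quick WER computation using the open-source jiwer package (pip install jiwer); the strings below are illustrative, not benchmark data.

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # 2 substitutions out of 9 words -> ~22%
```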
The architecture, explained
Both models share the same high-level three-component design: a speech encoder, a modality adapter, and a language model, although the decoding mechanisms differ significantly.
The first component is the speech encoder. It uses 16 Conformer blocks trained with Connectionist Temporal Classification (CTC) and two classification heads, one for graphemic (character-level) outputs and one for BPE tokens, using frame importance sampling to focus on informative parts of the audio. The Conformer is a neural network architecture that combines convolutional layers (good at capturing local acoustic patterns) with attention mechanisms (good at capturing long-range dependencies). CTC is a training technique that lets the model learn from audio-text pairs without requiring precise frame-level alignment.
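As a sketch of what "no frame-level alignment" means in practice, here is a minimal CTC training step in PyTorch. The shapes and vocabulary size are illustrative placeholders, not the model's actual configuration.

```python
import torch
import torch.nn as nn

T, N, C = 200, 4, 32  # frames, batch size, vocabulary size (blank token at index 0)

# Stand-in for the encoder's per-frame log-probabilities, shape (T, N, C).
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)

targets = torch.randint(1, C, (N, 20))                 # token IDs; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)  # frames per utterance
target_lengths = torch.full((N,), 20, dtype=torch.long)

# CTC marginalizes over all valid frame-to-token alignments, so no explicit
# alignment labels are ever needed, only the target token sequence.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```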
The second component is the speech-to-text modality adapter. A two-layer windowed query Transformer (Q-Former) operates on blocks of 15 1024-dimensional audio embeddings coming from the final Conformer block and downsamples them by a factor of 5 using 3 trainable queries per block and per layer, for a total temporal downsampling factor of 10, yielding an audio embedding rate of 10 Hz for the LLM. This adapter bridges the gap between continuous acoustic features and discrete text tokens, compressing the speech representation so the language model can process it efficiently. In the NAR model, the Q-Former contains 160 million parameters and downsamples hidden representations drawn from four encoder layers (layers 4, 8, 12, and 16).
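The downsampling arithmetic is easier to see in code. The toy module below summarizes each window of 15 frames with 3 learned queries via cross-attention, a 5x reduction; it is a single-layer illustration of the mechanism, not IBM's implementation.

```python
import torch
import torch.nn as nn

class WindowedQueryPooler(nn.Module):
    def __init__(self, dim=1024, window=15, n_queries=3):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, dim))  # learned queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x):                        # x: (batch, frames, dim)
        b, t, d = x.shape
        x = x[:, : t - t % self.window]          # drop the ragged tail for simplicity
        windows = x.reshape(-1, self.window, d)  # (batch * n_windows, 15, dim)
        q = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, windows, windows)  # 15 frames -> 3 vectors
        return pooled.reshape(b, -1, d)          # (batch, frames // 5, dim)

pooler = WindowedQueryPooler()
audio = torch.randn(2, 150, 1024)   # 150 encoder frames
print(pooler(audio).shape)          # torch.Size([2, 30, 1024]): a 5x reduction
```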
The third component is the language model. Granite Speech 4.1 2B uses an intermediate checkpoint of the Granite 4.0-1b base with a 128k context length, fine-tuned on all training sets. In the NAR variant, this becomes a 1B-parameter bidirectional LLM editor, a Granite 4.0-1b base with its causal attention mask removed to enable bidirectional context, adapted with a rank-128 LoRA applied to both the attention and MLP layers.
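For reference, a rank-128 LoRA over attention and MLP layers might be configured as follows with the Hugging Face PEFT library; the target module names are typical for Granite-style decoders and are assumptions, not details from IBM's training code.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,               # LoRA rank, as reported for the NAR variant
    lora_alpha=256,      # scaling factor (assumed, not from the paper)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_dropout=0.05,
)
# The adapted model would then be built with peft.get_peft_model(base_model, lora_config).
```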
The autoregressive versus non-autoregressive trade-off
This is where the two models diverge most sharply, and the divergence has direct consequences for production deployment.
In the standard Granite Speech 4.1 2B, text is generated autoregressively, one token at a time, each conditioned on every token before it. This yields accurate, stable transcripts with full support for AST, keyword biasing, and punctuation, but it is inherently sequential and slower at scale.
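The sequential bottleneck is visible in a plain greedy decoding loop: every new token requires a full forward pass conditioned on all previous tokens. A generic sketch for any Hugging Face causal LM:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, eos_id, max_new_tokens=128):
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                       # one forward pass per token
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)    # feed it back in
        if (next_id == eos_id).all():
            break
    return input_ids
```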
Granite Speech 4.1 2B-NAR takes a radically different approach. Instead of decoding tokens one by one, it edits the CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with faster inference than autoregressive alternatives. This is the non-autoregressive LLM editing (NLE) design. Concretely: the CTC encoder produces a rough initial transcript, this hypothesis is interleaved with insertion slots, and a bidirectional LLM then predicts edits (copy, insert, delete, or replace) at all positions simultaneously, in a single pass.
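Here is a toy illustration of the editing step, assuming the operation set named above; the real model predicts all edit decisions with a bidirectional LLM in one forward pass, whereas this example simply applies a hand-written edit sequence.

```python
COPY, INSERT, DELETE, REPLACE, SLOT = "copy", "insert", "delete", "replace", "<slot>"

def apply_edits(hypothesis, edits):
    """hypothesis: CTC tokens interleaved with insertion slots.
    edits: one (operation, token) decision per position. In the NAR model all
    decisions are predicted simultaneously, not left to right."""
    out = []
    for token, (op, new_token) in zip(hypothesis, edits):
        if op == COPY:
            out.append(token)                 # keep the CTC token as-is
        elif op in (INSERT, REPLACE):
            out.append(new_token)             # fill a slot, or swap a wrong token
        # DELETE contributes nothing (wrong tokens and unused slots vanish)
    return out

# CTC guessed "see" for "sea" and missed "the"; the slots allow insertions.
hyp = [SLOT, "i", SLOT, "saw", SLOT, "see", SLOT]
edits = [(DELETE, None), (COPY, None), (DELETE, None), (COPY, None),
         (INSERT, "the"), (REPLACE, "sea"), (DELETE, None)]
print(" ".join(apply_edits(hyp, edits)))  # -> "i saw the sea"
```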
The NAR model measured an RTFx of around 1820 on a single H100 GPU using batched inference with a batch size of 128. RTFx (inverse real-time factor) measures how many times faster than real time the model processes audio; an RTFx of 1820 means a one-hour audio file can be transcribed in under two seconds on that hardware. One practical constraint engineers should note: the NAR model requires flash_attention_2 for inference, since this backend supports sequence packing and respects the is_causal=False flag.
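The arithmetic, plus what the flash-attention requirement looks like when loading (attn_implementation is a standard transformers argument; the model ID and class are hypothetical placeholders based on earlier Granite Speech cards):

```python
# RTFx arithmetic: at RTFx ~1820, one hour of audio takes about 2 s of compute.
audio_seconds = 3600
rtfx = 1820
print(f"{audio_seconds / rtfx:.2f} s per hour of audio")  # ~1.98 s

# Loading with the required flash_attention_2 backend:
# from transformers import AutoModelForSpeechSeq2Seq
# model = AutoModelForSpeechSeq2Seq.from_pretrained(
#     "ibm-granite/granite-speech-4.1-2b-nar",   # hypothetical model ID
#     attn_implementation="flash_attention_2",
#     torch_dtype=torch.bfloat16,
# )
```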
Training data and infrastructure
The two models were trained on different datasets. The standard model was trained on 174,000 hours of audio from public sources for ASR and AST, as well as synthetic datasets designed to support Japanese ASR, keyword-biased ASR, and speech translation. The NAR model was trained on approximately 130,000 hours of speech across five languages using publicly available datasets including CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and Switchboard.
The infrastructure gap between the two is equally clear. Training the standard model took 30 days, 26 days for the encoder and 4 days for the decoder, on 8 H100 GPUs. The NAR model was trained in just 3 days on 16 H100 GPUs (2 nodes) for 5 epochs, a much lighter run that reflects the architectural simplicity of editing versus full autoregressive generation.
Key takeaways
Here are five key points:
- IBM has released two open ASR models, Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive), both around 2B parameters and both licensed under Apache 2.0.
- The standard model achieves an average WER of 5.33 on the Open ASR Leaderboard and supports ASR in 6 languages (including Japanese), bidirectional speech translation, keyword biasing, and punctuated, true-cased output, competitive with models several times its size.
- The NAR model trades capability for speed. It drops Japanese, AST, and keyword biasing, but delivers an RTFx of ~1820 on a single H100 GPU by editing the CTC hypothesis in a single forward pass instead of generating tokens one by one.
- The architecture has three core components: a 16-layer Conformer encoder trained with a dual-head CTC, a 2-layer Q-Former projector that downsamples the audio to a 10 Hz embedding rate, and a fine-tuned Granite 4.0-1b base language model.
- A third variant, Granite Speech 4.1 2B-Plus, also exists. It extends the standard model with speaker-attributed ASR and word-level timestamps for applications that require speaker identity and precise timing.
Check out the Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR model cards on Hugging Face.