A team of researchers from Meta, Stanford University, and the University of Washington has presented three new methods that speed up generation with the Byte Latent Transformer (BLT) – a language modeling architecture that operates directly on raw bytes rather than tokens.
Byte-level models are slow at inference
To understand what this new research solves, you need to understand the trade-off at the center of byte-level language modeling.
Most language models today operate on tokens – chunks of text produced by subword tokenizers such as Byte Pair Encoding (BPE). A token usually represents several characters or even an entire word. While effective, tokenization has known downsides: sensitivity to input noise, poor handling of multilingual text, weak character-level understanding, and brittleness on structured input such as code and numbers.
Byte-level models avoid all of this by working directly on raw bytes – the lowest-level representation of text. The Byte Latent Transformer (BLT) was a big step forward: it matched the performance of tokenization-based models at scale by dynamically grouping bytes into variable-length patches using an entropy-based segmentation strategy. Regions with high entropy (hard to predict) get shorter patches; more predictable stretches get longer ones. The bulk of the compute operates over latent patch representations, not raw bytes, using three components: a local encoder, a large global transformer, and a local decoder – with an average patch size of about 4 bytes and a maximum of 8.
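To make the patching idea concrete, here is a minimal sketch (our illustration, not the authors' implementation) of entropy-based segmentation: a small byte-level model scores the entropy of each next-byte distribution, and a new patch starts whenever that entropy crosses a threshold or the patch hits the maximum length.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a next-byte probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def entropy_patches(byte_seq, next_byte_dists, threshold=2.0, max_patch=8):
    """Group bytes into variable-length patches.

    A new patch starts when the predicted entropy for the next byte exceeds
    `threshold` (a hard-to-predict region) or the current patch reaches
    `max_patch` bytes. The threshold value and function names are illustrative.
    """
    patches, current = [], []
    for b, dist in zip(byte_seq, next_byte_dists):
        if current and (entropy(dist) > threshold or len(current) >= max_patch):
            patches.append(bytes(current))
            current = []
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```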
The remaining problem is inference speed. Even with BLT's hierarchical design, the local decoder still generates one byte at a time autoregressively. Since a typical subword token corresponds to several bytes, BLT needs multiple decoder forward passes to produce the same amount of text a token-level model emits in a single step. In modern LLM serving, the bottleneck is often not compute but memory bandwidth – repeatedly loading model weights and the key-value cache from memory. More decoder forward passes mean more memory traffic, which translates directly into slower generation.


Three methods, one goal: fewer forward passes
The research team proposes three techniques that reduce this bottleneck, each trading off speed against generation quality differently.
BLT Diffusion (BLT-D)
This is the core contribution and the fastest variant. The basic idea is to replace autoregressive byte-by-byte decoding with discrete block diffusion in the local decoder.
During training, the decoder receives two inputs: a clean byte sequence (the original text) and a corrupted sequence of fixed-length byte blocks. For each block, a continuous diffusion time step t ~ U(0,1) is sampled, and each byte in the block is independently replaced by a [MASK] token with probability t. The degree of masking therefore varies across training examples – a low t leaves most bytes visible; a high t masks most of them. The block size B (set to 4, 8, or 16 bytes in the experiments) typically exceeds BLT's average patch size of 4 bytes, which teaches the decoder to predict further into the future than it normally would. The total training loss combines the standard next-byte prediction loss on the clean sequence with a masked-byte prediction loss on the corrupted blocks – conceptually similar to masked language modeling in BERT, but applied at the byte level within BLT's hierarchical structure.
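A minimal sketch of that block-masking scheme, assuming a reserved MASK_ID and PyTorch tensors; the names and exact loss bookkeeping are illustrative, not the paper's code.

```python
import torch

MASK_ID = 256  # hypothetical id for [MASK]; real byte values occupy 0-255

def corrupt_blocks(byte_ids: torch.Tensor, block_size: int = 8):
    """Per block, draw t ~ U(0,1) and mask each byte independently with prob. t.

    Returns the corrupted sequence plus a boolean mask marking the positions
    that receive the masked-byte prediction loss; the clean sequence is still
    scored with the usual next-byte prediction loss.
    """
    corrupted = byte_ids.clone()
    loss_mask = torch.zeros_like(byte_ids, dtype=torch.bool)
    for start in range(0, byte_ids.numel(), block_size):
        end = min(start + block_size, byte_ids.numel())
        t = torch.rand(()).item()                 # per-block noise level
        masked = torch.rand(end - start) < t      # independent per-byte masking
        corrupted[start:end][masked] = MASK_ID
        loss_mask[start:end] = masked
    return corrupted, loss_mask
```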
At inference, BLT-D initializes a block of [MASK] positions and iteratively unmasks multiple byte positions per decoder step using one of two strategies: confidence-based unmasking (reveal positions whose predicted probability exceeds a threshold α) or entropy-bounded (EB) sampling (select the largest subset of positions whose cumulative entropy stays below a threshold γ). Both strategies generate multiple bytes per forward pass instead of one. The encoder and global model – BLT's expensive components – are called once per block instead of once per patch, further reducing total model calls. BLT-D also supports KV caching and benefits from any technique that shrinks the KV-cache footprint.
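The two unmasking strategies can be sketched as below, assuming `logits` of shape (block_size, vocab) from the decoder and `still_masked` as a boolean vector; α, γ, and the greedy ordering are placeholders for the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def confidence_unmask(logits, still_masked, alpha=0.9):
    """Reveal every still-masked position whose top-1 probability exceeds alpha."""
    probs = F.softmax(logits, dim=-1)
    top_p, top_ids = probs.max(dim=-1)
    reveal = still_masked & (top_p > alpha)
    return reveal, top_ids

def entropy_bounded_unmask(logits, still_masked, gamma=2.0):
    """Reveal the largest set of positions whose cumulative entropy stays below gamma.

    Positions are taken in order of increasing entropy (most certain first);
    a real implementation would force at least one reveal to guarantee progress.
    """
    probs = F.softmax(logits, dim=-1)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    ent = ent.masked_fill(~still_masked, float("inf"))   # skip known positions
    reveal = torch.zeros_like(still_masked)
    total = 0.0
    for idx in ent.argsort():
        if total + ent[idx].item() > gamma:
            break
        reveal[idx] = True
        total += ent[idx].item()
    return reveal, probs.argmax(dim=-1)
```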
At 3B parameters, BLT-D-4 (block size 4) nearly matches BLT's task scores while requiring less than half the memory bandwidth. BLT-D-16 (block size 16) achieves an 87-92% reduction in estimated memory-bandwidth cost compared to BLT, making it the fastest configuration evaluated – albeit with lower pass@1 scores on the coding benchmarks (HumanEval, MBPP).
BLT Self-Speculative Decoding (BLT-S)
BLT-S takes a different route, building on speculative decoding – a technique where a cheap draft model proposes tokens and a larger model verifies them in parallel. What makes BLT-S unusual is that it requires no separate draft model, no architectural changes, and no additional training. It reuses BLT's lightweight local decoder as the draft model.
In standard BLT inference, the decoder stops generating when the entropy-based patcher determines that a new patch boundary has been reached – roughly every four bytes on average. BLT-S instead lets the decoder keep drafting up to a fixed window size k (8 or 16 bytes in the experiments) regardless of entropy spikes, conditioning on the last available latent patch representation. After drafting k bytes, the full model re-encodes the candidate sequence through the encoder, global model, and decoder and produces next-byte predictions. Drafted bytes are accepted up to the first mismatch; the first mismatched byte is replaced by the verified prediction.
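Schematically, one draft-and-verify step under greedy decoding looks like the sketch below; `draft_bytes` and `verify_next_bytes` are assumed interfaces standing in for the local decoder and the full encoder → global model → decoder pass, not the released API.

```python
def speculative_step(prefix: bytes, draft_bytes, verify_next_bytes, k: int = 16):
    """One BLT-S-style step: draft k bytes cheaply, verify them in one full pass.

    draft_bytes(prefix, k)   -> k candidate bytes from the local decoder alone.
    verify_next_bytes(seq)   -> greedy next-byte prediction at every position
                                of seq, from encoder + global model + decoder.
    Drafted bytes are accepted up to the first mismatch; that byte is replaced
    by the verified prediction, so output matches standard greedy BLT decoding.
    """
    draft = draft_bytes(prefix, k)                 # cheap: decoder only
    verified = verify_next_bytes(prefix + draft)   # one expensive full pass
    accepted = []
    for i, b in enumerate(draft):
        expected = verified[len(prefix) + i - 1]   # prediction for this position
        if b == expected:
            accepted.append(b)
        else:
            accepted.append(expected)
            break
    return prefix + bytes(accepted)
```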
Under greedy decoding, this procedure guarantees that the verified output is identical to standard BLT decoding – so there is no loss of quality. BLT-S slightly increases decoder forward passes but sharply reduces encoder and global-model calls. At 3B parameters with k=16, BLT-S achieves up to a 77% reduction in estimated memory bandwidth with no loss in task performance.
BLT Diffusion + Verification (BLT-DV)
BLT-DV sits in the middle. Because BLT-D is trained with both the diffusion objective and the standard next-byte prediction objective, the same model weights can also be run autoregressively with a causal decoder mask – no separate model and no extra training needed. BLT-DV exploits this: diffusion first drafts a block of bytes, then a single autoregressive forward pass verifies the draft, accepting bytes up to the first mismatch. Experimentally, single-step diffusion plus verification was the fastest BLT-DV configuration. Single-step diffusion on its own typically degrades generation quality quickly, but the verification step effectively prevents this. At 3B parameters, BLT-DV achieves up to an 81% reduction in estimated memory bandwidth compared to BLT.
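The same accept-until-mismatch logic carries over, with a one-step diffusion draft and a causal pass through the same weights as verifier; as before, the callables are hypothetical stand-ins for illustration.

```python
def blt_dv_step(prefix: bytes, diffusion_draft, causal_next_bytes, block: int = 8):
    """Sketch of BLT-DV: single-step diffusion drafts a block, one causal pass verifies.

    diffusion_draft(prefix, block) -> block candidate bytes, all unmasked in a
                                      single diffusion step (fast, possibly noisy).
    causal_next_bytes(seq)         -> greedy next-byte predictions from the same
                                      model run with a causal decoder mask.
    """
    draft = diffusion_draft(prefix, block)
    verified = causal_next_bytes(prefix + draft)
    out = []
    for i, b in enumerate(draft):
        expected = verified[len(prefix) + i - 1]
        if b != expected:
            out.append(expected)   # first mismatch: keep the verified byte, stop
            break
        out.append(b)
    return prefix + bytes(out)
```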
Understanding the numbers
All models were trained on the BLT-1T dataset (1 trillion tokens from publicly available sources, including a subset of DataComp-LM), with 1B-parameter models trained for 240,000 steps and 3B-parameter models for 480,000 steps. The evaluation covered four generation tasks: French-to-English and German-to-English translation on the FLORES-101 benchmark (4-shot, SentencePiece BLEU) and two coding benchmarks – HumanEval (0-shot, pass@1) and MBPP (3-shot, pass@1).
Beyond the generation tasks, the research team also evaluated BLT-D on five likelihood-based benchmarks: ARC-Easy, ARC-Challenge, PIQA, HellaSwag, and MMLU. Because BLT-D is trained on the next-byte prediction objective alongside the diffusion objective, it can compute autoregressive likelihoods by applying a causal mask to the decoder – the same mechanism BLT-DV's verification step relies on. The results show that the BLT-D variants score close to the BLT baseline on all five benchmarks, confirming that adding block diffusion does not degrade the model's autoregressive ability.
Efficiency is reported through three proxy metrics: decoder network function evaluations (NFEs), encoder/global-model NFEs, and an estimated memory-bandwidth figure in GB derived from parameter counts and the number of 16-bit forward passes. The research team is explicit that these are proxy metrics – converting NFE reductions into actual wall-clock speedups requires a well-optimized inference implementation, which they flag as the most important direction for future work.
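As a rough reading of that bandwidth proxy (our back-of-the-envelope interpretation, not the paper's exact formula): each 16-bit forward pass reads a component's parameters once, so the estimate scales with parameters × 2 bytes × number of forward passes, summed over encoder, global model, and decoder.

```python
def estimated_bandwidth_gb(n_params: float, n_forward_passes: int) -> float:
    """Proxy: bytes moved ~= parameters * 2 (fp16/bf16) per forward pass."""
    return n_params * 2 * n_forward_passes / 1e9

# Illustrative numbers only: a 3B-parameter component read once per pass moves
# ~6 GB of weights, so halving its forward passes roughly halves this figure.
print(estimated_bandwidth_gb(3e9, 10))   # -> 60.0 GB for 10 passes
```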
Translation tasks hold up well under BLT-D across all block sizes. Coding tasks are more sensitive to block size: BLT-D-16 delivers the largest efficiency gains but shows a noticeable drop in HumanEval and MBPP scores. A further notable result comes from the generation-diversity analysis: when entropy-bounded sampling with higher top-p is used at inference, more decoder NFEs correlate with a higher token-type ratio (a measure of lexical diversity). In other words, the efficiency-diversity trade-off is tunable at inference time without any retraining.


Key takeaways
- BLT-D introduces discrete block diffusion in BLT's local decoder, trained with combined next-byte prediction and masked-byte prediction losses, so the decoder can generate multiple bytes per forward pass instead of one at a time
- BLT-S reuses BLT's lightweight local decoder as a speculative draft model – no separate model, no architectural changes, no additional training – and produces output identical to standard BLT under greedy decoding.
- BLT-DV combines diffusion drafting with an autoregressive verification step using the same BLT-D model weights, recovering the quality lost by diffusion-only decoding without additional training.
- All three approaches cut the estimated memory-bandwidth cost by more than 50% relative to BLT on generation tasks; BLT-D-16 reaches an 87-92% reduction
- BLT-D's autoregressive capability remains robust on likelihood-based benchmarks (ARC-Easy, ARC-Challenge, PIQA, HellaSwag, and MMLU), and its generation diversity can be tuned at inference time via the entropy-bounded sampling threshold.