NVIDIA AI releases Nemotron Elastic: a single checkpoint containing 30B, 23B, and 12B reasoning models with zero-shot extraction


Training a family of large language models (LLMs) has always come with a painful complication: each model in the family – whether 8B, 30B, or 70B – typically requires its own complete training run, its own storage, and its own deployment stack. For a team running inference at scale, this multiplies compute costs by the number of model sizes they want to support. NVIDIA researchers now propose a different approach called Nemotron Elastic.

Nemotron Elastic is a post-training method that embeds multiple nested sub-models – each with a different parameter budget – inside a single parent reasoning model, using a single training run. Applied to Nemotron Nano v3 (a hybrid Mamba-Transformer-MoE model with 30B total parameters and 3.6B active parameters), Nemotron Elastic produces 23B (2.8B active) and 12B (2.0B active) nested variants trained with about 160B tokens. All three variants live in a single checkpoint and can be extracted without any additional tuning.

What does “nested” actually mean here?

If you haven’t encountered elastic or nested architectures before, the idea is this: instead of training three separate 30B, 23B, and 12B models, you train one model that contains the smaller ones as subsets of itself. The smaller sub-models reuse the most important weights of the parent model, identified through a process called importance estimation.

Nemotron Elastic scores each component of the model – embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels – by how much it contributes to model accuracy. The components are then ranked and sorted, so that smaller-budget sub-models always use the highest-ranked contiguous subset of components from the larger model. This is what gives the variants their overlapping, shared weights.
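
The mechanics are easiest to see in a toy sketch. The snippet below is an illustration of the idea, not NVIDIA's released code: it scores a hypothetical FFN's channels with a made-up importance proxy, ranks them once, and shows how every budget keeps a prefix of the same ranking – which is exactly what makes the sub-models nested.

python

import torch

# Toy illustration of importance-based component ranking (not NVIDIA's code).
# Assume a hypothetical FFN with 8 intermediate channels and a proxy importance
# score per channel, e.g. mean absolute activation over a calibration batch.
ffn_importance = torch.tensor([0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4])

# Rank channels once, from most to least important.
ranking = torch.argsort(ffn_importance, descending=True)

# Every budget keeps a prefix of the SAME ranking, so the 4-channel sub-model
# is a strict subset of the 6-channel one, which is a subset of the full model.
budgets = {"full": 8, "medium": 6, "small": 4}
masks = {name: ranking[:k] for name, k in budgets.items()}

for name, kept in masks.items():
    print(name, sorted(kept.tolist()))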

The method supports nesting along multiple axes: the SSM (state space model) head dimension, embedding channels, attention heads, Mamba heads, the number of MoE experts, and the FFN intermediate dimension. For MoE layers specifically, Nemotron Elastic uses Router-weighted Expert Activation Pruning (REAP), which ranks experts by a combination of their router gate values and expert output magnitudes – a more principled signal than simple frequency-based pruning, which ignores how much each expert actually contributes to the layer’s output.
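
As a rough sketch of the REAP idea (my own simplification with hypothetical tensor names, not the exact formulation), an expert's saliency can be taken as the average of its router gate value times the magnitude of its output on the tokens routed to it; the lowest-saliency experts are the ones a smaller budget drops.

python

import torch

# Simplified REAP-style expert scoring (illustrative only).
# gate_probs: router probabilities per token, shape [num_tokens, num_experts].
# expert_out_norms: norm of each expert's output per token, same shape,
# assumed to be zero for tokens not routed to that expert.
num_tokens, num_experts = 1024, 8
gate_probs = torch.rand(num_tokens, num_experts).softmax(dim=-1)
expert_out_norms = torch.rand(num_tokens, num_experts)

# Saliency = mean over tokens of (router gate value x expert output magnitude).
saliency = (gate_probs * expert_out_norms).mean(dim=0)

# A smaller nested budget keeps the top-k experts under this ranking,
# rather than simply keeping the most frequently selected ones.
keep_k = 6
kept_experts = torch.topk(saliency, keep_k).indices
print("experts kept at the 6-expert budget:", sorted(kept_experts.tolist()))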

A learnable router, not a static compression recipe

The main difference from previous compression methods such as Minitron is that Nemotron Elastic uses an end-to-end trainable router to define the nested sub-model structures. The router takes a target budget (for example, “give me a 2.8B active-parameter model”) as input and outputs differentiable masks that identify which components are active at that budget. These masks are trained jointly with the model weights through Gumbel-Softmax, which lets gradients flow through discrete architectural decisions.
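
A minimal sketch of this mechanism, under my own simplifying assumptions (the shapes, layer sizes, and candidate widths are invented; the released implementation will differ): the router maps a budget embedding to logits over candidate widths, and torch.nn.functional.gumbel_softmax turns those logits into a near-one-hot choice that still passes gradients.

python

import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetRouter(nn.Module):
    """Toy router: maps a target-budget embedding to a differentiable choice
    among candidate FFN widths. Illustrative only, not NVIDIA's implementation."""
    def __init__(self, num_choices=4, budget_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(budget_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_choices))

    def forward(self, budget_embedding, tau=1.0):
        logits = self.mlp(budget_embedding)
        # Straight-through Gumbel-Softmax: hard one-hot in the forward pass,
        # soft gradients in the backward pass.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

router = BudgetRouter()
budget = torch.randn(1, 16)              # stands in for "2.8B active parameters"
choice = router(budget)                  # e.g. tensor([[0., 1., 0., 0.]])

# The one-hot choice selects one of several prefix masks over FFN channels.
candidate_widths = torch.tensor([2048., 4096., 6144., 8192.])
selected_width = (choice * candidate_widths).sum()
print("selected FFN width:", int(selected_width.item()))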

The loss function combines knowledge distillation (KD) from the original non-elastic model, which acts as the teacher, with a router loss that penalizes deviation from the target resource budget (parameter count, memory, or latency). This means the router learns architectural choices that actually improve accuracy under the KD objective, rather than merely satisfying a proxy size constraint.
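
In pseudocode, the combined objective looks roughly like the sketch below (my paraphrase of the description above; the weighting term lam and the way the budget is estimated are placeholders, not values from the paper).

python

import torch
import torch.nn.functional as F

def elastic_training_loss(student_logits, teacher_logits,
                          predicted_budget, target_budget, lam=0.1):
    """Sketch of the combined objective (illustrative only).

    student_logits:   logits of the nested sub-model sampled at this step
    teacher_logits:   logits of the frozen full (non-elastic) teacher
    predicted_budget: differentiable estimate of the sub-model's resource use
                      implied by the router masks (params, memory, or latency)
    target_budget:    the budget requested for this step
    """
    # Knowledge-distillation term: match the teacher's token distribution.
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    # Router term: penalize relative deviation from the requested budget.
    budget_penalty = ((predicted_budget - target_budget) / target_budget).pow(2).mean()
    return kd + lam * budget_penalty

# Example with dummy tensors.
student = torch.randn(8, 32000, requires_grad=True)
teacher = torch.randn(8, 32000)
loss = elastic_training_loss(student, teacher,
                             torch.tensor([2.9e9]), torch.tensor([2.8e9]))
loss.backward()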

Training uses a two-stage curriculum: a short-context phase (sequence length 8,192 tokens) with uniform budget sampling, followed by an extended-context phase (sequence length 49,152 tokens) with non-uniform sampling that prioritizes the full 30B model (p(30B)=0.5, p(23B)=0.3, p(12B)=0.2). The extended-context phase is crucial for reasoning performance. The research team’s ablations on Nano v2 – cited as the empirical basis for making the same curriculum choice on Nano v3 – show gains of up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant from stage 2 alone, motivating its use here.
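
A compact way to picture the sampling schedule (an illustrative sketch; the probabilities and context lengths come from the description above, everything else is scaffolding):

python

import random

BUDGETS = ["30B", "23B", "12B"]

def sample_budget_stage1():
    # Stage 1: short context (8,192 tokens), uniform sampling over budgets.
    return random.choice(BUDGETS)

def sample_budget_stage2():
    # Stage 2: extended context (49,152 tokens), non-uniform sampling that
    # prioritizes the full model: p(30B)=0.5, p(23B)=0.3, p(12B)=0.2.
    return random.choices(BUDGETS, weights=[0.5, 0.3, 0.2], k=1)[0]

print(sample_budget_stage1(), sample_budget_stage2())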

Elastic budget control: different models for different stages of thinking

Existing budget control in reasoning models, including the default behavior of Nemotron Nano v3, works by capping the number of tokens generated during the thinking stage before forcing the final answer. That approach uses the same model throughout. Nemotron Elastic opens up a different strategy: using different nested sub-models for the thinking stage versus the answering stage.

The researchers evaluated four configurations. The best, denoted ℳS → ℳL (small model for thinking, large model for answering), assigns a cheaper model to generate the long reasoning trace and reserves the full-capacity model for composing the final answer. The 23B → 30B configuration in particular pushes the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency compared to the default Nemotron Nano v3 budget control. The intuition: reasoning traces are long but tolerate some loss of capacity, while the final answer benefits from full accuracy.
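
As an illustrative sketch of the ℳS → ℳL pattern (the extract_nested_variant helper below is hypothetical; how the 23B sub-model is actually exposed depends on the released tooling), the thinking stage runs on the cheaper sub-model and the answering stage on the full model:

python

# Sketch of the M_S -> M_L inference pattern (hypothetical helper names).

def budgeted_reasoning(prompt, small_model, large_model, tokenizer,
                       think_budget=8192, answer_budget=1024):
    # Stage 1: the cheaper 23B sub-model generates the long thinking trace.
    inputs = tokenizer(prompt, return_tensors="pt").to(small_model.device)
    trace = small_model.generate(**inputs, max_new_tokens=think_budget)

    # Stage 2: the full 30B model conditions on the trace and writes the answer.
    # (In practice you would also close the thinking segment before switching.)
    answer = large_model.generate(input_ids=trace.to(large_model.device),
                                  max_new_tokens=answer_budget)
    return tokenizer.decode(answer[0], skip_special_tokens=True)

# model_23b = extract_nested_variant(model, target="23B")   # hypothetical helper
# print(budgeted_reasoning("Prove that sqrt(2) is irrational.",
#                          model_23b, model, tokenizer))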

Quantization without breaking the nested structure

A naive way to deploy a quantized elastic model is to quantize each variant separately after extraction. That breaks the shared-weight nesting property and requires a separate quantization pass per size. Instead, Nemotron Elastic applies Quantization-Aware Distillation (QAD) directly to the elastic checkpoint, maintaining the nested mask hierarchy throughout.

For FP8 (E4M3 format), post-training quantization (PTQ) is sufficient, recovering 98.69% of BF16 accuracy on the 30B variant. For NVFP4 (NVIDIA’s 4-bit floating-point format), PTQ alone causes an average accuracy drop of 4.12%, so a short QAD phase (about 5 billion tokens at a 48K context length) brings recovery back to 97.79% for the 30B variant. In both cases, zero-shot extraction of the 23B and 12B variants from the single quantized checkpoint is preserved.
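
The core QAD move can be sketched in a few lines (illustrative only, not NVIDIA's recipe; FP8 is used here because recent PyTorch ships a float8 dtype, while the NVFP4 path follows the same fake-quantize-then-distill pattern with 4-bit quantization):

python

import torch

def fake_quantize_fp8_e4m3(w: torch.Tensor) -> torch.Tensor:
    """Straight-through fake quantization: the forward pass sees FP8 (E4M3)
    rounding error, while gradients flow to the original weights.
    A sketch of the QAD idea; requires a PyTorch build with float8 support."""
    with torch.no_grad():
        q = w.to(torch.float8_e4m3fn).to(w.dtype)
    return w + (q - w).detach()

# During QAD, the elastic student runs its forward pass with fake-quantized
# weights and the usual nested budget masks, and is distilled against the BF16
# teacher - so the 30B/23B/12B variants stay extractable from one checkpoint.
w = torch.randn(4, 4, requires_grad=True)
print(fake_quantize_fp8_e4m3(w))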

The memory implications are significant. Storing separate 12B, 23B, and 30B BF16 checkpoints requires 126.1 GB; one elastic checkpoint requires 58.9 GB. The 30B NVFP4 elastic checkpoint fits in 18.7 GB, allowing the 12B NVFP4 variant to run on an RTX 5080 where every BF16 configuration runs out of memory. On the RTX Pro 6000, the 12B NVFP4 variant reaches 7,426 tokens/s, a 3.4x throughput improvement over the 30B BF16 baseline.

Depth vs. width: why does Nemotron Elastic compress width?

One design choice worth calling out explicitly: the research team compared two compression strategies – removing layers entirely (depth compression) versus reducing internal dimensions such as hidden size, expert count, and head count (width compression). At a 15% parameter reduction with 25 billion tokens of knowledge distillation, width compression recovered 98.1% of baseline performance while depth compression recovered only 95.2%, with significant degradation on HumanEval and MMLU-Pro. As a result, Nemotron Elastic prioritizes width-based elasticity for its main results, although depth compression (layer skipping) remains available as a mechanism for severely latency-constrained scenarios.

On the evaluation suite – AIME-2025, GPQA, LiveCodeBench v5, MMLU-Pro, IFBench, and Tau Bench – the Elastic-30B variant matches the original Nemotron Nano v3 30B on most benchmarks, while the Elastic-23B and Elastic-12B variants remain competitive with independently trained models of similar size. Elastic-23B scores 85.63 on AIME-2025 versus 80.00 for Qwen3-30B-A3B, despite having fewer active parameters.

On training cost, the research team reports a 360x token reduction compared to pre-training each variant from scratch, and a 7x reduction over prior state-of-the-art compression methods that required sequential distillation runs for each model size. The 12B variant runs at 2.4x the throughput of its 30B parent on an H100 GPU in bfloat16 at the same input/output sequence lengths.

How to use NVIDIA Nemotron Elastic

Step-by-step guide

Nemotron Nano v3 Elastic – 30B / 23B / 12B in one checkpoint · BF16 / FP8 / NVFP4

Nemotron Elastic models are distributed via Hugging Face and support both transformers (for experimentation) and vLLM (recommended for production inference). Choose the option that best suits your use case.

bash

# Option A - vLLM (recommended for production serving)
pip install vllm

# Option B - Transformers (for local experimentation)
pip install transformers torch accelerate

# Optional: log in to Hugging Face if needed
pip install huggingface_hub
huggingface-cli login



Hardware Note: The 30B BF16 checkpoint requires about 60GB of VRAM for the entire nested family. Use FP8 (~31GB) or NVFP4 (~19GB) for H100/A100 or RTX series deployment.

A single checkpoint contains all three reasoning variants – 30B (3.6B active), 23B (2.8B active), and 12B (2.0B active). Download once; extract any variant without retraining. The model requires trust_remote_code=True for the hybrid Mamba-Transformer-MoE architecture.

python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# The 30B BF16 elastic checkpoint - contains all 3 nested variants
model_id = "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"      distributes across available GPUs
)

print(f"Model loaded: {model_id}")



Active parameters vs total parameters: “30 billion total / 3.6 billion active” means that the model stores 30 billion weights but only routes each token through 3.6 billion parameters per forward pass – this is how Mixture of Experts (MoE) works.

The model emits thinking tokens to build a reasoning trace before producing its final answer. Control the total token budget via max_new_tokens – higher values allow longer reasoning traces on difficult problems.

python

messages = [
    {
        "role": "user",
        "content": "What is the time complexity of QuickSort, and why?"
    }
]

# Apply chat template and tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate - the model produces a thinking trace, then the final answer
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,     # thinking + answer budget
    temperature=0.6,
    top_p=0.95,
    do_sample=True
)

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)



Thinking-budget tip: for math/coding problems, set max_new_tokens to 8192–32768. For simpler queries, 2048–4096 is sufficient and reduces latency.

For production deployments, use vLLM to serve the model via an OpenAI-compatible REST API. This enables batched inference, continuous batching, and higher throughput – the regime in which the 12B variant reaches 2.4x the throughput of its 30B parent on an H100 GPU.

bash

# Start the vLLM server (OpenAI-compatible)
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

# --- In a separate terminal ---

# Query the server via curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16",
    "messages": [
      {
        "role": "user",
        "content": "Explain gradient descent in 3 steps."
      }
    ],
    "max_tokens": 4096,
    "temperature": 0.6
  }'

# Or run via Docker
docker model run hf.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16



SGLang alternative: SGLang is also supported – run python3 -m sglang.launch_server --model-path "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16" --port 30000 as a drop-in replacement for vLLM.
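
You can also hit the same OpenAI-compatible endpoint from Python with the openai client (assuming the default port 8000 used above; the api_key value is a dummy, since vLLM does not check it unless you start the server with --api-key):

python

from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Explain gradient descent in 3 steps."}],
    max_tokens=4096,
    temperature=0.6,
)
print(response.choices[0].message.content)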

Three quantized checkpoints are available. Each preserves the nested structure – the 23B and 12B sub-models can be extracted zero-shot from whichever precision checkpoint you load. NVFP4 uses Quantization-Aware Distillation (QAD) to recover the accuracy lost under PTQ.

bash

# BF16 - full precision, all nested variants in 58.9 GB
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

# FP8 (E4M3) - ~2x smaller, 30B fits in 31.4 GB
# Post-training quantization, 98.69% accuracy recovery on 30B
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-FP8"

# NVFP4 - smallest footprint, 30B fits in 18.7 GB
# 12B NVFP4 variant runs on RTX 5080 (BF16 OOMs)
# 12B NVFP4 on RTX Pro 6000: 7,426 tokens/s (3.4x vs 30B BF16)
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4"

Variant        30B memory   23B memory   12B memory   Best for
BF16 (full)    58.9 GB      44.0 GB      23.2 GB      A100 / H100
FP8 (PTQ)      31.4 GB      23.7 GB      13.0 GB      H100 / A100 / RTX 5090
NVFP4 (QAD)    18.7 GB      14.1 GB      8.0 GB       RTX 5080 / 5090 / Pro 6000



Key takeaways

  • Nemotron Elastic trains 30B, 23B, and 12B nested reasoning models in a single post-training run of about 160B tokens, achieving a 360x token reduction compared to pre-training each size from scratch.
  • Elastic budget control (23B for thinking, 30B for answering) pushes the accuracy-latency Pareto frontier, with up to 16% higher accuracy and 1.9x lower latency.
  • A learnable router trained with Gumbel-Softmax enables end-to-end trainable architecture selection, eliminating the need for a separate compression run per model size.
  • Nested QAD preserves zero-shot extraction across the FP8 and NVFP4 quantized checkpoints, shrinking the 30B elastic checkpoint to 18.7 GB in NVFP4.
  • All three precision variants (BF16, FP8, NVFP4) are publicly available on Hugging Face under nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B.

Check out the paper and the Nemotron Elastic models (BF16, FP8, and NVFP4) on Hugging Face.
