Understanding LLM Distillation Techniques


Modern large language models are no longer trained solely on raw Internet text. Increasingly, companies are using powerful “teacher” models to help train smaller or more efficient “student” models. This process, widely known as LLM distillation or model-to-model training, has become an essential technique for building high-performance models at lower computational cost. Meta used the massive Llama 4 Behemoth model to help train Llama 4 Scout and Maverick, while Google leveraged Gemini models during the development of Gemma 2 and Gemma 3. Likewise, DeepSeek distilled the reasoning capabilities of DeepSeek-R1 into smaller models based on Qwen and Llama.

The basic idea is simple: instead of learning only from human-written text, a student model can also learn from the outputs, probabilities, reasoning traces, or behaviors of another LLM. This allows smaller models to inherit capabilities such as reasoning, instruction following, and structured generation from much larger systems. Distillation can occur during pre-training, where teacher and student models are trained together, or during post-training, where a fully trained teacher transfers knowledge to a separate student model.

In this article, we will explore the three main methods used to train LLMs with the help of other LLMs: soft-label distillation, where the student learns from the teacher’s probability distributions; hard-label distillation, where the student imitates the teacher’s final outputs; and co-distillation, where multiple models learn collaboratively by sharing predictions and behaviors during training.

Soft-label distillation

Soft-label distillation is a training technique where a smaller student model learns by imitating the output probability distribution of a larger teacher model. Instead of training only on the correct next token, the student is trained to match the teacher’s softmax probabilities across the entire vocabulary. For example, if the teacher predicts the next token with probabilities such as “cat” = 70%, “dog” = 20%, and “animal” = 10%, the student learns not only the final answer but also the relationships and uncertainties among candidate tokens. This richer signal is often called the teacher’s “dark knowledge” because it contains hidden information about the teacher’s patterns of reasoning and semantic understanding.
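To make the loss concrete, here is a minimal PyTorch-style sketch of a soft-label distillation objective. The function name, tensor shapes, and the temperature value are illustrative assumptions rather than details from any of the systems mentioned above.

```python
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's softened distribution and the
    student's, computed over the full vocabulary.
    Expected shapes: (batch, seq_len, vocab_size)."""
    t = temperature
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Flatten so every token position counts as one example;
    # reduction="batchmean" then averages the KL over positions.
    kl = F.kl_div(
        student_log_probs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    )
    # Scale by T^2 so gradient magnitudes stay comparable across
    # temperatures (the convention from the original distillation paper).
    return kl * (t ** 2)
```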

The biggest advantage of soft-label distillation is that it allows smaller models to inherit capabilities from much larger models while remaining faster and cheaper to deploy. Since the student learns from the teacher’s full probability distribution, training becomes more stable and informative compared to learning from hard one-token targets alone. However, this method also comes with practical challenges. To create soft labels, you need access to the logits or weights of the teacher model, which is often not possible with closed-source models. In addition, storing probability distributions for every token across vocabularies of more than 100K entries becomes extremely memory-intensive at LLM scale, making pure soft-label distillation expensive for trillion-token datasets.

Hard-label distillation

Hard-label distillation is a simpler approach where the student LLM learns only from the final predicted output tokens of the teacher model rather than from the full probability distribution. In this setting, a pre-trained teacher model generates the most likely next token or response, and the student model is trained with standard supervised learning to reproduce that output. The teacher essentially acts as a high-quality annotator that creates synthetic training data for the student. DeepSeek used this approach to distill the reasoning capabilities of DeepSeek-R1 into smaller Qwen and Llama 3.1 models.

Unlike soft-label distillation, the student does not see the teacher’s internal confidence scores or token-level relationships; it only learns the final answer. This makes hard-label distillation much cheaper computationally and easier to implement, since there is no need to store huge probability distributions for each token. It is also especially useful when working with proprietary “black box” models such as the GPT-4 APIs, where developers only have access to the generated text and not the underlying logits. Although hard labels carry less information than soft labels, they are still very effective for instruction tuning, reasoning datasets, synthetic data generation, and domain-specific fine-tuning tasks.
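The pipeline reduces to two steps: generate text with the teacher, then fine-tune the student on it with ordinary cross-entropy. Below is a minimal sketch; the Hugging Face-style `tokenizer` and `teacher.generate` calls and all names are assumptions for illustration, not a prescribed API.

```python
import torch.nn.functional as F

def build_hard_label_dataset(teacher, tokenizer, prompts, max_new_tokens=256):
    """Step 1: use a trained teacher to generate responses, then keep the
    (prompt, response) pairs as ordinary supervised training data."""
    records = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = teacher.generate(**inputs, max_new_tokens=max_new_tokens)
        response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        records.append({"prompt": prompt, "response": response})
    return records

def hard_label_loss(student_logits, target_ids):
    """Step 2: standard next-token cross-entropy on the teacher-generated
    text. Only the teacher's final token IDs are needed, not its logits."""
    # Shift so position i predicts token i+1, as in causal LM training.
    logits = student_logits[:, :-1, :]
    targets = target_ids[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```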

Co-distillation

Co-distillation is a training technique where the teacher and student models are trained together instead of using a fixed, previously trained teacher. In this setup, a teacher LLM and a student LLM simultaneously process the same training data and produce their own softmax probability distributions. The teacher is trained as usual on ground-truth labels, while the student learns from a combination of the teacher’s soft labels and the actual correct answers. Meta used a form of this approach while training Llama 4 Scout and Maverick alongside the larger Llama 4 Behemoth model.

One challenge with co-distillation is that the teacher model is not fully trained during the early stages, which means its predictions may initially be noisy or inaccurate. To overcome this, the student is typically trained using a combination of soft-label distillation loss and standard cross-entropy loss. This creates a more stable learning signal while still allowing knowledge transfer between models. Unlike traditional one-way distillation, co-distillation lets both models improve together during training, often resulting in better performance, stronger reasoning transfer, and smaller performance gaps between teacher and student models.
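A single co-distillation update might look like the sketch below, which reuses soft_label_loss() from the earlier example. The alpha weight, the assumption that each model returns logits directly, and the per-model optimizers are illustrative choices, not details from Meta’s training setup.

```python
import torch.nn.functional as F

def co_distillation_step(teacher, student, input_ids, labels,
                         teacher_opt, student_opt,
                         alpha=0.5, temperature=2.0):
    """One joint update: the teacher trains on ground truth while the
    student blends the teacher's soft labels with ground truth.
    Assumed shapes: logits (batch, seq, vocab); labels (batch, seq),
    already aligned with the logit positions."""
    teacher_logits = teacher(input_ids)
    student_logits = student(input_ids)

    # Teacher trains normally on the ground-truth labels.
    teacher_loss = F.cross_entropy(
        teacher_logits.flatten(0, 1), labels.flatten())
    teacher_opt.zero_grad()
    teacher_loss.backward()
    teacher_opt.step()

    # Student mixes distillation with cross-entropy so a noisy, partially
    # trained teacher cannot derail it early on. detach() stops the
    # student's loss from backpropagating into the teacher.
    kl = soft_label_loss(student_logits, teacher_logits.detach(), temperature)
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    student_loss = alpha * kl + (1.0 - alpha) * ce
    student_opt.zero_grad()
    student_loss.backward()
    student_opt.step()

    return teacher_loss.item(), student_loss.item()
```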

Comparing the three distillation techniques

Soft-label distillation conveys the richest form of knowledge because the student learns from the teacher’s full probability distribution rather than just the final answer. This helps smaller models capture patterns of reasoning, uncertainty, and relationships between tokens, often leading to stronger overall performance. However, it is computationally expensive, requires access to the teacher’s logits or weights, and becomes difficult to scale because storing probability distributions over large vocabularies consumes enormous memory.

Hard-label distillation is simpler and more practical. The student learns only from the teacher’s final output, which makes it much cheaper and easier to implement. It works especially well with proprietary black-box models such as the GPT-4 APIs, where internal probabilities are not available. While this approach loses some of the deeper “dark knowledge” found in soft labels, it is still very effective for instruction tuning, synthetic data generation, and task-specific fine-tuning.

Co-distillation takes a collaborative approach where teacher and student models learn together during training. The teacher improves while simultaneously guiding the student, allowing both models to benefit from shared learning signals. This can reduce the performance gap seen in traditional one-way distillation, but it also makes training more complex, since the teacher’s predictions are initially unstable. In practice, soft-label distillation is preferred for maximum knowledge transfer, hard-label distillation for scalability and practicality, and co-distillation for large-scale joint training settings.

