Knowledge Distillation And Model Compression For Downstream Tasks

1

DeepSpeed | Framework | 60/100

via “model compression through pruning and distillation”

Microsoft's distributed training library, featuring the ZeRO optimizer, trillion-parameter-scale training, and RLHF support.

Unique: Combines structured pruning with knowledge distillation; supports both unstructured and structured sparsity patterns with automatic fine-tuning to recover accuracy

vs others: More integrated than separate pruning/distillation tools; automatic fine-tuning reduces manual tuning effort
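
To make the distinction between unstructured and structured sparsity concrete, here is a minimal sketch in plain Python. It is not DeepSpeed's API: the function names, the magnitude criterion, and the toy weight matrix are all assumptions chosen for illustration, and the fine-tuning step that recovers accuracy is omitted.

```python
def unstructured_prune(weights, sparsity):
    """Zero individual weights with the smallest magnitudes (unstructured).

    Hypothetical helper: ties at the threshold are all zeroed, so slightly
    more than the requested fraction can be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune(weights, rows_to_drop):
    """Zero whole rows (e.g. output neurons) with the smallest L2 norm."""
    norms = [sum(w * w for w in row) ** 0.5 for row in weights]
    order = sorted(range(len(weights)), key=lambda i: norms[i])
    drop = set(order[:rows_to_drop])
    return [[0.0] * len(row) if i in drop else row[:]
            for i, row in enumerate(weights)]


# Toy 3x3 weight matrix (made-up values).
W = [[0.9, -0.1, 0.4],
     [0.05, 0.5, -0.03],
     [-0.7, 0.6, 0.02]]

sparse_unstructured = unstructured_prune(W, 0.5)  # scattered zeros
sparse_structured = structured_prune(W, 1)        # one full row zeroed
```

Unstructured pruning leaves zeros scattered through the matrix, which shrinks storage but needs sparse kernels to run faster; structured pruning removes whole rows, so the remaining matrix is simply smaller.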

2

SmolLM | Model | 59/100

Hugging Face's small model family for on-device use.

Unique: SmolLM's curated training data provides a high-quality teacher signal for distillation. Student models distilled from SmolLM are claimed to generalize better than those distilled from generic large models. Supports both response-based and feature-based distillation strategies.

vs others: Models distilled from SmolLM 1.7B outperform models distilled from Llama 2 7B at equivalent student size due to better data quality, and distilled SmolLM students are 2-3x smaller than TinyLlama while maintaining comparable performance

3

all-MiniLM-L6-v2 | Model | 58/100

via “efficient-inference-with-model-distillation”

Sentence-similarity model. 233,518,673 downloads.

Unique: Built on a 6-layer MiniLM backbone derived from a 12-layer model. MiniLM-style distillation transfers the teacher's self-attention distributions and value relations (via KL divergence), not just final embeddings. This preserves semantic structure while reducing depth, retaining quality at higher speed.

vs others: Faster inference than full BERT-base (5-10x) and smaller than full models (22.7M vs 110M params), though slower than extreme compression techniques (TinyBERT, MobileBERT) which sacrifice more quality; better quality-to-speed trade-off than quantization-only approaches
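
Attention-transfer distillation can be sketched as a divergence between the teacher's and the student's attention distributions. The example below is a simplified illustration for a single head using plain Python; the function names and the toy score matrices are assumptions, and it does not reproduce the training code of this model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_kl(teacher_scores, student_scores):
    """Mean KL(teacher || student) over query positions for one head.

    Each argument is a list of rows; row i holds the raw attention scores
    of query i over all key positions.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_scores, student_scores):
        p = softmax(t_row)
        q = softmax(s_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_scores)


# Toy scores for 2 queries over 3 keys (made-up values).
teacher = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]
student_far = [[0.1, 0.5, 2.0], [1.5, 0.3, 0.2]]

loss_same = attention_kl(teacher, teacher)     # zero: distributions match
loss_far = attention_kl(teacher, student_far)  # positive: student attends elsewhere
```

Minimizing this loss pushes the student to attend to the same positions as the teacher, which is a stronger signal than matching only the final sentence embedding.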

4

Llama 3.1 405B | Model | 57/100

via “model distillation and knowledge transfer to smaller models”

Meta's largest open-weight Llama 3.1 model, at 405B parameters.

Unique: At 405B parameters, it enables distillation at a scale previously unavailable in the open-source ecosystem: smaller models can inherit its capabilities through synthetic data generation and knowledge transfer.

vs others: Larger model scale enables higher-quality synthetic data and more effective distillation than smaller open-source models; however, inference cost for distillation is higher than proprietary distillation services

5

DeepSeek R1 | Model | 57/100

via “reasoning model distillation to smaller parameter scales”

Open-source reasoning model with reported performance comparable to OpenAI o1.

Unique: Applies distillation to reasoning models across 6 different scales (1.5B-70B), which is rare for frontier reasoning models. Most competitors only offer single-size deployment.

vs others: Provides multiple distilled sizes enabling flexible deployment, whereas o1 only offers cloud API access at fixed capability level.

6

gpt2 | Model | 56/100

via “knowledge distillation for model compression”

Text-generation model. 16,037,172 downloads.

Unique: Enables knowledge transfer from larger teacher (GPT-2) to smaller student via soft target matching, preserving linguistic knowledge while reducing parameters — complementary to quantization for extreme compression

vs others: More effective than quantization alone for large compression ratios (5-10x), but requires training vs quantization's post-hoc approach — best combined with quantization for maximum compression
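
Soft target matching is usually written as a temperature-scaled divergence between teacher and student output distributions, blended with the ordinary cross-entropy on the true label. The sketch below assumes that standard formulation; the function names, the default `temperature` and `alpha` values, and the toy logits are illustrative choices, not settings taken from any GPT-2 distillation recipe.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q)
               for p, q in zip(p_teacher, p_student) if p > 0)
    hard = -math.log(softmax(student_logits)[label])
    # T^2 keeps soft-target gradients on the same scale as the hard loss.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


# Toy next-token logits over a 3-word vocabulary (made-up values).
teacher_logits = [4.0, 1.0, 0.2]
close_student = [3.5, 1.2, 0.3]
far_student = [0.2, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher_logits, label=0)
loss_far = distillation_loss(far_student, teacher_logits, label=0)
```

Unlike post-hoc quantization, this loss has to be minimized over a training set, which is the training cost the comparison above refers to.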

7

llmcompressor | Repository | 56/100

via “distributed compression for models exceeding single-gpu memory”

Toolkit for LLM quantization, pruning, and distillation.

Unique: Implements distributed compression by partitioning models across GPUs, coordinating calibration data flow, and synchronizing quantization parameters across devices, enabling compression of models 2-3x larger than single-GPU capacity without requiring distributed training infrastructure

vs others: More practical than distributed training because it only requires calibration, not full retraining; more efficient than sequential processing because it parallelizes across GPUs; more flexible than cloud quantization services because it runs on-premises

8

roberta-base | Model | 53/100

via “efficient inference via model quantization and distillation”

Fill-mask model. 19,034,963 downloads.

Unique: RoBERTa-base's 125M parameters and 12-layer architecture make it a good compression target: distilled models retain 95%+ accuracy while achieving a 3-4x speedup, and INT8 quantization is particularly effective due to the model's learned robustness to weight perturbations from improved pretraining.

vs others: More amenable to quantization than BERT due to improved pretraining; better compression targets than larger models (RoBERTa-large) while maintaining competitive accuracy; distilled RoBERTa variants outperform DistilBERT on most benchmarks
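
The idea behind INT8 weight quantization can be shown with a minimal symmetric per-tensor scheme: each weight is mapped to an integer in [-127, 127] using one shared scale. This is a generic sketch under that assumption, with hypothetical function names and toy weights; production toolchains typically add per-channel scales and calibration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


# Toy weight vector (made-up values).
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

A model that tolerates small weight perturbations loses little accuracy here, because each weight moves by at most half a quantization step.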

9

distilbert-base-multilingual-cased | Model | 50/100

via “efficient inference with model quantization and onnx export”

Fill-mask model. 1,307,729 downloads.

Unique: Combines knowledge distillation (6-layer architecture) with ONNX export and quantization support, enabling a 4-8x inference speedup and 75% model size reduction. This is architecturally distinct because the distilled base model is already optimized for efficiency, making it an ideal candidate for further compression without catastrophic accuracy loss.

vs others: Achieves better inference efficiency than BERT-base-multilingual-cased (4-8x speedup with quantization) while maintaining comparable accuracy; TinyBERT offers more aggressive compression but with greater accuracy trade-offs and limited multilingual support.

10

distilbert-base-multilingual-cased-sentiments-student | Model | 49/100

via “efficient-inference-with-model-distillation”

Text-classification model. 663,335 downloads.

Unique: Combines DistilBERT's architectural compression (6 layers instead of 12) with knowledge distillation from a stronger DeBERTa-v3 teacher, reducing size while maintaining accuracy. Supports ONNX export for hardware-agnostic optimization, enabling deployment across CPUs, GPUs, and specialized inference accelerators.

vs others: Smaller and faster than full multilingual BERT/DeBERTa models while maintaining better accuracy than lightweight alternatives like TinyBERT, making it ideal for production systems balancing speed, accuracy, and resource constraints.

11

ai-notes | Repository | 49/100

via “small models and efficient ai tracking”

Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.

Unique: Tracks the full spectrum of model efficiency techniques (quantization, distillation, pruning, architecture search) and their impact on model capabilities, rather than treating efficiency as a single dimension

vs others: More comprehensive than individual model documentation because it covers the landscape of efficient models, but less detailed than specialized optimization frameworks

12

nllb-200-distilled-600M | Model | 48/100

via “distilled transformer inference with knowledge transfer”

Translation model. 1,309,929 downloads.

Unique: Applies knowledge distillation specifically to the M2M-100 architecture, preserving the multilingual shared embedding space while reducing parameters by 82%. Uses logit matching and intermediate layer alignment to transfer the teacher's translation knowledge, enabling competitive performance across 200 languages with a single 600M-parameter model.

vs others: Smaller than full NLLB-200 (600M vs 3.3B) with faster inference than uncompressed models, but slower and lower quality than language-specific models fine-tuned for single pairs; trade-off is worthwhile for multilingual coverage on resource-constrained devices.

13

distilroberta-base | Model | 47/100

via “knowledge-distillation-from-roberta-base”

Fill-mask model. 1,073,316 downloads.

Unique: Distilled from RoBERTa-base following the DistilBERT training procedure (a soft-target distillation loss combined with MLM and cosine embedding losses). It has 82M parameters versus 125M for the teacher, about 34% fewer, while retaining 95-98% of teacher performance, a favorable compression-accuracy trade-off compared with training smaller models from scratch.

vs others: Maintains RoBERTa's superior pretraining procedure (dynamic masking, longer training) while achieving efficiency comparable to ALBERT or MobileBERT, and outperforms BERT-base distillations due to better teacher model quality

14

nli-MiniLM2-L6-H768 | Model | 44/100

via “distilled transformer inference with reduced parameter footprint”

Zero-shot-classification model. 258,745 downloads.

Unique: Distilled from RoBERTa-Large specifically for NLI tasks using knowledge distillation, achieving a roughly 4x parameter reduction (about 82M vs 355M) while maintaining >90% of teacher model accuracy on SNLI/MultiNLI benchmarks. Most lightweight NLI alternatives either use non-distilled architectures or sacrifice accuracy more severely.

vs others: Faster CPU inference than full-size cross-encoders (RoBERTa-Large, BERT-Large) by 3-5x; more accurate than simple bi-encoder baselines on entailment tasks due to cross-encoder architecture, despite smaller size

15

mobilebert-uncased-squad-v2 | Model | 39/100

via “knowledge distillation-based model compression for transfer learning”

Question-answering model. 32,657 downloads.

Unique: MobileBERT uses a bottleneck architecture (narrow hidden states inside each block) and is trained by layer-wise knowledge transfer from a specially designed inverted-bottleneck teacher. This pairing of architecture and intermediate-layer distillation achieves stronger compression than simple pruning or quantization alone.

vs others: More effective knowledge transfer than DistilBERT (which uses only final layer distillation) due to intermediate layer matching; enables fine-tuning on custom datasets with better accuracy retention than training smaller models from scratch.

16

FlagEmbedding | Model | 37/100

via “knowledge distillation for model compression”

Retrieval and Retrieval-augmented LLMs

Unique: FlagEmbedding provides a retrieval-specific knowledge distillation framework that preserves embedding quality and ranking performance through teacher-student training with contrastive and ranking-aware losses.

vs others: Offers retrieval-optimized distillation compared to generic model compression, maintaining ranking quality while reducing model size.
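
One common ranking-aware distillation objective is margin-MSE, where the student is trained to reproduce the teacher's score gap between a relevant and a non-relevant passage rather than the absolute scores. The sketch below uses that objective as an assumed stand-in; it is not confirmed to be the loss FlagEmbedding uses, and the function name and toy scores are hypothetical.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared difference between student and teacher score margins.

    Each argument is a list with one relevance score per query:
    *_pos for the relevant passage, *_neg for the non-relevant one.
    """
    n = len(student_pos)
    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        total += ((sp - sn) - (tp - tn)) ** 2
    return total / n


# Toy scores for two queries (made-up values).
teacher_pos, teacher_neg = [9.0, 7.5], [4.0, 6.5]
student_pos, student_neg = [0.8, 0.6], [0.1, 0.5]

loss = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
```

Because only margins are compared, the student is free to use a different score scale offset from the teacher's, which suits small embedding models whose raw similarity scores differ from a large reranker's.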

17

HunyuanVideo-1.5 | Model | 35/100

via “step distillation for reduced diffusion iterations”

HunyuanVideo-1.5: A leading lightweight video generation model

Unique: Uses knowledge distillation to train a student model that predicts multi-step trajectories, rather than simple output matching. The student learns to approximate the full diffusion process in fewer steps by matching the teacher's intermediate representations, not just final outputs.

vs others: Faster than DDIM or other fast samplers because it's trained specifically for few-step generation, versus generic acceleration techniques that apply to any diffusion model.
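
The core idea of step distillation, a student that covers several teacher steps in one, can be shown on a deliberately tiny example. The sketch below is a toy in the spirit of progressive distillation and is not HunyuanVideo's method: the "teacher" is a scalar update rule, the "student" is a single learned multiplier, and it matches only the endpoint of two teacher steps rather than intermediate representations. All names and constants are assumptions.

```python
def teacher_step(x, rate=0.1):
    """One small denoising-like update: move x a fraction of the way to 0."""
    return x - rate * x


def teacher_two_steps(x):
    """The trajectory segment the student must reproduce in a single step."""
    return teacher_step(teacher_step(x))


def train_student(samples, lr=0.1, epochs=200):
    """Fit w so the one-step student x -> w * x matches two teacher steps."""
    w = 1.0
    for _ in range(epochs):
        grad = 0.0
        for x in samples:
            target = teacher_two_steps(x)
            grad += 2.0 * (w * x - target) * x  # d/dw of squared error
        w -= lr * grad / len(samples)
    return w


# Toy "noisy" inputs (made-up values).
samples = [-2.0, -1.0, 0.5, 1.5, 3.0]
w = train_student(samples)  # converges toward 0.9 * 0.9 = 0.81
```

Repeating this procedure, with each trained student becoming the next teacher, halves the number of sampling steps per round; that is why a distilled sampler can need far fewer iterations than a generic fast solver applied to the original model.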

18

NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 | Model | 25/100

via “inference-optimization-via-model-distillation-from-70b-to-49b”

Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It’s post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and...

Unique: Knowledge distillation from 70B to 49B with agentic-specific post-training preserves tool-calling and RAG performance while reducing parameters by 30%, enabling faster inference than the 70B model while avoiding the quality loss typical of generic distillation.

vs others: More efficient than running full 70B model while maintaining better reasoning than smaller models like Llama-3.1-8B, though with some capability trade-off vs full 70B

19

AionLabs: Aion-1.0-Mini | Model | 24/100

via “knowledge distillation-based reasoning compression”

Aion-1.0-Mini 32B parameter model is a distilled version of the DeepSeek-R1 model, designed for strong performance in reasoning domains such as mathematics, coding, and logic. It is a modified variant...

Unique: Applies knowledge distillation to compress DeepSeek-R1's reasoning capability into 32B parameters, enabling reasoning-based inference at lower cost and latency than full R1

vs others: More efficient than full R1 (32B vs 671B) while retaining reasoning capability, though with unknown performance trade-offs vs. non-distilled reasoning models

20

Amazon: Nova Premier 1.0 | Model | 24/100

via “knowledge distillation for custom model training”

Amazon Nova Premier is the most capable of Amazon’s multimodal models for complex reasoning tasks and for use as the best teacher for distilling custom models.

Unique: Amazon positions Nova Premier specifically as a distillation teacher with optimized output formats and intermediate representations designed for knowledge transfer, rather than as a general-purpose model that happens to support distillation as an afterthought

vs others: Designed from the ground up for distillation workflows with better cost-to-quality ratio than using GPT-4 or Claude as a teacher, making it more economical for teams building custom models at scale
