Knowledge Distillation Based Reasoning Compression

1

DeepSeek R1 | Model | 57/100

via “reasoning model distillation to smaller parameter scales”

Open-source reasoning model matching OpenAI o1.

Unique: Applies distillation to reasoning models across 6 different scales (1.5B-70B), which is rare for frontier reasoning models. Most competitors only offer single-size deployment.

vs others: Provides multiple distilled sizes, enabling flexible deployment, whereas o1 is offered only through a cloud API at a fixed capability level.

2

o4-mini | Model | 56/100

via “compact reasoning model with STEM optimization”

Latest compact reasoning model with native tool use.

Unique: Domain-specific distillation trained on curated STEM datasets rather than general reasoning; uses sparse attention and quantized embeddings to compress reasoning capability into a mini-class model, achieving 10-50x cost reduction vs. o1/o3 while maintaining domain-specific reasoning quality.

vs others: Cheaper and faster than o1/o3 for STEM workloads (estimated 5-10x cost reduction, 3-5x latency reduction) but with narrower reasoning scope; stronger than GPT-4o on math/physics but weaker on general reasoning tasks requiring cross-domain knowledge.

3

mobilebert-uncased-squad-v2 | Model | 39/100

via “knowledge distillation-based model compression for transfer learning”

Question-answering model. 32,657 downloads.

Unique: MobileBERT uses a bottleneck architecture (narrow inter-block hidden states, wider layers inside each block) and is trained with intermediate-layer distillation from an inverted-bottleneck teacher (IB-BERT), achieving stronger compression than simple pruning or quantization. The teacher is built so that its feature maps match the student's in size, which makes the design inherently distillation-friendly and enables efficient knowledge transfer.

vs others: More effective knowledge transfer than DistilBERT (which uses only final layer distillation) due to intermediate layer matching; enables fine-tuning on custom datasets with better accuracy retention than training smaller models from scratch.
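
The contrast between matching every layer and matching only the final one can be illustrated with a toy loss. This is a minimal sketch, not MobileBERT's actual training objective: the function names are hypothetical, hidden states are plain lists of floats, and student and teacher are assumed to have the same number of layers and the same hidden width.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def final_layer_loss(student_layers, teacher_layers) -> float:
    """Compare only the last hidden state (illustrative 'final-layer' signal)."""
    return mse(student_layers[-1], teacher_layers[-1])


def layerwise_distillation_loss(student_layers, teacher_layers) -> float:
    """Average the per-layer MSE over all layers (illustrative 'intermediate' signal)."""
    if len(student_layers) != len(teacher_layers) or not student_layers:
        raise ValueError("student and teacher need the same, non-zero layer count")
    per_layer = [mse(s, t) for s, t in zip(student_layers, teacher_layers)]
    return sum(per_layer) / len(per_layer)


# Toy hidden states: the student matches the teacher at the last layer only.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.0, 0.0], [0.0, 1.0]]
```

With these toy values, `final_layer_loss(student, teacher)` is 0.0 while `layerwise_distillation_loss(student, teacher)` is 0.25, so only the layer-wise loss penalizes the mismatch in the first layer.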

4

Prime Intellect: INTELLECT-3 | Model | 26/100

via “logical reasoning and formal inference”

INTELLECT-3 is a 106B-parameter Mixture-of-Experts model (12B active) post-trained from GLM-4.5-Air-Base using supervised fine-tuning (SFT) followed by large-scale reinforcement learning (RL). It offers state-of-the-art performance for its size across math,...

Unique: RL post-training optimizes for logical consistency and formal correctness in reasoning traces; uses chain-of-thought patterns that decompose inference into verifiable steps rather than end-to-end black-box reasoning

vs others: Produces more transparent and verifiable reasoning than single-step models while maintaining efficiency through MoE routing, which activates only about 12B of the model's 106B parameters per token.
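
As a rough illustration of sparse MoE routing, the sketch below implements generic top-k gating in plain Python. It is not INTELLECT-3's or GLM-4.5-Air's actual router: the function name, the gate logits, and the value of k are invented for illustration. The only figures taken from the description above are 12B active out of 106B total parameters.

```python
import math
from typing import Dict, Sequence


def top_k_route(gate_logits: Sequence[float], k: int) -> Dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Returns a mapping {expert_index: mixing_weight}; all other experts are skipped.
    """
    if not 0 < k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest gate logits.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


# Share of parameters used per token, from the figures in the description.
ACTIVE_FRACTION = 12 / 106
```

For example, `top_k_route([0.0, 2.0, 2.0, -1.0], k=2)` selects experts 1 and 2 with weight 0.5 each; the other two experts do no work for that token.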

5

Cohere: Command R7B (12-2024) | Model | 26/100

via “complex reasoning and chain-of-thought decomposition”

Command R7B (12-2024) is a small, fast update of the Command R+ model, delivered in December 2024. It excels at RAG, tool use, agents, and similar tasks requiring complex reasoning...

Unique: Command R7B's reasoning is optimized for RAG and tool-use contexts, where intermediate steps can reference retrieved documents or tool outputs, enabling grounded reasoning that combines external knowledge with logical inference

vs others: Outperforms GPT-4 on MATH and AIME benchmarks when combined with tool use for calculation, because it can delegate computation to tools rather than attempting symbolic math in-context
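
Delegating computation to a tool can be sketched as follows. This is a hypothetical calculator tool written for illustration, not Cohere's tool-use API: the function name and the set of supported operators are assumptions. The idea is that the model emits an arithmetic expression as a tool call and receives an exact result instead of computing it in-context.

```python
import ast
import operator

# Operators the hypothetical tool is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator_tool(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""

    def evaluate(node):
        # Numeric literals (bool is excluded even though it subclasses int).
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left),
                                              evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        # Names, calls, attribute access, etc. are rejected.
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)
```

A tool call such as `calculator_tool("17 * 23 + 4")` returns 395, while anything other than arithmetic, for example a function call, raises `ValueError`.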

6

AionLabs: Aion-1.0-Mini | Model | 24/100

via “knowledge distillation-based reasoning compression”

Aion-1.0-Mini 32B parameter model is a distilled version of the DeepSeek-R1 model, designed for strong performance in reasoning domains such as mathematics, coding, and logic. It is a modified variant...

Unique: Applies knowledge distillation to compress DeepSeek-R1's reasoning capability into 32B parameters, enabling reasoning-based inference at lower cost and latency than full R1

vs others: More efficient than full R1 (32B vs 671B) while retaining reasoning capability, though with unknown performance trade-offs vs. non-distilled reasoning models
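
The size comparison quoted above can be worked out directly. The sketch below uses the two parameter counts from the comparison (32B and 671B); the memory estimate additionally assumes 16-bit weights (2 bytes per parameter) and ignores activations and the KV cache, so it is a back-of-the-envelope figure only. Function and constant names are hypothetical.

```python
R1_PARAMS_B = 671    # full R1, billions of parameters (from the comparison above)
MINI_PARAMS_B = 32   # distilled model, billions of parameters


def size_ratio(large_b: float, small_b: float) -> float:
    """How many times larger the big model is, by parameter count."""
    return large_b / small_b


def weight_memory_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in decimal GB (assumes 2 bytes per parameter)."""
    return params_b * bytes_per_param


ratio = size_ratio(R1_PARAMS_B, MINI_PARAMS_B)   # about 21x
mini_gb = weight_memory_gb(MINI_PARAMS_B)        # 64 GB at 16-bit
r1_gb = weight_memory_gb(R1_PARAMS_B)            # 1342 GB at 16-bit
```

Under these assumptions the distilled model has roughly 1/21 of the parameters of full R1, which is where the cost and latency advantage comes from.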

7

DeepSeek: R1 Distill Qwen 32B | Model | 24/100

via “knowledge distillation-based reasoning transfer”

DeepSeek R1 Distill Qwen 32B is a distilled large language model based on [Qwen 2.5 32B](https://huggingface.co/Qwen/Qwen2.5-32B), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). It outperforms OpenAI's o1-mini across various benchmarks, achieving new...

Unique: Uses knowledge distillation, in the form of supervised fine-tuning on R1 outputs, to transfer R1's reasoning capability to a 32B model, enabling R1-style reasoning at roughly 1/21 of R1's 671B parameter count.

vs others: More efficient than full R1 while maintaining reasoning quality, and more transparent than black-box reasoning models like o1 through explicit reasoning traces
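
Supervised fine-tuning on teacher outputs amounts to minimizing the student's negative log-likelihood of the tokens the teacher generated. The sketch below shows that loss on toy values in plain Python. It is not DeepSeek's training code: the function names, the logits, and the token ids are illustrative assumptions.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sft_loss(student_logits: Sequence[Sequence[float]],
             teacher_token_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of teacher-generated tokens under the student.

    student_logits[t] holds the student's vocabulary logits at position t;
    teacher_token_ids[t] is the token the teacher actually emitted there.
    """
    if len(student_logits) != len(teacher_token_ids) or not teacher_token_ids:
        raise ValueError("need one logit vector per teacher token")
    total = 0.0
    for logits, token_id in zip(student_logits, teacher_token_ids):
        total -= log_softmax(logits)[token_id]
    return total / len(teacher_token_ids)
```

A student with uniform logits over a 4-token vocabulary gets a loss of ln 4 per token; the loss falls as the student assigns more probability to the teacher's tokens.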

8

DeepSeek: R1 Distill Llama 70B | Model | 24/100

via “domain-specific knowledge synthesis and explanation”

DeepSeek R1 Distill Llama 70B is a distilled large language model based on [Llama-3.3-70B-Instruct](/meta-llama/llama-3.3-70b-instruct), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). The model combines advanced distillation techniques to achieve high performance across...

Unique: Embeds R1's reasoning distillation into domain knowledge synthesis, enabling the model to not just retrieve facts but reason through their implications and connections. This produces more coherent, logically-sound explanations than fact-retrieval alone, particularly for interdisciplinary questions.

vs others: Provides reasoning-transparent domain explanations with lower latency than full R1, while offering stronger logical coherence than base Llama-3.3 due to R1 distillation.

9

huggingface.co/Meta-Llama-3-70B-Instruct | Model | 23/100

via “reasoning and chain-of-thought problem decomposition”

[GitHub](https://github.com/meta-llama/llama3). Free.

Unique: Instruction-tuned specifically on reasoning-focused datasets with explicit step-by-step annotations, enabling the model to naturally generate transparent reasoning traces without requiring special prompting techniques. The 70B parameter scale allows for nuanced reasoning across diverse domains while maintaining interpretability of intermediate steps.

vs others: More transparent and auditable reasoning than models optimized purely for answer accuracy, with reasoning traces that can be validated and debugged by domain experts, though less specialized than dedicated symbolic reasoning systems or theorem provers.
