LLM Quantization Library

1

SmolLM · Model · 58/100

via “quantized-model-inference-optimization”

Hugging Face's small model family for on-device use.

Unique: Ships pre-quantized, tested int8 and int4 variants, so developers can pick a precision that matches their hardware constraints. Quantization is applied post-training, so no retraining is required and the same model can be deployed quickly across device tiers.

vs others: Pre-quantized variants eliminate need for custom quantization pipelines; int4 quantization enables deployment on devices where even 360M fp32 models don't fit; more practical than full-precision models for true mobile deployment
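As a rough illustration of what post-training quantization to int8 or int4 involves, the sketch below applies symmetric per-tensor quantization to a handful of weights and estimates the weight storage for a 360M-parameter model. It is a minimal sketch, not SmolLM's actual pipeline; the function names and sample weights are made up for the example.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


def weight_bytes(num_params, bits):
    """Storage for the weights alone, ignoring scales and other metadata."""
    return num_params * bits // 8


# Hypothetical weights, for illustration only.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]

q8, s8 = quantize_symmetric(weights, 8)
q4, s4 = quantize_symmetric(weights, 4)

err8 = max(abs(a - b) for a, b in zip(weights, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(weights, dequantize(q4, s4)))

for bits in (32, 8, 4):
    print(f"{bits:>2}-bit weights for 360M params: "
          f"{weight_bytes(360_000_000, bits) / 1e9:.2f} GB")
print(f"max round-trip error: int8={err8:.4f}, int4={err4:.4f}")
```

The trade-off is visible directly: each halving of the bit-width halves the storage, while the round-trip error grows because fewer integer levels are available.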

2

Llamafile · CLI Tool · 57/100

via “quantization format conversion and model optimization”

Single-file executable LLMs — bundle model + inference, runs on any OS with zero install.

Unique: Supports importance matrix (imatrix) calculation for selective quantization, allowing different layers to use different bit-widths based on sensitivity, versus uniform quantization across all layers

vs others: More flexible quantization than fixed bit-width approaches because imatrix-guided quantization preserves quality in sensitive layers while aggressively quantizing less important layers
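The idea of giving sensitive layers more bits than less important ones can be sketched as a simple ranking step. The example below is a toy illustration under stated assumptions: the layer names, sensitivity scores, and the `assign_bit_widths` helper are hypothetical, and in practice the scores would come from a calibration run rather than being typed in by hand. It is not llamafile's implementation.

```python
def assign_bit_widths(sensitivity, high_bits=6, low_bits=3, high_fraction=0.25):
    """Give the most sensitive fraction of layers high_bits and the rest low_bits."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = max(1, round(len(ranked) * high_fraction))
    high = set(ranked[:n_high])
    return {name: (high_bits if name in high else low_bits) for name in sensitivity}


def average_bits(bit_widths, param_counts):
    """Parameter-weighted average bit-width across all layers."""
    total = sum(param_counts.values())
    return sum(bit_widths[name] * param_counts[name] for name in bit_widths) / total


# Hypothetical per-layer sensitivity scores (higher = more quality loss when quantized).
sensitivity = {"attn.q": 0.9, "attn.k": 0.2, "mlp.up": 0.4, "mlp.down": 0.1}
# Hypothetical parameter counts, equal here to keep the arithmetic easy to follow.
param_counts = {name: 100 for name in sensitivity}

plan = assign_bit_widths(sensitivity)
print(plan)
print("average bits per weight:", average_bits(plan, param_counts))
```

With these made-up numbers, only the most sensitive layer keeps 6 bits, so the average stays close to the aggressive 3-bit setting while the layer most likely to hurt quality is protected.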

3

MLX · Framework · 57/100

via “quantization-with-multiple-modes-and-backends”

Apple's ML framework for Apple Silicon — NumPy-like API, unified memory, LLM support.

Unique: Implements quantization with multiple modes (int4, int8, float16) and backend-specific optimizations for Metal and CUDA. Quantized operations handle dequantization transparently, enabling seamless integration with existing code.

vs others: More flexible than PyTorch's quantization because it supports multiple modes and backends; more integrated than external quantization tools because it's built into the framework.
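To make "quantized operations handle dequantization transparently" concrete, here is a small sketch of group-wise affine quantization, where each group of weights stores integer codes plus its own scale and bias, and the dot product dequantizes internally. This is plain Python for illustration and does not call MLX; the `QuantizedVector` class and its parameters are assumptions made for the example.

```python
def quantize_group(group, bits):
    """Affine-quantize one group: code = round((w - bias) / scale)."""
    levels = 2 ** bits - 1                   # 15 for 4-bit codes
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in group]
    return codes, scale, lo                  # lo acts as the per-group bias


class QuantizedVector:
    """Holds group-wise quantized weights and hides dequantization from callers."""

    def __init__(self, weights, group_size=4, bits=4):
        self.groups = [
            quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)
        ]

    def dequantize(self):
        out = []
        for codes, scale, bias in self.groups:
            out.extend(code * scale + bias for code in codes)
        return out

    def dot(self, x):
        # Callers pass ordinary floats; the quantized storage is an internal detail.
        return sum(w * xi for w, xi in zip(self.dequantize(), x))


# Hypothetical weights with two groups that cover very different ranges.
weights = [0.10, -0.20, 0.05, 0.40, 1.50, 1.20, 1.90, 1.70]
qv = QuantizedVector(weights, group_size=4, bits=4)

restored = qv.dequantize()
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("max reconstruction error:", max_error)
print("dot with ones:", qv.dot([1.0] * len(weights)))
```

Using a separate scale and bias per group keeps the error small even when different parts of the tensor span different ranges, which is the main reason group-wise schemes are preferred over a single per-tensor scale at 4 bits.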

4

SGLang · Framework · 57/100

via “quantization with fp8, fp4, int8, and modelopt support”

Fast LLM/VLM serving — RadixAttention, prefix caching, structured output, automatic parallelism.

Unique: Provides a quantization registry that maps quantization types to optimized kernel implementations, with automatic fallback to slower kernels on unsupported hardware. Supports per-layer and per-channel quantization strategies with integrated calibration.

vs others: Supports FP8, FP4, INT8, and MXFP4 with optimized kernels for each scheme and automatic hardware-aware fallbacks. Note that vLLM is not limited to INT8; it also supports schemes such as FP8, AWQ, and GPTQ.
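A registry that maps quantization schemes to kernels, with a fallback when the hardware lacks the required feature, might look roughly like the sketch below. All names here (`QuantKernelRegistry`, `KernelEntry`, the feature strings) are hypothetical and chosen for illustration; this is not SGLang's code or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


def reference_dot(w: List[float], x: List[float]) -> float:
    """Slow but portable kernel used when no optimized one applies."""
    return sum(a * b for a, b in zip(w, x))


@dataclass(frozen=True)
class KernelEntry:
    name: str
    required_feature: str                    # hardware feature this kernel needs
    fn: Callable[[List[float], List[float]], float]


class QuantKernelRegistry:
    def __init__(self, fallback: KernelEntry):
        self._entries: Dict[str, List[KernelEntry]] = {}
        self._fallback = fallback

    def register(self, scheme: str, entry: KernelEntry) -> None:
        self._entries.setdefault(scheme, []).append(entry)

    def resolve(self, scheme: str, hardware_features: Set[str]) -> KernelEntry:
        """Return the first registered kernel the hardware supports, else the fallback."""
        if scheme not in self._entries:
            raise KeyError(f"unknown quantization scheme: {scheme}")
        for entry in self._entries[scheme]:
            if entry.required_feature in hardware_features:
                return entry
        return self._fallback


fallback = KernelEntry("reference", "none", reference_dot)
registry = QuantKernelRegistry(fallback)
# The "fast" kernel reuses reference_dot as a stand-in for a real optimized kernel.
registry.register("fp8", KernelEntry("fp8_fast", "fp8_tensor_cores", reference_dot))

print(registry.resolve("fp8", {"fp8_tensor_cores"}).name)  # optimized path
print(registry.resolve("fp8", set()).name)                  # falls back
```

Keeping the lookup separate from the kernels means new schemes can be added by registration alone, and unsupported hardware still gets a correct, if slower, result.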

5

Outlines · Framework · 57/100

via “quantized model support with llama.cpp integration”

Structured text generation — guarantees LLM outputs match JSON schemas or grammars.

Unique: Integrates token masking directly into llama.cpp's C++ inference loop, enabling efficient constrained generation on quantized models with minimal Python overhead.

vs others: Enables constrained generation on edge devices and low-resource environments where cloud APIs or full-precision models are impractical; reduces latency and cost for on-device inference.
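Token masking for constrained generation can be illustrated with a toy vocabulary: at each step, every token the grammar does not allow has its logit set to negative infinity before sampling. The vocabulary, the `allowed_next` transition table, and the fake logits below are all invented for this sketch, which is independent of Outlines and llama.cpp.

```python
import math


def mask_logits(logits, allowed_token_ids):
    """Keep logits for allowed tokens; set all others to -inf so they cannot be picked."""
    return [
        logit if i in allowed_token_ids else -math.inf
        for i, logit in enumerate(logits)
    ]


def greedy_pick(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary and a toy "grammar": which token ids may follow the previous one.
vocab = ["{", "}", '"name"', ":", '"Ada"', "hello"]
allowed_next = {None: {0}, 0: {2}, 2: {3}, 3: {4}, 4: {1}}


def constrained_decode(logits_per_step):
    previous = None
    output = []
    for logits in logits_per_step:
        masked = mask_logits(logits, allowed_next[previous])
        previous = greedy_pick(masked)
        output.append(vocab[previous])
    return output


# The unconstrained model strongly prefers "hello" (index 5) at every step.
fake_logits = [[0.1, 0.2, 0.3, 0.1, 0.0, 5.0]] * 5
print("".join(constrained_decode(fake_logits)))
```

Even though the fake model always favors an invalid token, the mask forces the output to follow the allowed transitions. Because the mask only touches logits, it works the same way whether the underlying weights are quantized or full precision.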

6

TensorRT-LLM · Framework · 57/100

via “multi-precision quantization with fp8, int4, awq, and gptq support”

NVIDIA's LLM inference optimizer — quantization, kernel fusion, maximum GPU performance.

Unique: Implements a unified quantization abstraction layer (QuantMethod interface) with pluggable backends for FP8, INT4, AWQ, and GPTQ, allowing per-layer quantization strategy selection during model compilation. Integrates directly with TensorRT's kernel fusion pipeline to eliminate quantization overhead in fused operations.

vs others: Tighter integration with TensorRT kernels than vLLM or llama.cpp, eliminating separate dequantization passes and enabling fused quantized operations that reduce memory bandwidth by 40-60% vs post-hoc quantization approaches.

7

AutoGPTQ · Repository · 55/100

GPTQ-based LLM quantization with fast CUDA inference.

Unique: Provides easy-to-use APIs for GPTQ quantization at several bit precisions, optimized for different hardware configurations.

vs others: Compared to other quantization libraries, AutoGPTQ offers a more user-friendly interface and supports a wider range of model architectures.

8

llama.cpp · Repository · 55/100

via “c/c++ library for llm inference”

C/C++ LLM inference — GGUF quantization, GPU offloading, foundation for local AI tools.

Unique: Provides a dependency-free C/C++ implementation of LLM inference, which makes it broadly portable across platforms.

vs others: Unlike other LLM frameworks, llama.cpp offers a lightweight, dependency-free approach that supports multiple GPU platforms and quantization formats.

9

exllamav2 · Repository · 24/100

via “gpu-accelerated llm inference with 4-bit quantization”

Python AI package: exllamav2

Unique: Custom CUDA kernel implementation with fused attention and 4-bit dequantization in-flight, avoiding intermediate tensor materialization — achieves 2-3x throughput vs llama.cpp on equivalent hardware by eliminating CPU-GPU sync points

vs others: Faster token generation than llama.cpp and vLLM for single-GPU setups due to hand-optimized kernels; lower memory footprint than HuggingFace transformers through aggressive quantization and KV cache optimization
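The phrase "4-bit dequantization in-flight" refers to unpacking and rescaling weights inside the compute loop instead of first building a full-precision copy of the tensor. The sketch below shows the storage side (two 4-bit codes per byte) and a dot product that dequantizes each code as it is consumed. It is plain Python meant to show the idea only; exllamav2's real kernels are CUDA, and the function names and the scale and zero-point values here are assumptions.

```python
def pack_int4(codes):
    """Pack 4-bit codes (0..15) two per byte, low nibble first."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    if len(codes) % 2:
        codes = codes + [0]                  # pad to an even count
    return bytes((hi << 4) | lo for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed, count):
    """Inverse of pack_int4; count drops any padding nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]


def dot_dequant_in_flight(packed, count, scale, zero_point, x):
    """Accumulate a dot product, dequantizing each nibble as it is read."""
    acc = 0.0
    for i, byte in enumerate(packed):
        for j, code in enumerate((byte & 0x0F, byte >> 4)):
            idx = 2 * i + j
            if idx < count:
                acc += (code - zero_point) * scale * x[idx]
    return acc


codes = [0, 15, 7, 8, 3]
packed = pack_int4(codes)
x = [1.0, 0.5, -1.0, 2.0, 0.25]
print(packed.hex(), unpack_int4(packed, len(codes)))
print(dot_dequant_in_flight(packed, len(codes), scale=0.1, zero_point=8, x=x))
```

No intermediate list of dequantized weights is ever built in `dot_dequant_in_flight`, which mirrors the memory-traffic saving that fused GPU kernels aim for.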

10

Llama 3 (8B, 70B) · Model · 24/100

via “quantization-transparent model distribution via ollama”

Meta's Llama 3 — foundational LLM for instruction-following

Unique: Ollama abstracts quantization format selection and hardware-aware optimization into the runtime, eliminating the need for users to manually download GGUF files, select quantization levels, or manage multiple model variants

vs others: Simpler than Hugging Face model downloads where users must manually select quantization variants, though less transparent than vLLM where quantization choices are explicit and documented

11

Jan · Repository · 23/100

via “model quantization and optimization”

Run LLMs like Mistral or Llama2 locally and offline on your computer, or connect to remote AI APIs. [#opensource](https://github.com/janhq/jan)

Unique: Automatically adjusts optimization techniques based on the user's hardware, providing tailored performance improvements.

vs others: More adaptive than static optimization tools, as it dynamically adjusts to the user's specific hardware capabilities.

12

llama-cpp-python · Repository · 22/100

via “cpu-optimized llm inference with quantized model loading”

Python bindings for the llama.cpp library

Unique: Direct Python FFI bindings to llama.cpp's hand-optimized C++ inference engine with native support for GGUF quantization formats, avoiding the overhead of subprocess calls or REST APIs while exposing fine-grained control over sampling parameters, context window, and memory allocation

vs others: Faster and more memory-efficient than pure-Python implementations (Hugging Face Transformers) for quantized models, and lower latency than cloud API calls while maintaining full local control and privacy
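A minimal usage sketch for loading a GGUF model and controlling context size and sampling through the bindings is shown below. It assumes the `llama-cpp-python` package is installed and that a GGUF file exists at the path passed in; the wrapper function name and the default values are illustrative choices, not recommendations from the project.

```python
def run_quantized_completion(model_path, prompt, n_ctx=2048, n_threads=4):
    """Load a GGUF model with llama-cpp-python and return one completion string."""
    # Imported lazily so this file can be imported without the package installed.
    from llama_cpp import Llama

    llm = Llama(
        model_path=model_path,   # path to a quantized GGUF file (caller-supplied)
        n_ctx=n_ctx,             # context window size in tokens
        n_threads=n_threads,     # CPU threads used for inference
        n_gpu_layers=0,          # keep every layer on the CPU
        verbose=False,
    )
    result = llm.create_completion(
        prompt,
        max_tokens=64,
        temperature=0.7,
        top_p=0.9,
        stop=["\n\n"],
    )
    return result["choices"][0]["text"]
```

Calling `run_quantized_completion("path/to/model.gguf", "Q: What is quantization? A:")` runs inference in-process, with no subprocess or HTTP round trip involved.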

13

QLoRA: Efficient Finetuning of Quantized LLMs (QLoRA) · Product · 22/100

via “double quantization of quantization constants for nested compression”

Unique: Introduces nested (double) quantization, in which the quantization constants are themselves quantized to 8-bit precision with their own second-level scales. This cuts the per-parameter overhead of the constants by roughly 4x (about 0.5 bits down to about 0.127 bits). Prior quantization work treated these constants as full-precision metadata that was not compressed further.

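vs others: Saves about 0.37 bits per parameter on top of single-level 4-bit quantization, roughly 3 GB for a 65B model. Together with 4-bit NormalFloat and paged optimizers, this is part of what lets QLoRA finetune a 65B model on a single 48 GB GPU. A 70B model does not fit in 24 GB at 4 bits, since the weights alone take about 35 GB.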
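The saving from double quantization can be checked with a few lines of arithmetic. The sketch below uses the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 constants per second-level block); the function names are made up for the example.

```python
def single_level_overhead_bits(block_size=64, constant_bits=32):
    """Bits per parameter spent on one full-precision constant per block."""
    return constant_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256):
    """Bits per parameter when the constants are themselves quantized to 8 bits."""
    first_level = 8 / block_size                         # 8-bit constant per block
    second_level = 32 / (block_size * second_block_size)  # fp32 scale per 256 constants
    return first_level + second_level


single = single_level_overhead_bits()    # 0.5 bits per parameter
double = double_quant_overhead_bits()    # 0.126953125 bits per parameter
saved_bits = single - double             # 0.373046875 bits per parameter

saved_gb_65b = saved_bits * 65e9 / 8 / 1e9
print(f"overhead: {single:.3f} -> {double:.3f} bits/param")
print(f"saved for a 65B model: {saved_gb_65b:.2f} GB")
```

The overhead drops from 0.5 to about 0.127 bits per parameter, which works out to roughly 3 GB for a 65B-parameter model.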

14

Llama 2 · Model · 20/100

via “efficient inference with quantization and optimization”

The next generation of Meta's open source large language model. #opensource

15

LM Studio · Product

via “automatic-model-quantization”
