Inference on CPU with Reduced Precision

1

ChatGLM-4 · Model · 57/100

via “cpu-based inference with reduced precision”

Tsinghua's bilingual dialogue model.

Unique: Supports CPU inference through INT8 quantization and memory-mapped file loading without requiring GPU-specific optimizations, enabling deployment on any machine with sufficient RAM

vs others: More accessible than GPU-only models for developers without GPU hardware; INT8 quantization reduces the memory requirement to about 8 GB, making the model feasible on modest laptops, though inference is significantly slower
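The memory figure above follows from simple arithmetic on bytes per weight. A minimal sketch, assuming a hypothetical parameter count (the listing does not state one) and counting weights only, not activations or runtime overhead:

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical parameter count, chosen only for illustration.
ASSUMED_PARAMS = 9_000_000_000

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

Under this assumption, INT8 weights take a quarter of the FP32 footprint, which is what moves a model of this size into laptop RAM.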

2

Qwen2.5-3B-Instruct · Model · 54/100

via “efficient inference on consumer hardware with cpu fallback”

Text-generation model. 9,207,977 downloads.

Unique: Combines grouped-query attention (which reduces KV cache size) with quantization support and CPU-optimized inference frameworks (llama.cpp, ONNX Runtime) to enable practical inference on consumer CPUs, a design that prioritizes accessibility over peak performance

vs others: More practical on CPU than Llama 2 7B due to smaller parameter count; less capable than cloud-based APIs but enables offline operation and data privacy
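The KV cache saving from grouped-query attention can be estimated directly, since the cache grows with the number of key/value heads rather than query heads. A rough sketch follows; the layer count, head counts, head dimension, and sequence length are assumed values for illustration, not a configuration confirmed by this listing:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative configuration.
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 36, 16, 2, 128, 4096

# Standard multi-head attention keeps one K/V head per query head.
mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped-query attention shares each K/V head across several query heads.
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha / 2**20:.0f} MiB")
print(f"GQA cache: {gqa / 2**20:.0f} MiB ({mha // gqa}x smaller)")
```

With these numbers the reduction factor equals the ratio of query heads to key/value heads, which matters most on machines where RAM is shared between weights and cache.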

3

mask2former-swin-large-cityscapes-semantic · Model · 46/100

Image-segmentation model. 155,904 downloads.

Unique: Supports standard PyTorch quantization APIs without model-specific modifications, enabling straightforward CPU deployment, though deformable attention operations may not be optimized for CPU execution

vs others: Enables CPU deployment without retraining, though a 10-20x latency penalty relative to GPU deployment makes it unsuitable for latency-critical applications

4

stable-diffusion-webui-docker · Repository · 45/100

via “cpu-only stable diffusion inference with precision downsampling”

Easy Docker setup for Stable Diffusion with user-friendly UI

Unique: Explicitly disables half-precision inference (--no-half) and forces full precision (--precision full) in the container entrypoint, a deliberate architectural choice to maximize CPU numerical stability. Shares identical volume mounts and Gradio UI with GPU variant, enabling seamless fallback without code changes.

vs others: More accessible than GPU-only solutions for developers without GPU hardware, but 50x slower than GPU inference and 10x slower than optimized CPU libraries such as ONNX Runtime with quantization
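The numerical-stability argument for forcing full precision can be illustrated with a small experiment. This is only a sketch of IEEE 754 half-precision rounding using the Python standard library; it does not reproduce anything from the container's code, and the helper name to_half is hypothetical:

```python
import struct


def to_half(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (binary16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Half precision has an 11-bit significand, so not every integer above 2048 fits.
print(to_half(2048.0), to_half(2049.0))

# Common decimal fractions pick up a visible rounding error.
print(to_half(0.1), abs(to_half(0.1) - 0.1))

# Values beyond the binary16 range cannot be packed at all.
try:
    to_half(70000.0)
except OverflowError as exc:
    print("overflow:", exc)
```

Errors of this size can accumulate over many denoising steps, which is one plausible reason to accept the speed cost of full precision on CPU.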

5

yolos-tiny · Model · 40/100

via “inference on cpu with quantization support for resource-constrained environments”

Object-detection model. 83,525 downloads.

Unique: Supports both FP32 CPU inference (standard PyTorch) and INT8 quantization via torch.quantization, enabling flexible accuracy-latency tradeoffs; tiny model variant is optimized for CPU memory footprint

vs others: Simpler quantization workflow than TensorFlow Lite (no custom conversion), but slower CPU inference than ONNX Runtime with optimized CPU providers
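The accuracy-latency tradeoff mentioned above comes from mapping floating-point weights onto a small integer range. The following is a simplified, dependency-free sketch of symmetric per-tensor INT8 quantization; it is not the implementation used by torch.quantization, and the sample weights are made up for illustration:

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of a non-empty list of floats.

    Returns integers in [-127, 127] plus the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map INT8 values back to approximate floats."""
    return [q * scale for q in quantized]


# Made-up sample weights.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, and the round-trip error stays within about half of one quantization step, which is the accuracy cost paid for the smaller footprint.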
