ONNX Quantized Model Inference for Edge and Browser Deployment

1

ONNX Runtime Mobile (Framework, 58/100)

via “onnx model inference engine for mobile and edge devices”

Cross-platform ONNX inference for mobile devices.

Unique: Optimized for mobile and edge devices, enabling efficient inference with various execution providers.

vs others: Offers a unique focus on mobile optimization compared to other general-purpose inference engines.
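As a rough illustration of how execution-provider selection might look from Python, the sketch below filters a preferred provider list against whatever the installed runtime reports. The helper name pick_providers, the preference order, and the file name model.onnx are assumptions made for illustration; get_available_providers and InferenceSession belong to the onnxruntime Python package, and which providers actually appear depends on the build.

```python
from typing import Iterable, List

# Provider names as exposed by ONNX Runtime; availability depends on the build.
MOBILE_PREFERENCE = [
    "CoreMLExecutionProvider",   # Apple devices
    "NnapiExecutionProvider",    # Android
    "XnnpackExecutionProvider",  # optimized CPU kernels
    "CPUExecutionProvider",      # default fallback
]


def pick_providers(preferred: Iterable[str], available: Iterable[str]) -> List[str]:
    """Keep preferred providers that are available, preserving preference order."""
    available_set = set(available)
    chosen = [p for p in preferred if p in available_set]
    # Always end with the CPU provider so a session can still be created.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


if __name__ == "__main__":
    try:
        import onnxruntime as ort  # optional dependency for this sketch
    except ImportError:
        ort = None
    if ort is not None:
        providers = pick_providers(MOBILE_PREFERENCE, ort.get_available_providers())
        print("Would create the session with:", providers)
        # Hypothetical model path, left commented out so the sketch runs anywhere:
        # session = ort.InferenceSession("model.onnx", providers=providers)
```

Keeping the selection logic in a pure function makes it easy to unit test without a device or a model file.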

2

Phi-3.5 Mini (Model, 58/100)

via “edge device and mobile deployment with onnx and gguf formats”

Microsoft's 3.8B model with 128K context for edge deployment.

Unique: Provides pre-optimized ONNX and GGUF formats specifically for cross-platform edge deployment, eliminating custom conversion and quantization work while supporting iOS, Android, and browser targets simultaneously from a single model artifact

vs others: Broader deployment target coverage than Llama 2 (primarily GGUF) or Mistral (primarily ONNX), with official support for mobile platforms and browsers enabling true offline-first applications without cloud fallback

3

LLM Guard (Framework, 57/100)

via “onnx model optimization for low-latency and resource-constrained deployment”

Open-source LLM input/output security scanner toolkit.

Unique: Provides configuration-driven ONNX optimization with quantization support (int8, float16) enabling 2-10x latency reduction; supports switching between full-precision and optimized models via configuration without code changes; enables deployment on CPU-only and edge devices where GPU acceleration is unavailable

vs others: Faster inference than PyTorch models because ONNX Runtime is optimized for inference; more flexible than fixed-optimization approaches because quantization level is configurable; enables deployment scenarios (edge, serverless, CPU-only) that would be infeasible with full-precision models
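The idea of switching between full-precision and optimized models through configuration alone can be sketched as follows. This is not LLM Guard's actual API: the ModelConfig fields, the JSON keys, and the file names in MODEL_FILES are all hypothetical and serve only to show the pattern.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    use_onnx: bool = False
    precision: str = "float32"  # "float32", "float16", or "int8"


# Hypothetical artifact names; a real project would point at its own files.
MODEL_FILES = {
    (False, "float32"): "model.safetensors",
    (True, "float32"): "model.onnx",
    (True, "float16"): "model_fp16.onnx",
    (True, "int8"): "model_int8.onnx",
}


def load_config(text: str) -> ModelConfig:
    """Parse a JSON config string, falling back to full precision."""
    raw = json.loads(text)
    return ModelConfig(
        use_onnx=bool(raw.get("use_onnx", False)),
        precision=str(raw.get("precision", "float32")),
    )


def resolve_model_file(cfg: ModelConfig) -> str:
    """Pick the artifact for a config; reject unsupported combinations."""
    try:
        return MODEL_FILES[(cfg.use_onnx, cfg.precision)]
    except KeyError:
        raise ValueError(f"unsupported combination: {cfg}") from None


if __name__ == "__main__":
    cfg = load_config('{"use_onnx": true, "precision": "int8"}')
    print(resolve_model_file(cfg))
```

Application code calls resolve_model_file and never branches on precision itself, so changing the optimization level is a config edit rather than a code change.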

4

Qwen3-4B-Instruct-2507 (Model, 55/100)

via “efficient inference on edge devices through quantization and model optimization”

text-generation model. 10,691,206 downloads.

Unique: At 4B parameters, Qwen3-4B is small enough to target edge deployment; supports multiple quantization formats (GPTQ, AWQ, GGUF), giving flexibility across deployment targets; grouped query attention reduces KV cache size by 4-8x compared to standard multi-head attention

vs others: Smaller than 7B-class Llama models, so the quantized footprint is lower; better quality than TinyLlama at similar quantized size; requires less custom optimization than Phi-2 due to more mature quantization ecosystem
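The KV cache saving from grouped query attention follows from simple arithmetic: the cache scales with the number of key/value heads, so sharing them across query heads shrinks it by the ratio of query heads to KV heads. The sketch below works this through; the layer count, head counts, head dimension, and sequence length are illustrative assumptions, not figures from this listing.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative configuration (assumed for this example).
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 36, 32, 8, 128
SEQ_LEN = 8192

# Standard multi-head attention keeps one KV head per query head.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped query attention shares each KV head across several query heads.
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha / 2**20:.0f} MiB")
print(f"GQA cache: {gqa / 2**20:.0f} MiB")
print(f"Reduction: {mha / gqa:.0f}x")
```

With these assumed values the reduction equals QUERY_HEADS / KV_HEADS; a model with fewer KV heads relative to query heads would land higher in the quoted range.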

5

bert-base-uncased (Model, 55/100)

via “model quantization and compression for edge deployment”

fill-mask model. 59,218,905 downloads.

Unique: Post-training quantization via ONNX Runtime or PyTorch quantization APIs requires no retraining while achieving 4x model size reduction; supports multiple quantization schemes (symmetric, asymmetric, per-channel) for fine-grained accuracy-efficiency control

vs others: Simpler than quantization-aware training (no retraining required) and more portable than framework-specific quantization due to ONNX support
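To make the difference between the symmetric, asymmetric, and per-channel schemes concrete, here is a minimal pure-Python sketch of 8-bit post-training quantization. It is a teaching aid under simplifying assumptions (min/max ranges, plain lists instead of tensors), not the implementation used by ONNX Runtime or PyTorch; all function names are invented for this example.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to int8 in [-127, 127] with the zero point fixed at 0."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def quantize_asymmetric(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats to uint8 in [0, 255] using a scale and a zero point."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must contain 0
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def quantize_per_channel(rows: List[List[float]]) -> List[Tuple[List[int], float]]:
    """Per-channel: each row (output channel) gets its own symmetric scale."""
    return [quantize_symmetric(row) for row in rows]


def dequantize(q: List[int], scale: float, zero_point: int = 0) -> List[float]:
    """Recover approximate floats from quantized integers."""
    return [(x - zero_point) * scale for x in q]


if __name__ == "__main__":
    weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
    q_sym, s_sym = quantize_symmetric(weights)
    q_asym, s_asym, zp = quantize_asymmetric(weights)
    print("symmetric :", q_sym, "scale", round(s_sym, 5))
    print("asymmetric:", q_asym, "scale", round(s_asym, 5), "zero point", zp)
```

Asymmetric quantization uses the full 8-bit range when values are skewed to one side, while per-channel scales help when different rows of a weight matrix have very different magnitudes.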

6

bge-m3 (Model, 54/100)

via “onnx model export for edge and serverless deployment”

sentence-similarity model. 20,474,507 downloads.

Unique: Pre-optimized ONNX export with native quantization support and operator fusion for CPU inference, reducing deployment complexity compared to manual PyTorch-to-ONNX conversion while maintaining embedding quality through careful quantization calibration

vs others: Simpler than custom ONNX conversion pipelines and includes pre-tuned quantization profiles, whereas generic PyTorch-to-ONNX export requires manual optimization; reduces cold-start latency by 60-80% vs PyTorch Lambda deployments

7

xlm-roberta-base (Model, 54/100)

via “onnx model export and optimized inference”

fill-mask model. 18,165,674 downloads.

Unique: Provides native ONNX export support via HuggingFace Transformers, enabling single-command conversion to hardware-agnostic format with built-in optimization profiles for CPU, GPU, and mobile inference — unlike manual ONNX conversion which requires deep knowledge of ONNX IR and operator semantics

vs others: Reduces deployment complexity and inference latency compared to PyTorch/TensorFlow serving by eliminating framework dependencies and enabling aggressive quantization/pruning, while maintaining model accuracy through ONNX Runtime's operator fusion and memory optimization

8

bge-reranker-v2-m3 (Model, 53/100)

via “quantization-and-model-compression-for-edge-deployment”

text-classification model. 9,881,128 downloads.

Unique: Built on the bge-m3 (XLM-RoBERTa) backbone with roughly 568M parameters, it is far smaller than LLM-based rerankers, which keeps quantized artifacts compact; safetensors format enables efficient ONNX conversion with minimal overhead vs .bin format

vs others: Smaller model (roughly 568M parameters) yields a smaller quantized footprint than multi-billion-parameter LLM-based rerankers; ONNX support enables cross-platform deployment (CPU, mobile, edge) vs PyTorch-only models

9

GLM-OCR (Model, 53/100)

via “model quantization and efficient inference deployment”

image-to-text model. 8,358,592 downloads.

Unique: Implements quantization-aware training with document-specific calibration, achieving 3-4x speedup and 3.5x model size reduction while maintaining 98-99% accuracy compared to full-precision baseline

vs others: More practical than knowledge distillation for deployment because it preserves the original model architecture, while being more efficient than full-precision inference for resource-constrained environments

10

bge-base-en-v1.5 (Model, 53/100)

via “onnx-export-and-cpu-inference”

feature-extraction model. 8,155,394 downloads.

Unique: BGE-base-en-v1.5 provides official ONNX exports with optimized graph structure for inference runtimes, enabling sub-100ms CPU inference on modern processors and enabling deployment on edge devices without PyTorch or GPU requirements

vs others: Faster CPU inference than PyTorch eager execution and more portable than TorchScript for cross-platform deployment; enables embedding generation on edge devices where PyTorch is too heavy
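Latency figures such as "sub-100ms on CPU" are only meaningful when measured on the target hardware. A minimal timing harness might look like the sketch below; the benchmark function and the fake_inference workload are stand-ins invented for this example, and a real measurement would wrap the actual session call instead.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object], warmup: int = 5,
              iters: int = 50) -> Dict[str, float]:
    """Time repeated calls and report median, 95th percentile, and max in ms."""
    for _ in range(warmup):  # warm caches and lazy initialization first
        run_once()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def fake_inference() -> int:
    """Stand-in workload; replace with the real model call when measuring."""
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    stats = benchmark(fake_inference)
    print({name: round(value, 3) for name, value in stats.items()})
```

Reporting a tail percentile alongside the median matters on edge devices, where thermal throttling and background work can make worst-case latency much higher than the typical case.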

11

multi-qa-mpnet-base-dot-v1 (Model, 52/100)

via “onnx-and-openvino-export-for-edge-deployment”

sentence-similarity model. 2,530,482 downloads.

Unique: Provides native ONNX and OpenVINO export support with quantization-friendly architecture (no custom ops). Enables deployment on edge devices and CPU-only infrastructure with minimal code changes, supporting both float32 and int8 quantized inference.

vs others: Faster edge deployment than PyTorch models because ONNX Runtime and OpenVINO use optimized inference engines with hardware-specific optimizations, and quantization support reduces model size by 4x and latency by 2-3x compared to full-precision models.

12

table-transformer-detection (Model, 52/100)

via “onnx model export for edge deployment and inference optimization”

object-detection model. 3,394,499 downloads.

Unique: Provides transformer-aware ONNX export that preserves attention mechanism semantics while enabling quantization-friendly operator fusion. The export pipeline includes automatic calibration for INT8 quantization using representative document images, reducing manual tuning overhead.

vs others: More portable than TensorFlow Lite or CoreML because ONNX Runtime runs on Windows, Linux, macOS, iOS, and Android with identical inference results; achieves better accuracy-latency tradeoffs than naive INT8 quantization due to transformer-specific calibration strategies.
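Calibration for INT8 quantization boils down to observing activation values on representative inputs and choosing a range to map onto 8 bits. The sketch below shows that idea with plain lists, including optional clipping of outliers; it is a simplified illustration, not the export pipeline described above, and the function names and sample values are invented for the example.

```python
from typing import Iterable, List, Tuple


def calibrate_range(batches: Iterable[List[float]],
                    clip_fraction: float = 0.0) -> Tuple[float, float]:
    """Collect activations over calibration batches and return (lo, hi)."""
    values = sorted(v for batch in batches for v in batch)
    if not values:
        raise ValueError("calibration set is empty")
    k = int(len(values) * clip_fraction)   # outliers dropped from each tail
    k = min(k, (len(values) - 1) // 2)     # never drop everything
    return values[k], values[len(values) - 1 - k]


def int8_params(lo: float, hi: float) -> Tuple[float, int]:
    """Scale and zero point mapping [lo, hi] onto the uint8 range [0, 255]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)    # the range must contain zero
    scale = (hi - lo) / 255.0 or 1.0
    return scale, round(-lo / scale)


if __name__ == "__main__":
    # Made-up activations; 9.0 plays the role of a rare outlier.
    batches = [[0.1, 0.4, 0.2], [0.3, 9.0, 0.0], [0.5, 0.2, 0.1, 0.3]]
    for fraction in (0.0, 0.1):
        lo, hi = calibrate_range(batches, clip_fraction=fraction)
        scale, zero_point = int8_params(lo, hi)
        print(f"clip={fraction}: range=({lo}, {hi}) scale={scale:.5f} zp={zero_point}")
```

Clipping trades a small saturation error on outliers for finer resolution on the bulk of the values, which is one reason calibrated quantization tends to beat naive min/max ranges.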

13

multilingual-e5-small (Model, 52/100)

via “onnx and openvino model export for edge deployment”

sentence-similarity model. 7,032,108 downloads.

Unique: Provides pre-optimized ONNX and OpenVINO representations of multilingual-e5-small, enabling single-model deployment across diverse hardware (CPUs, mobile, edge) without language-specific optimizations. OpenVINO export includes graph-level optimizations (operator fusion, constant folding) and quantization-aware training compatibility, reducing inference latency by 2-4x on Intel CPUs.

vs others: Smaller and faster than PyTorch deployment for edge use cases; more portable across runtimes than a TensorFlow Lite conversion; enables privacy-preserving on-device inference without cloud dependencies.

14

multilingual-e5-base (Model, 51/100)

via “onnx and openvino model export for edge deployment”

sentence-similarity model. 3,660,082 downloads.

Unique: Supports three inference backends (PyTorch, ONNX Runtime, OpenVINO) from a single model artifact, with automatic optimization for each target platform — ONNX for cross-platform compatibility, OpenVINO for Intel hardware, PyTorch for development

vs others: More portable than PyTorch-only deployment and faster than unoptimized ONNX due to OpenVINO's graph-level optimizations; enables 2-4x latency reduction on CPU compared to PyTorch inference

15

wav2vec2-large-xlsr-53-portuguese (Model, 51/100)

via “model quantization and compression for edge deployment”

automatic-speech-recognition model. 3,453,044 downloads.

Unique: Quantization is not built into the model — requires external tools (torch.quantization, ONNX Runtime) and custom validation. The wav2vec2 architecture (with feature extraction and attention) presents unique quantization challenges not present in simpler models.

vs others: More flexible than pre-quantized models (allows custom quantization strategies); more challenging than models with built-in quantization support (e.g., TensorFlow Lite models); comparable to other wav2vec2 quantization approaches but requires Portuguese-specific validation to ensure accuracy.
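Validating a quantized speech model usually means comparing word error rate (WER) against the full-precision baseline on held-out audio. The sketch below computes WER with a standard word-level edit distance; the Portuguese transcripts are made-up examples, and a real validation run would use model outputs on a labeled test set.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "o gato dorme no sofá"
    fp32_output = "o gato dorme no sofá"  # made-up full-precision transcript
    int8_output = "o gato dorme sofá"     # made-up quantized transcript
    print("FP32 WER:", wer(reference, fp32_output))
    print("INT8 WER:", wer(reference, int8_output))
```

The number to track is the WER difference between the two model variants on the same audio, rather than either figure alone.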

16

all-MiniLM-L6-v2 (Model, 50/100)

via “quantized-model-inference”

feature-extraction model. 3,239,437 downloads.

Unique: 8-bit integer quantization reduces model size by 75% while maintaining <2% semantic similarity accuracy loss — ONNX Runtime's transparent dequantization means applications see identical float32 outputs without code changes, making optimization invisible to users

vs others: Smaller and faster than the full-precision all-MiniLM-L6-v2 (about 90 MB → 22 MB, 2-4x speedup); better accuracy than more aggressive quantization schemes (4-bit, binary), which shrink the model further; unlike knowledge distillation, it preserves the original model architecture
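The size figures follow directly from bits per weight. The short calculation below assumes roughly 22.7 million parameters for the model (an approximate figure used here for illustration) and ignores file-format overhead and any layers left unquantized.

```python
def weight_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in megabytes (10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1_000_000


PARAMS = 22_700_000  # assumed approximate parameter count

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    size = weight_size_mb(PARAMS, bits)
    saving = 1 - bits / 32
    print(f"{name:8s} {size:6.1f} MB  ({saving:.0%} smaller than float32)")
```

Going from 32-bit floats to 8-bit integers stores one byte per weight instead of four, which is where the 75% reduction comes from.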

17

UAE-Large-V1 (Model, 49/100)

via “onnx and openvino quantized inference for edge deployment”

feature-extraction model. 1,337,383 downloads.

Unique: Provides both ONNX and OpenVINO export formats with INT8 quantization pre-applied, enabling plug-and-play edge deployment without requiring custom quantization pipelines. Maintains <2% accuracy loss through careful calibration on representative text samples, unlike generic quantization approaches that often degrade embedding quality.

vs others: Faster edge inference than Sentence-BERT's standard PyTorch format (2-4x speedup via INT8) and more accessible than proprietary edge models like TensorFlow Lite, with no vendor lock-in.
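One simple way to check that INT8 quantization has not degraded embedding quality is to compare quantized and full-precision embeddings of the same texts by cosine similarity. The sketch below shows that check on made-up vectors; the function names are invented for this example, and a real check would use embeddings produced by both model variants and, ideally, a downstream retrieval metric as well.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return dot / norm


def mean_drift(fp32_vectors: Sequence[Sequence[float]],
               int8_vectors: Sequence[Sequence[float]]) -> float:
    """Average of (1 - cosine similarity) across paired embeddings."""
    sims = [cosine_similarity(a, b) for a, b in zip(fp32_vectors, int8_vectors)]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    # Made-up embeddings standing in for FP32 and INT8 model outputs.
    fp32 = [[1.0, 2.0, 3.0], [0.5, -0.2, 0.1]]
    int8 = [[1.01, 1.98, 3.02], [0.49, -0.21, 0.11]]
    print(f"mean drift: {mean_drift(fp32, int8):.6f}")
```

A drift close to zero suggests the quantized model preserves embedding directions; a threshold for acceptance is a project decision and is not implied by this listing.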

18

e5-base-v2 (Model, 49/100)

via “onnx and openvino model export for edge and on-premise deployment”

sentence-similarity model. 1,778,169 downloads.

Unique: Provides native ONNX and OpenVINO export through sentence-transformers' built-in conversion utilities, supporting both full-precision and quantized models without custom export code. The export process preserves the tokenizer and preprocessing logic, enabling end-to-end inference without reimplementing text preprocessing.

vs others: One-command export to multiple formats (ONNX, OpenVINO) with quantization support, whereas most models require separate conversion pipelines and manual tokenizer integration for edge deployment.

19

bert-large-cased-finetuned-conll03-english (Fine-tune, 49/100)

via “model quantization and compression for edge deployment”

token-classification model. 1,108,389 downloads.

Unique: Model is compatible with standard quantization pipelines (ONNX Runtime, TensorFlow Lite, PyTorch quantization) but lacks built-in quantization-aware training; users must apply post-training quantization with manual accuracy validation

vs others: Quantization reduces model size by 70-75% compared to uncompressed FP32; still slower than BERT-base on CPU because of its larger size, although quantization narrows the gap; typically more accurate than distilled models (DistilBERT) on formal English text, at the cost of slower inference

20

Qwen3-ASR-1.7B (Model, 49/100)

via “quantized-inference-for-edge-deployment”

automatic-speech-recognition model. 1,869,130 downloads.

Unique: Qwen3-ASR provides pre-optimized quantization profiles for common edge devices (ARM64, x86, mobile) via ONNX Runtime, with published accuracy benchmarks showing <2% WER degradation at INT8 and <5% at INT4. The model's 1.7B size is already optimized for quantization, unlike larger models that suffer more accuracy loss.

vs others: Smaller base model size (1.7B) means quantization overhead is lower than Whisper-large; achieves better accuracy-to-latency ratio on edge devices, but requires more manual optimization than cloud APIs which handle quantization transparently
