Arm-Optimized ONNX Model Inference on Mobile Devices

1

LLM Guard (Framework, 63/100)

via “onnx model optimization for low-latency and resource-constrained deployment”

Open-source LLM input/output security scanner toolkit.

Unique: Provides configuration-driven ONNX optimization with quantization support (int8, float16) enabling 2-10x latency reduction; supports switching between full-precision and optimized models via configuration without code changes; enables deployment on CPU-only and edge devices where GPU acceleration is unavailable

vs others: Faster inference than PyTorch models because ONNX Runtime is optimized for inference; more flexible than fixed-optimization approaches because quantization level is configurable; enables deployment scenarios (edge, serverless, CPU-only) that would be infeasible with full-precision models
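
The configuration-driven switch described above can be pictured with a small sketch. The class name, field names, and file names here are hypothetical and are not taken from LLM Guard's API; the point is only that the runtime and precision are data, so changing them does not require touching inference code.

```python
from dataclasses import dataclass

# Hypothetical artifact names; a real deployment would point at its own files.
MODEL_FILES = {
    ("pytorch", "float32"): "model.safetensors",
    ("onnx", "float32"): "model.onnx",
    ("onnx", "float16"): "model_fp16.onnx",
    ("onnx", "int8"): "model_int8.onnx",
}


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical config object: which runtime and precision to load."""
    runtime: str = "pytorch"    # "pytorch" or "onnx"
    precision: str = "float32"  # "float32", "float16", or "int8"


def resolve_model_file(cfg: InferenceConfig) -> str:
    """Map a config to a model artifact, rejecting unsupported combinations."""
    try:
        return MODEL_FILES[(cfg.runtime, cfg.precision)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {cfg.runtime}/{cfg.precision}"
        ) from None


# Same calling code, different config values.
full = resolve_model_file(InferenceConfig())
edge = resolve_model_file(InferenceConfig(runtime="onnx", precision="int8"))
```

Keeping the lookup table separate from the loader means an unsupported pairing fails at startup rather than at the first inference call.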

2

ONNX Runtime Mobile (Framework, 60/100)

via “arm-optimized onnx model inference on mobile devices”

Cross-platform ONNX inference for mobile devices.

Unique: Implements ARM SIMD-aware graph execution with automatic operator partitioning: if a model operator isn't supported by the target accelerator (CoreML/NNAPI), the runtime falls back to CPU execution for that subgraph rather than failing, which allows graceful degradation across devices with different capabilities.

vs others: Faster than TensorFlow Lite on ARM for complex models because ONNX Runtime's graph optimization pipeline includes operator fusion and memory layout optimization, while TFLite's ARM backend is more conservative; more portable than native CoreML/NNAPI because ONNX format abstracts away iOS/Android differences.
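
The fallback behavior described for this entry can be illustrated with a toy partitioner. This is a simplified sketch, not ONNX Runtime's actual algorithm: it treats the model as a flat operator sequence, and both the operator list and the accelerator's supported set are made up for illustration.

```python
from typing import List, Set, Tuple


def partition_ops(ops: List[str], accelerated: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive operators by the backend that would run them."""
    partitions: List[Tuple[str, List[str]]] = []
    for op in ops:
        backend = "accelerator" if op in accelerated else "cpu"
        if partitions and partitions[-1][0] == backend:
            # Same backend as the previous operator: extend that subgraph.
            partitions[-1][1].append(op)
        else:
            # Backend changed: start a new subgraph.
            partitions.append((backend, [op]))
    return partitions


# Hypothetical model and hypothetical accelerator support set.
model_ops = ["Conv", "Relu", "Conv", "NonMaxSuppression", "Softmax"]
supported = {"Conv", "Relu", "Softmax"}
plan = partition_ops(model_ops, supported)
```

Each switch between backends costs a data transfer, so a model with many unsupported operators scattered through the graph may run faster entirely on CPU.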

3

TensorFlow Lite (Framework, 60/100)

via “on-device model inference with sub-100ms latency”

Lightweight ML inference for mobile and edge devices.

Unique: Optimized memory layout (row-major tensor storage) and single-pass interpreter design minimize cache misses and memory bandwidth. Uses pre-allocated tensor buffers (no dynamic allocation during inference) and platform-specific optimized kernels (ARM NEON intrinsics for mobile, Qualcomm Hexagon for NPU). Supports optional multi-threaded execution via configurable thread pool without requiring model recompilation.

vs others: Faster than TensorFlow full framework on mobile (10-50x speedup) due to optimized kernels and minimal overhead. Comparable latency to CoreML on iOS and NNAPI on Android, but more portable across platforms. Slower than specialized inference engines (TensorRT on NVIDIA, OpenVINO on Intel) due to broader hardware support and lack of per-device optimization.

4

Phi-3.5 Mini (Model, 59/100)

via “edge device and mobile deployment with onnx and gguf formats”

Microsoft's 3.8B model with 128K context for edge deployment.

Unique: Provides pre-optimized ONNX and GGUF formats specifically for cross-platform edge deployment, eliminating custom conversion and quantization work while supporting iOS, Android, and browser targets simultaneously from a single model artifact

vs others: Broader deployment target coverage than Llama 2 (primarily GGUF) or Mistral (primarily ONNX), with official support for mobile platforms and browsers enabling true offline-first applications without cloud fallback

5

Llama 3.2 90B Vision (Model, 59/100)

via “optimization for arm processors and mobile hardware”

Meta's largest open multimodal model at 90B parameters.

Unique: Provides explicit Arm processor optimizations for Qualcomm and MediaTek hardware, enabling mobile deployment through ExecuTorch with device-specific operator fusion rather than generic quantization

vs others: Hardware-specific optimizations enable better mobile performance than generic quantization approaches, though 90B model size likely requires smaller variants for practical mobile deployment

6

Llama 3.2 3B (Model, 59/100)

via “mobile and embedded device optimization with hardware acceleration”

Compact 3B model balancing capability with edge deployment.

Unique: Native ARM optimization with Qualcomm and MediaTek hardware acceleration enabled day one, plus ExecuTorch framework integration for quantized on-device inference — most 3B models lack mobile-specific optimizations or require generic CPU inference

vs others: Faster mobile inference than unoptimized models through hardware-specific kernels; a smaller parameter count than 7B+ models keeps quantized weights to roughly 1.5 GB for a nominal 3B parameters at 4 bits per weight, which fits in the memory of many current phones
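
The memory footprint of a 3B model depends mostly on bits per weight. The following back-of-the-envelope sketch uses a nominal 3,000,000,000 parameters (an assumption; the exact count differs) and counts weight storage only, ignoring activations, the KV cache, and runtime overhead.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights alone at a given precision."""
    return num_params * bits_per_weight // 8


GIB = 1024 ** 3
NOMINAL_PARAMS = 3_000_000_000  # assumption: round figure for a "3B" model

for bits in (16, 8, 4):
    size_gib = weight_bytes(NOMINAL_PARAMS, bits) / GIB
    print(f"{bits}-bit weights: {size_gib:.2f} GiB")
```

Under these assumptions even 4-bit weights exceed 1 GiB, so a 3B model needs either more aggressive compression or a device with a few gigabytes of free memory.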

7

Llama 3.2 11B Vision (Model, 59/100)

via “single-gpu local inference with edge/mobile optimization”

Meta's multimodal 11B model with text and vision.

Unique: Explicitly optimized for Arm processors and edge hardware (Qualcomm, MediaTek) from release, with native support via PyTorch ExecuTorch. 11B parameter footprint is 6-7x smaller than competing vision models (70B+), fitting within single-GPU and mobile memory constraints. Includes torchtune integration for local fine-tuning without cloud infrastructure.

vs others: Smaller model size enables local inference on consumer hardware without cloud dependency, while Arm optimization eliminates the need for x86-specific deployment pipelines used by larger models.

8

xlm-roberta-base (Model, 55/100)

via “onnx model export and optimized inference”

fill-mask model. 18,165,674 downloads.

Unique: Provides native ONNX export support via HuggingFace Transformers, enabling single-command conversion to hardware-agnostic format with built-in optimization profiles for CPU, GPU, and mobile inference — unlike manual ONNX conversion which requires deep knowledge of ONNX IR and operator semantics

vs others: Reduces deployment complexity and inference latency compared to PyTorch/TensorFlow serving by eliminating framework dependencies and enabling aggressive quantization/pruning, while maintaining model accuracy through ONNX Runtime's operator fusion and memory optimization

9

bge-reranker-v2-m3 (Model, 54/100)

via “quantization-and-model-compression-for-edge-deployment”

text-classification model. 9,881,128 downloads.

Unique: Built on an XLM-RoBERTa-family backbone (via bge-m3), which is far smaller than LLM-based rerankers and so keeps quantized artifacts practical for edge use; safetensors weights load without pickle deserialization, which simplifies ONNX conversion compared with the .bin format

vs others: A compact encoder backbone leaves less to compress than LLM-based rerankers; ONNX support enables cross-platform deployment (CPU, mobile, edge) vs PyTorch-only models

10

bge-base-en-v1.5 (Model, 54/100)

via “onnx-export-and-cpu-inference”

feature-extraction model. 8,155,394 downloads.

Unique: BGE-base-en-v1.5 provides official ONNX exports with a graph structure optimized for inference runtimes, enabling sub-100ms CPU inference on modern processors and deployment on edge devices without PyTorch or a GPU

vs others: Faster CPU inference than PyTorch eager execution and more portable than TorchScript for cross-platform deployment; enables embedding generation on edge devices where PyTorch is too heavy

11

table-transformer-detection (Model, 53/100)

via “onnx model export for edge deployment and inference optimization”

object-detection model. 3,394,499 downloads.

Unique: Provides transformer-aware ONNX export that preserves attention mechanism semantics while enabling quantization-friendly operator fusion. The export pipeline includes automatic calibration for INT8 quantization using representative document images, reducing manual tuning overhead.

vs others: More portable than TensorFlow Lite or CoreML because ONNX Runtime runs on Windows, Linux, macOS, iOS, and Android with identical inference results; achieves better accuracy-latency tradeoffs than naive INT8 quantization due to transformer-specific calibration strategies.

12

multilingual-e5-small (Model, 53/100)

via “onnx and openvino model export for edge deployment”

sentence-similarity model. 7,032,108 downloads.

Unique: Provides pre-optimized ONNX and OpenVINO representations of multilingual-e5-small, enabling single-model deployment across diverse hardware (CPUs, mobile, edge) without language-specific optimizations. OpenVINO export includes graph-level optimizations (operator fusion, constant folding) and quantization-aware training compatibility, reducing inference latency by 2-4x on Intel CPUs.

vs others: Smaller and faster than PyTorch deployment for edge use cases; more portable than TensorFlow Lite (where converting transformer models typically takes extra work); enables privacy-preserving on-device inference without cloud dependencies.

13

multi-qa-mpnet-base-dot-v1 (Model, 53/100)

via “onnx-and-openvino-export-for-edge-deployment”

sentence-similarity model. 2,530,482 downloads.

Unique: Provides native ONNX and OpenVINO export support with quantization-friendly architecture (no custom ops). Enables deployment on edge devices and CPU-only infrastructure with minimal code changes, supporting both float32 and int8 quantized inference.

vs others: Faster edge deployment than PyTorch models because ONNX Runtime and OpenVINO use optimized inference engines with hardware-specific optimizations, and quantization support reduces model size by 4x and latency by 2-3x compared to full-precision models.
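
The 4x size reduction quoted above follows from storing each weight in 1 byte instead of 4. The sketch below shows symmetric per-tensor int8 quantization in plain Python; it is an illustrative simplification (real toolchains typically quantize per channel and calibrate activations too), and the weight values are made up.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from int8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = 4 * len(weights)  # float32: 4 bytes per weight
int8_bytes = 1 * len(weights)  # int8: 1 byte per weight
```

The round trip is lossy: each restored value can differ from the original by up to half a quantization step (`scale / 2`), which is where the accuracy cost of int8 comes from.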

14

nexa-sdk (Framework, 52/100)

via “runtime performance optimization”

Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and CPU, with comprehensive runtime coverage for PC (Python/C++), mobile (Android & iOS), and Linux/IoT (Arm64 & x86 Docker). Supporting OpenAI GPT-OSS, IBM Granite-4, Qwen-3-VL, Gemma-3n, Ministral-3, and more.

Unique: Combines quantization and pruning techniques specifically tailored for LLMs, allowing for effective deployment on devices with limited resources.

vs others: More effective than standard frameworks that do not offer built-in optimization for large models on low-power devices.

15

multilingual-e5-base (Model, 51/100)

via “onnx and openvino model export for edge deployment”

sentence-similarity model. 3,660,082 downloads.

Unique: Supports three inference backends (PyTorch, ONNX Runtime, OpenVINO) from a single model artifact, with automatic optimization for each target platform — ONNX for cross-platform compatibility, OpenVINO for Intel hardware, PyTorch for development

vs others: More portable than PyTorch-only deployment and faster than unoptimized ONNX due to OpenVINO's graph-level optimizations; enables 2-4x latency reduction on CPU compared to PyTorch inference

16

distil-large-v3 (Model, 51/100)

via “onnx-export-and-cross-platform-inference”

automatic-speech-recognition model. 1,305,832 downloads.

Unique: Leverages ONNX's standardized opset to enable deployment across 10+ platforms (Windows, Linux, macOS, iOS, Android, web browsers, embedded systems) with a single model export; ONNX Runtime's execution providers let the same model file target CPU, GPU, CoreML, or NNAPI acceleration by changing only the provider list passed at session creation

vs others: Enables true cross-platform deployment with a single model file, unlike PyTorch Mobile (iOS/Android only) or TensorFlow Lite (mobile-focused); ONNX Runtime's graph optimizations often match or exceed framework-native inference speed while providing broader platform coverage
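
Execution provider selection can be sketched as a priority list filtered by what the current build offers. The helper below is hypothetical and pure Python; in ONNX Runtime's Python API the available list would come from `onnxruntime.get_available_providers()` and the result would be passed as the `providers` argument of `onnxruntime.InferenceSession`. The `available` list here is a made-up stand-in for an Android build.

```python
from typing import List

CPU = "CPUExecutionProvider"


def choose_providers(preferred: List[str], available: List[str]) -> List[str]:
    """Keep preferred providers that are available, always ending with CPU."""
    chosen = [p for p in preferred if p in available and p != CPU]
    chosen.append(CPU)  # CPU is the final fallback for unsupported operators
    return chosen


# Made-up availability for illustration; query the runtime on a real device.
available = ["NnapiExecutionProvider", CPU]
preferred = ["CoreMLExecutionProvider", "NnapiExecutionProvider", CPU]
providers = choose_providers(preferred, available)
```

The same preference list can ship on every platform: CoreML is picked up on iOS, NNAPI on Android, and anything else runs on CPU.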

17

bert-base-NER (Model, 50/100)

via “onnx export for edge deployment and inference optimization”

token-classification model. 1,811,113 downloads.

Unique: Supports ONNX export via transformers' built-in export utilities, enabling deployment on ONNX Runtime, which applies graph optimizations (such as operator fusion) and supports quantization without retraining. ONNX models are framework-agnostic and can run on CPU, GPU, or specialized accelerators such as NPUs via different ONNX Runtime execution providers.

vs others: Faster and smaller than PyTorch checkpoints due to graph optimization, and more portable than TensorFlow SavedModel, but requires additional conversion step and validation compared to native PyTorch deployment.

18

UAE-Large-V1 (Model, 49/100)

via “onnx and openvino quantized inference for edge deployment”

feature-extraction model. 1,337,383 downloads.

Unique: Provides both ONNX and OpenVINO export formats with INT8 quantization pre-applied, enabling plug-and-play edge deployment without requiring custom quantization pipelines. Maintains <2% accuracy loss through careful calibration on representative text samples, unlike generic quantization approaches that often degrade embedding quality.

vs others: Faster edge inference than Sentence-BERT's standard PyTorch format (2-4x speedup via INT8) and more portable than models shipped only in a platform-specific format, with no vendor lock-in.

19

RMBG-1.4 (Model, 48/100)

via “onnx-based cross-platform inference without pytorch dependency”

image-segmentation model. 1,016,325 downloads.

Unique: Pre-exported ONNX model with inference-specific optimizations (operator fusion, memory layout optimization) reduces model size and latency compared to PyTorch eager execution; eliminates PyTorch dependency entirely, enabling deployment to platforms where PyTorch is unavailable or impractical

vs others: Smaller model size and faster inference than PyTorch on CPU; broader platform support than PyTorch Mobile (which is iOS/Android only); ONNX Runtime is more mature and widely supported than alternative inference engines like TensorFlow Lite for this use case

20

BiRefNet (Model, 48/100)

via “onnx export for cross-platform deployment”

image-segmentation model. 921,132 downloads.

Unique: Enables ONNX export of the bidirectional refinement architecture, preserving the multi-scale feature fusion and iterative refinement semantics in ONNX IR format, allowing deployment on non-PyTorch platforms while maintaining segmentation quality

vs others: Broader deployment flexibility than PyTorch-only models; ONNX Runtime provides faster CPU inference and better mobile/edge device support than PyTorch Mobile, though with some accuracy trade-off in quantized versions
