nexa-sdk
Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and CPU, with comprehensive runtime coverage for PC (Python/C++), mobile (Android & iOS), and Linux/IoT (Arm64 & x86 Docker). Supports OpenAI GPT-OSS, IBM Granite-4, Qwen-3-VL, Gemma-3n, Ministral-3, and more.
Capabilities (15 decomposed)
Cross-platform on-device LLM inference with hardware-agnostic abstraction
Medium confidence
Executes large language models locally across CPU, GPU, and NPU hardware through a layered architecture that abstracts hardware differences via a plugin system. The Go SDK provides type-safe interfaces (Create/Destroy lifecycle) that route inference requests through CGo bindings to C/C++ hardware plugins, enabling day-0 support for models like GPT-OSS, Granite-4, Qwen-3, and Llama-3 without cloud dependencies. Model formats (GGUF, MLX, NEXA) are handled by format-specific plugins that optimize for target hardware capabilities.
Plugin-based hardware abstraction layer (Layer 5) decouples model inference from hardware implementation, enabling day-0 support for new models and NPU architectures without SDK recompilation. CGo bridge (Layer 4) provides zero-copy memory management across language boundaries, critical for mobile/IoT where memory is constrained.
Supports NPU inference natively (Qualcomm, AMD, Intel) unlike Ollama or LM Studio which focus on GPU/CPU, and provides mobile SDKs (Android/iOS) that competitors lack, making it the only true cross-device inference framework.
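To make the Create/Destroy lifecycle and plugin routing concrete, here is a minimal Go sketch of a format-keyed plugin registry. The names (Plugin, Register, Open, ggufStub) are illustrative assumptions, not the SDK's actual types or signatures.

```go
// Hypothetical sketch of a format/hardware plugin registry in the spirit of
// the layered design described above. Names are illustrative, not the SDK's.
package main

import "fmt"

// Plugin abstracts one inference backend (e.g. a GGUF CPU/GPU backend or an
// NPU backend reached through a CGo bridge in the real SDK).
type Plugin interface {
	Create(modelPath string) error // load weights, allocate device memory
	Generate(prompt string) (string, error)
	Destroy() // release device memory deterministically
}

// registry maps a model format to the plugin factory that knows how to run it.
var registry = map[string]func() Plugin{}

func Register(format string, factory func() Plugin) { registry[format] = factory }

// Open picks a plugin by format so callers never touch hardware details.
func Open(format, modelPath string) (Plugin, error) {
	factory, ok := registry[format]
	if !ok {
		return nil, fmt.Errorf("no plugin registered for format %q", format)
	}
	p := factory()
	if err := p.Create(modelPath); err != nil {
		return nil, err
	}
	return p, nil
}

// ggufStub stands in for a real backend; it only echoes the prompt.
type ggufStub struct{ path string }

func (g *ggufStub) Create(p string) error { g.path = p; return nil }
func (g *ggufStub) Generate(prompt string) (string, error) {
	return "(stub completion for: " + prompt + ")", nil
}
func (g *ggufStub) Destroy() {}

func main() {
	Register("gguf", func() Plugin { return &ggufStub{} })

	llm, err := Open("gguf", "models/llama-3-8b.gguf")
	if err != nil {
		panic(err)
	}
	defer llm.Destroy() // mirrors the Create/Destroy lifecycle

	out, _ := llm.Generate("Explain day-0 model support in one sentence.")
	fmt.Println(out)
}
```

The point of the indirection is that callers only ever see the Plugin interface; swapping a CPU GGUF backend for an NPU backend is a registration change, not an application change.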
Vision-language model inference with multimodal input handling
Medium confidence
Processes images and text together through VLM models (Qwen-3-VL, etc.) using a unified Go SDK interface that handles image encoding, tokenization, and vision-specific hardware optimizations. The VLM plugin system manages image preprocessing (resizing, normalization) and routes vision tokens through specialized hardware paths (GPU tensor cores for image encoding, NPU for attention). Supports batch image processing and maintains image context across multi-turn conversations.
VLM plugin architecture (runner/nexa-sdk/vlm.go) separates image encoding from text generation, allowing hardware-specific optimization of vision towers (GPU tensor cores for image embeddings) while text generation runs on NPU, maximizing throughput on heterogeneous hardware.
Only on-device VLM framework supporting NPU acceleration for vision encoding, whereas competitors (Ollama, LM Studio) run full VLM on single GPU, making it 3-5x more efficient on mobile/edge devices with heterogeneous compute.
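A short Go sketch of the split described above, with the vision tower and the text decoder behind separate interfaces so each can be bound to different hardware. All type names and stubs here are hypothetical, for illustration only.

```go
// Illustrative-only sketch of splitting vision encoding from text generation.
package main

import "fmt"

// ImageEncoder would run the vision tower (e.g. on GPU tensor cores).
type ImageEncoder interface {
	Encode(imagePath string) ([]float32, error) // returns image embeddings
}

// TextDecoder would consume vision embeddings plus a prompt (e.g. on NPU).
type TextDecoder interface {
	Generate(prompt string, imageEmbeddings []float32) (string, error)
}

// Stub implementations so the sketch compiles and runs without real models.
type stubEncoder struct{}

func (stubEncoder) Encode(path string) ([]float32, error) {
	return make([]float32, 8), nil // placeholder embedding vector
}

type stubDecoder struct{}

func (stubDecoder) Generate(prompt string, emb []float32) (string, error) {
	return fmt.Sprintf("(answer about image, %d-dim embedding, prompt %q)", len(emb), prompt), nil
}

func main() {
	var enc ImageEncoder = stubEncoder{}
	var dec TextDecoder = stubDecoder{}

	emb, _ := enc.Encode("photo.jpg")                     // vision path
	out, _ := dec.Generate("What is in this photo?", emb) // language path
	fmt.Println(out)
}
```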
Python SDK with model lifecycle management and async inference
Medium confidence
Provides Python bindings to the Go SDK through a wrapper layer that exposes model classes (LLM, VLM, Embedder, etc.) with Create/Destroy lifecycle management. Supports both synchronous and asynchronous inference via asyncio, enabling concurrent model execution. Implements model caching and keepalive mechanisms to avoid reloading models between requests. Type hints and docstrings enable IDE autocomplete and documentation.
Python SDK wraps Go SDK with automatic model lifecycle management (Create/Destroy) and keepalive mechanisms, eliminating manual resource cleanup. Async support via asyncio enables concurrent inference without threading complexity.
Python SDK for on-device inference with native async support and automatic resource management; unlike thin wrappers around a local HTTP server, models are exposed as Python objects with lifecycle handling built in, making it a notably Pythonic on-device inference option.
Android SDK with native model inference and lifecycle management
Medium confidence
Provides Android-specific bindings to the Nexa inference engine through JNI (Java Native Interface) bridges. Implements model lifecycle management (Create/Destroy) with automatic cleanup on activity destruction. Supports both synchronous and asynchronous inference via Android's Executor framework. Handles Android-specific constraints (memory pressure, background execution, battery optimization) through lifecycle-aware components.
Android SDK implements lifecycle-aware components that automatically manage model memory based on Activity/Fragment lifecycle, preventing memory leaks and crashes. JNI bridge optimized for Android's memory constraints with aggressive garbage collection integration.
Only on-device inference SDK for Android with lifecycle-aware resource management and NPU support, whereas competitors (Ollama, LM Studio) have no mobile SDKs at all, making it the only true mobile-first on-device inference solution.
iOS SDK with Metal GPU acceleration and app extension support
Medium confidence
Provides iOS-specific bindings to the Nexa inference engine through Swift/Objective-C bridges. Implements Metal GPU acceleration for inference on Apple devices, leveraging GPU compute shaders for matrix operations. Supports iOS app extensions (Siri, keyboard, share) enabling inference in restricted execution contexts. Implements background task management for long-running inference with proper battery optimization.
iOS SDK leverages Metal GPU compute shaders for inference, achieving 2-3x speedup vs CPU on A-series chips. App extension support enables inference in restricted contexts (Siri, keyboard) through careful memory management and background task handling.
Only on-device inference SDK for iOS with native Metal GPU acceleration and app extension support, whereas competitors (Ollama, LM Studio) have no iOS SDKs at all, making it the only true iOS-native on-device inference solution.
Docker containerization for Linux/IoT deployment with Arm64 and x86 support
Medium confidence
Provides Docker images and containerization support for deploying Nexa on Linux servers and IoT devices. Supports both Arm64 (Raspberry Pi, Jetson, etc.) and x86-64 architectures with hardware-specific optimizations (CUDA for x86 GPU, NEON for Arm64 CPU). Implements multi-stage builds to minimize image size and includes pre-configured models for common use cases. Supports Docker Compose for orchestrating multi-model inference services.
Multi-architecture Docker images (Arm64 + x86) with hardware-specific optimizations (NEON for Arm64, CUDA for x86) in single image manifest, enabling seamless deployment across heterogeneous edge infrastructure. Multi-stage builds minimize image size while including pre-configured models.
Provides native Arm64 Docker images with hardware-specific optimization alongside x86, making it well suited to edge deployment on IoT and Raspberry Pi class devices where most desktop-oriented runtimes assume x86 GPUs.
Function calling with schema-based tool registry and multi-provider support
Medium confidence
Implements structured function calling through a schema-based tool registry that defines function signatures as JSON schemas. Supports OpenAI and Anthropic function-calling protocols natively, enabling agents to invoke external tools with type-safe arguments. The server middleware validates function calls against schemas, handles tool execution, and formats responses back to the model. Supports both synchronous tool execution and async tool chains.
Schema-based function registry (runner/server/service/) implements both OpenAI and Anthropic function-calling protocols with unified interface, enabling agents built for cloud APIs to execute local tools without adapter code. Middleware stack enables request/response transformation without modifying core inference.
Supports both OpenAI and Anthropic function-calling protocols natively, whereas most local runtimes expose only OpenAI-style tool calling, making it one of the few on-device frameworks offering multi-provider agent compatibility out of the box.
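For a concrete picture of the schema format involved, the sketch below builds an OpenAI-style tool definition and a chat request that carries it; the model id and the target endpoint are assumptions, and in practice the body would be POSTed to the local server.

```go
// A minimal sketch of an OpenAI-style tool (function) definition, the kind of
// JSON Schema the function registry described above validates calls against.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	request := map[string]any{
		"model": "qwen3", // assumed local model id
		"messages": []map[string]string{
			{"role": "user", "content": "What's the weather in Berlin?"},
		},
		// One tool entry, declared as a JSON Schema the model can call.
		"tools": []map[string]any{{
			"type": "function",
			"function": map[string]any{
				"name":        "get_weather",
				"description": "Look up current weather for a city",
				"parameters": map[string]any{
					"type": "object",
					"properties": map[string]any{
						"city": map[string]any{"type": "string"},
					},
					"required": []string{"city"},
				},
			},
		}},
	}

	body, _ := json.MarshalIndent(request, "", "  ")
	// In practice this body would be POSTed to the local server's
	// /v1/chat/completions endpoint (path assumed from OpenAI compatibility).
	fmt.Println(string(body))
}
```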
OpenAI-compatible HTTP server with function calling and streaming
Medium confidence
Exposes local inference models via REST API endpoints that mirror OpenAI's chat completion and embedding APIs, enabling drop-in replacement of cloud LLM services. The server implements streaming responses (Server-Sent Events), function calling via schema-based function registry with native bindings for OpenAI/Anthropic APIs, and middleware for request validation, rate limiting, and response formatting. Built on Go HTTP server with configurable port and model routing.
Schema-based function registry (runner/server/service/) implements OpenAI and Anthropic function-calling protocols natively, allowing agents built for cloud APIs to execute local tools without adapter code. Middleware stack enables request/response transformation without modifying core inference logic.
Provides OpenAI API compatibility with structured function calling, streaming, and Anthropic-protocol support, a combination most local servers do not offer, letting agent workflows built for cloud LLM APIs target on-device inference with minimal changes.
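A hedged client-side sketch of consuming the streaming endpoint from Go, assuming the server listens on localhost:8080 and follows the standard OpenAI SSE chunk format; the model id is also an assumption.

```go
// Client sketch: call the local OpenAI-compatible endpoint with streaming
// enabled and print Server-Sent Event chunks as they arrive.
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	payload, _ := json.Marshal(map[string]any{
		"model":  "gpt-oss", // assumed local model id
		"stream": true,
		"messages": []map[string]string{
			{"role": "user", "content": "Give me one fun fact about NPUs."},
		},
	})

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Each SSE line looks like: data: {"choices":[{"delta":{"content":"..."}}]}
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := strings.TrimPrefix(scanner.Text(), "data: ")
		if line == "" || line == "[DONE]" {
			continue
		}
		var chunk struct {
			Choices []struct {
				Delta struct {
					Content string `json:"content"`
				} `json:"delta"`
			} `json:"choices"`
		}
		if json.Unmarshal([]byte(line), &chunk) == nil && len(chunk.Choices) > 0 {
			fmt.Print(chunk.Choices[0].Delta.Content)
		}
	}
}
```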
Model hub integration with multi-source downloads and caching
Medium confidence
Manages model lifecycle (discovery, download, caching, updates) across multiple model repositories (Hugging Face, ModelScope, Volces, S3, local filesystem) through a pluggable model hub system. Implements intelligent caching with file locking to prevent concurrent downloads, manifest tracking for version management, and automatic model updates. The store manager (runner/internal/store/) handles disk space management, model validation, and atomic file operations to ensure consistency across platform crashes.
Multi-source model hub abstraction (runner/internal/model_hub/) with pluggable backends (HuggingFace, ModelScope, Volces, S3, LocalFS) enables seamless switching between model sources without code changes. File locking mechanism (runner/internal/store/lock.go) prevents concurrent download corruption on shared filesystems, critical for mobile app distribution.
Supports 5+ model sources natively (HF, ModelScope, Volces, S3, local) with atomic file operations, whereas Ollama only supports HF and requires manual S3 setup, and LM Studio has no programmatic model management API.
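The multi-source idea can be sketched as a small Go interface with one backend per source, selected by a scheme prefix on the model reference. The interface, scheme strings, and URL shapes below are illustrative assumptions, not the actual runner/internal/model_hub/ API.

```go
// Hypothetical sketch of a pluggable model-source abstraction: each backend
// resolves a model reference to a fetchable location behind one interface.
package main

import (
	"fmt"
	"strings"
)

// ModelSource is the narrow surface each backend (HF, ModelScope, S3, local
// filesystem, ...) would implement.
type ModelSource interface {
	Resolve(ref string) (string, error) // model reference -> fetchable location
}

type huggingFace struct{}

func (huggingFace) Resolve(ref string) (string, error) {
	return "https://huggingface.co/" + ref + "/resolve/main", nil
}

type localFS struct{}

func (localFS) Resolve(ref string) (string, error) {
	return "/models/" + ref, nil
}

// sources maps a scheme prefix in the model reference to a backend.
var sources = map[string]ModelSource{
	"hf":    huggingFace{},
	"local": localFS{},
}

func resolve(ref string) (string, error) {
	scheme, rest, ok := strings.Cut(ref, "://")
	if !ok {
		scheme, rest = "hf", ref // default source when no scheme is given
	}
	src, found := sources[scheme]
	if !found {
		return "", fmt.Errorf("unknown model source %q", scheme)
	}
	return src.Resolve(rest)
}

func main() {
	for _, ref := range []string{"Qwen/Qwen3-4B-GGUF", "local://granite-4"} {
		loc, err := resolve(ref)
		if err != nil {
			fmt.Println(ref, "error:", err)
			continue
		}
		fmt.Println(ref, "->", loc)
	}
}
```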
Text embedding generation with semantic search support
Medium confidence
Generates dense vector embeddings for text using embedding models (e.g., BGE, ONNX-based embedders) through the embedder interface (runner/nexa-sdk/embedder.go). Embeddings are computed locally on GPU/NPU for privacy, supporting batch processing to amortize inference overhead. Integrates with vector databases via standard embedding output format (float32 arrays), enabling semantic search, similarity matching, and RAG pipeline construction without external embedding services.
Embedder plugin architecture (runner/nexa-sdk/embedder.go) supports both GGUF and ONNX formats with hardware-specific optimization paths (GPU tensor cores for matrix multiplication, NPU for attention), enabling 2-3x faster embedding generation than CPU-only alternatives.
Only on-device embedding framework with NPU acceleration support, whereas Ollama embeddings run on GPU only and require cloud APIs for NPU devices, making it the only true edge-compatible embedding solution.
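Since the local server mirrors OpenAI's embedding API, a Go client can request vectors and compare them directly; the port, endpoint path, and model id below are assumptions based on that compatibility.

```go
// Client sketch: request embeddings from the local OpenAI-compatible
// /v1/embeddings endpoint and compare two texts by cosine similarity.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"math"
	"net/http"
)

type embeddingResponse struct {
	Data []struct {
		Embedding []float64 `json:"embedding"`
	} `json:"data"`
}

func embed(texts []string) ([][]float64, error) {
	payload, _ := json.Marshal(map[string]any{
		"model": "bge-small", // assumed local embedding model id
		"input": texts,
	})
	resp, err := http.Post("http://localhost:8080/v1/embeddings",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out embeddingResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	vecs := make([][]float64, len(out.Data))
	for i, d := range out.Data {
		vecs[i] = d.Embedding
	}
	return vecs, nil
}

// cosine returns the similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	vecs, err := embed([]string{"on-device inference", "running models locally"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("similarity: %.3f\n", cosine(vecs[0], vecs[1]))
}
```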
Reranking with cross-encoder models for retrieval refinement
Medium confidence
Implements cross-encoder reranking (runner/nexa-sdk/reranker.go) to refine retrieval results by scoring query-document pairs jointly, improving RAG pipeline precision. Rerankers take query and candidate documents as input, compute relevance scores, and return ranked results. Operates on GPU/NPU for efficient batch scoring of large result sets, supporting both pointwise (single score per document) and pairwise (comparative scoring) reranking strategies.
Reranker plugin supports both pointwise and pairwise scoring strategies with hardware-specific batch optimization, allowing developers to trade off latency vs precision by adjusting batch size and ranking strategy without code changes.
Provides on-device reranking with NPU acceleration, whereas most RAG frameworks (LangChain, LlamaIndex) rely on cloud reranking APIs (Cohere, Jina) or CPU-only local implementations, making it the only edge-compatible reranking solution.
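A minimal Go sketch of the pointwise flow: score each query-document pair, then sort candidates by score. The Reranker interface and the word-overlap scorer are stand-ins, not the SDK's actual API; a real cross-encoder would batch pairs through the model on GPU/NPU.

```go
// Illustrative-only sketch of pointwise reranking: score, then sort.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Reranker interface {
	Score(query, document string) float64
}

// overlapScorer is a trivial stand-in for a cross-encoder: it counts shared
// words between query and document.
type overlapScorer struct{}

func (overlapScorer) Score(query, doc string) float64 {
	q := strings.Fields(strings.ToLower(query))
	d := strings.ToLower(doc)
	var hits float64
	for _, w := range q {
		if strings.Contains(d, w) {
			hits++
		}
	}
	return hits / float64(len(q))
}

func main() {
	var r Reranker = overlapScorer{}
	query := "NPU accelerated inference"
	docs := []string{
		"Guide to NPU-accelerated LLM inference on laptops",
		"Recipe for sourdough bread",
		"Running quantized models on mobile GPUs",
	}

	sort.Slice(docs, func(i, j int) bool {
		return r.Score(query, docs[i]) > r.Score(query, docs[j])
	})
	fmt.Println("top result:", docs[0])
}
```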
Image generation with Stable Diffusion and latent diffusion models
Medium confidence
Generates images from text prompts using Stable Diffusion and compatible latent diffusion models through a dedicated image generation plugin. Implements the full diffusion pipeline (text encoding, latent diffusion, VAE decoding) with hardware-specific optimizations for GPU/NPU. Supports various sampling strategies (DDPM, DDIM, Euler), LoRA adapters for style transfer, and negative prompts for quality control. Outputs PNG/JPEG images with configurable resolution and quality parameters.
Image generation plugin architecture separates text encoding (CLIP), latent diffusion, and VAE decoding into independent stages, enabling hardware-specific routing (text encoding on NPU, diffusion on GPU, VAE on CPU) for heterogeneous device optimization.
Only on-device image generation framework supporting NPU acceleration for text encoding and diffusion steps, whereas Ollama lacks image generation entirely and Stable Diffusion WebUI runs on GPU only, making it the only true edge-compatible image generation solution.
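To illustrate the stage separation, here is a toy Go sketch with the three stages as separate functions, each of which could in principle be routed to different hardware. Everything here is a placeholder; none of these function names come from the SDK.

```go
// Hypothetical sketch of the three-stage pipeline: prompt -> text embedding
// -> latent -> decoded image bytes.
package main

import "fmt"

func encodePrompt(prompt string) []float32 { // CLIP-style text encoder stage
	return make([]float32, 16)
}

func diffuse(embedding []float32, steps int) []float32 { // latent diffusion stage
	return make([]float32, 64)
}

func decodeLatent(latent []float32) []byte { // VAE decode stage -> image bytes
	return []byte{0x89, 'P', 'N', 'G'} // placeholder PNG header bytes
}

func main() {
	emb := encodePrompt("a watercolor fox, negative: blurry")
	latent := diffuse(emb, 20)
	img := decodeLatent(latent)
	fmt.Printf("generated %d bytes of image data\n", len(img))
}
```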
Text-to-speech synthesis with streaming audio output
Medium confidence
Converts text to natural-sounding speech using TTS models through the audio processing plugin system. Implements streaming audio generation where speech is synthesized incrementally and output as audio chunks (WAV, MP3), enabling real-time playback without waiting for full synthesis. Supports multiple voices, speaking rates, and prosody control. Hardware acceleration on GPU/NPU speeds up mel-spectrogram generation and vocoder inference.
Streaming TTS architecture (runner/nexa-sdk/audio.go) generates audio chunks incrementally, enabling real-time playback while synthesis continues, unlike batch TTS which requires waiting for full synthesis. Hardware acceleration on GPU/NPU for mel-spectrogram generation reduces latency by 3-5x.
Only on-device TTS framework with streaming output and NPU acceleration, whereas Ollama lacks TTS entirely and cloud TTS APIs (Google, Amazon) require network round-trips, making it the only solution for real-time voice synthesis on edge devices.
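The streaming pattern can be sketched in Go as a producer goroutine emitting audio chunks on a channel while the consumer plays or writes them. The chunking, timing, and function names are all illustrative assumptions.

```go
// Hypothetical sketch of streaming synthesis: chunks arrive while synthesis
// continues, so playback can start before the full utterance is ready.
package main

import (
	"fmt"
	"time"
)

// synthesize stands in for an incremental TTS engine; each chunk would be a
// short run of PCM/WAV bytes in a real pipeline.
func synthesize(text string, chunks chan<- []byte) {
	defer close(chunks)
	for i := 0; i < 4; i++ {
		time.Sleep(50 * time.Millisecond) // simulated per-chunk synthesis time
		chunks <- []byte(fmt.Sprintf("chunk-%d", i))
	}
}

func main() {
	chunks := make(chan []byte, 2)
	go synthesize("Hello from an on-device voice.", chunks)

	for chunk := range chunks {
		// A real consumer would hand each chunk to an audio sink immediately.
		fmt.Printf("received %d bytes\n", len(chunk))
	}
}
```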
Automatic speech recognition with streaming audio input
Medium confidence
Transcribes audio to text using ASR models (Whisper, etc.) through the audio processing plugin system. Supports streaming transcription where audio chunks are processed incrementally, enabling real-time speech-to-text without waiting for full audio. Implements voice activity detection (VAD) to skip silence, reducing computation. Outputs text with optional timestamps and confidence scores. Hardware acceleration on GPU/NPU speeds up acoustic model inference.
Streaming ASR architecture with voice activity detection (VAD) processes audio incrementally and skips silence, reducing computation by 30-50% vs batch processing. Hardware acceleration on GPU/NPU for acoustic model inference enables real-time transcription on mobile devices.
Only on-device ASR framework with streaming input and VAD, whereas Ollama lacks ASR entirely and cloud ASR APIs (Google, Amazon) require network latency, making it the only solution for real-time speech recognition on edge devices without internet.
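A toy Go sketch of energy-based VAD gating: frames below an energy threshold are skipped, and only voiced frames reach the (stub) recognizer. The threshold value and frame layout are assumptions for illustration.

```go
// Hypothetical sketch of VAD gating: silence costs no model compute.
package main

import "fmt"

// energy returns mean absolute amplitude of a frame of PCM samples.
func energy(frame []float64) float64 {
	var sum float64
	for _, s := range frame {
		if s < 0 {
			s = -s
		}
		sum += s
	}
	return sum / float64(len(frame))
}

// transcribe is a stand-in for incremental ASR on a voiced frame.
func transcribe(frame []float64) string {
	return fmt.Sprintf("[partial transcript of %d samples]", len(frame))
}

func main() {
	const threshold = 0.05 // assumed VAD energy threshold
	frames := [][]float64{
		{0.01, 0.02, 0.01}, // silence: skipped
		{0.30, 0.25, 0.40}, // speech: transcribed
		{0.00, 0.01, 0.00}, // silence: skipped
		{0.22, 0.35, 0.18}, // speech: transcribed
	}

	for _, f := range frames {
		if energy(f) < threshold {
			continue // VAD skips silent frames
		}
		fmt.Println(transcribe(f))
	}
}
```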
Command-line interface with interactive REPL and model management
Medium confidence
Provides a comprehensive CLI (runner/cmd/nexa-cli/) for model discovery, download, inference, and server management. Implements an interactive REPL mode for testing models with multi-turn conversations, model listing/info commands, and server startup. The CLI routes commands through the core orchestration layer (Layer 2) which parses arguments and dispatches to appropriate Go SDK methods. Supports both one-shot inference (nexa run model 'prompt') and interactive sessions (nexa infer model).
Interactive REPL mode (runner/cmd/nexa-cli/infer.go) maintains conversation state across turns, enabling multi-turn testing without reloading models. Command routing through core orchestration layer (Layer 2) ensures CLI and SDK share identical inference logic.
Provides an interactive REPL with multi-turn conversation support alongside one-shot inference, model management, and server startup in a single CLI, making it a notably developer-friendly on-device inference tool.
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with nexa-sdk, ranked by overlap. Discovered automatically through the match graph.
ollama
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
LLaVA Llama 3 (8B)
LLaVA on Llama 3 — improved vision-language on Llama 3 backbone — vision-capable
onnxruntime
ONNX Runtime is a runtime accelerator for Machine Learning models
MLX
Apple's ML framework for Apple Silicon — NumPy-like API, unified memory, LLM support.
Ollama
Get up and running with large language models locally.
LM Studio
Download and run local LLMs on your computer.
Best For
- ✓Privacy-conscious developers building LLM applications for regulated industries
- ✓Mobile app developers targeting Android/iOS with on-device AI
- ✓IoT/edge computing teams deploying inference on Arm64 or x86 Docker containers
- ✓Teams requiring zero-latency inference or offline-first architectures
- ✓Healthcare/legal teams processing sensitive documents with privacy requirements
- ✓Mobile app developers building image analysis features (photo search, accessibility)
- ✓Robotics/autonomous systems teams needing real-time visual reasoning
- ✓Content moderation platforms requiring on-device image understanding
Known Limitations
- ⚠Plugin system adds abstraction overhead (~50-100ms per inference call depending on hardware bridge complexity)
- ⚠Model quantization (GGUF format) may reduce accuracy vs full-precision cloud models by 1-3% on benchmarks
- ⚠NPU support limited to Qualcomm, AMD, and Intel architectures — no support for Apple Neural Engine for LLM inference
- ⚠Memory constraints on mobile devices limit model size to ~7B parameters effectively
- ⚠Image resolution limited by model architecture (typically 1024x1024 max) — larger images require tiling or downsampling
- ⚠Vision encoding adds 200-500ms latency per image depending on resolution and hardware
Repository Details
Last commit: Apr 14, 2026