Capability
Cost-Optimized Inference Serving
20 artifacts provide this capability.
Top Matches
via “efficient-cpu-and-edge-inference”
sentence-similarity model. 34,253,353 downloads.
Unique: Ships pre-optimized ONNX and OpenVINO artifacts with a quantization-friendly architecture (no custom ops, standard transformer layers), enabling efficient CPU inference; at 438 MB, the model is 2-3x smaller than full-size BERT variants while maintaining competitive accuracy
vs others: Achieves 5-10x lower inference cost than GPU-based embeddings on serverless platforms (AWS Lambda: $0.0000002/invocation vs $0.0001+ for GPU) while retaining 85-95% of GPU inference quality through ONNX optimization
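The cost gap above can be illustrated with a back-of-the-envelope calculation. The per-invocation prices are the estimates quoted above; the monthly request volume is a hypothetical example workload, not a figure from the listing:

```python
# Back-of-the-envelope serverless inference cost comparison.
# Per-invocation prices are the estimates quoted in the entry above;
# the request volume is a hypothetical example workload.

CPU_ONNX_COST_PER_CALL = 0.0000002   # AWS Lambda, CPU + ONNX (estimate above)
GPU_COST_PER_CALL = 0.0001           # GPU-based embedding endpoint (estimate above)


def monthly_cost(cost_per_call: float, calls_per_month: int) -> float:
    """Total monthly spend for a given per-invocation price."""
    return cost_per_call * calls_per_month


calls = 10_000_000  # hypothetical: 10M embedding requests per month
cpu = monthly_cost(CPU_ONNX_COST_PER_CALL, calls)
gpu = monthly_cost(GPU_COST_PER_CALL, calls)
print(f"CPU/ONNX: ${cpu:,.2f}  GPU: ${gpu:,.2f}")
```

At this volume the quoted per-invocation prices work out to roughly $2 per month on CPU versus $1,000 on GPU; actual savings depend on memory allocation, invocation duration, and cold-start behavior on the serverless platform.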