Lightweight Mobile Vision Transformer Image Classification

1

Transformers Repository 55/100

via “vision transformer and cnn-based image classification with transfer learning”

Hugging Face's model library — thousands of pretrained transformers for NLP, vision, audio.

Unique: Provides both Vision Transformer and CNN-based models behind a unified API and supports transfer learning by freezing early layers. Each model's image processor (loadable via AutoImageProcessor) applies the model-specific preprocessing automatically.

vs others: Covers a wider range of transformer architectures than torchvision's model zoo, which also ships ViT variants but focuses mainly on CNNs. More convenient than hand-rolled transfer learning because loading pretrained weights, swapping the classification head, and fine-tuning follow the same pattern for every model.

2

mobilenetv3_small_100.lamb_in1k Model 54/100

via “lightweight-image-classification-inference”

image-classification model. 22,810,638 downloads.

Unique: Uses inverted residual blocks with linear bottlenecks and squeeze-and-excitation (SE) modules, targeting a strong accuracy-to-parameter ratio at about 2.5M parameters. ImageNet-1k top-1 for this small variant is in the high 60s (roughly 67-68%); figures around 75% belong to the larger MobileNetV3-Large. Trained on ImageNet-1k with the LAMB optimizer. Distributed via timm's model registry with automatic weight downloading; export to ONNX or TensorRT is a separate step rather than part of model loading.

vs others: More accurate than SqueezeNet and much lighter than EfficientNet-B0 or ResNet-50, though both of those reach higher ImageNet accuracy. Reported as 3-5× faster than ResNet-50 on ARM devices, which makes it a fit for latency-bound, general-purpose mobile classification.

3

fairface_age_image_detection Model 53/100

via “vision transformer patch-based feature extraction”

image-classification model. 6,365,110 downloads.

Unique: Uses google/vit-base-patch16-224-in21k as its foundation, which was pre-trained on ImageNet-21k (14M images) before being fine-tuned on FairFace, providing strong initialization for age-relevant features. The 16x16 patch size balances fine facial detail against computational cost, yielding 197 tokens for a 224x224 input (196 patches + 1 class token).

vs others: Captures long-range facial dependencies better than CNN-based age classifiers because self-attention can directly relate distant facial regions; more parameter-efficient than stacking deep CNN layers while maintaining or exceeding accuracy on age classification benchmarks.
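The token count quoted in this entry (196 patches + 1 class token) follows directly from the image and patch sizes. A minimal sketch of that arithmetic, assuming a square input that divides evenly into patches; the function name is illustrative and not part of any library:

```python
def vit_token_count(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Tokens a plain ViT processes for a square image.

    extra_tokens counts non-patch tokens such as the class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


if __name__ == "__main__":
    # 224 / 16 = 14 patches per side, so 14 * 14 patches plus one class token.
    print(vit_token_count(224, 16))
```

Halving the patch size to 8 would quadruple the number of patches, which is why 16x16 is a common compromise between detail and cost.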

4

vit-base-patch16-224 Model 51/100

via “patch-based image classification with vision transformer architecture”

image-classification model. 4,771,224 downloads.

Unique: Uses a pure transformer architecture (no convolutional feature hierarchy) with learnable patch embeddings and position embeddings, giving every layer a global receptive field. Pre-trained on ImageNet-21k (14M images) and then fine-tuned on ImageNet-1k (1.3M images), which is where much of its transfer-learning strength comes from.

vs others: Reported ImageNet accuracy is higher than ResNet-50 and EfficientNet-B0 (84.0% vs 76.1% and 77.1%), but at a much larger size: about 86M parameters versus roughly 25M and 5.3M, so inference is not comparably cheap. The global attention mechanism tends to help transfer to downstream tasks.

5

vit-base-nsfw-detector Model 49/100

via “vision transformer-based nsfw image classification”

image-classification model. 1,437,835 downloads.

Unique: Uses Vision Transformer patch-based architecture (16x16 patches) instead of CNN-based approaches like ResNet, enabling global context modeling across the entire image through self-attention mechanisms. Distributed in both ONNX and safetensors formats with quantization, allowing deployment flexibility from browser (transformers.js) to edge devices to cloud inference.

vs others: The quantized ONNX variants run faster than full-precision ViT models, and transformer attention is claimed to make the model more semantically robust than traditional CNN-based NSFW detectors. It is open-source and deployable without external APIs, unlike commercial solutions (AWS Rekognition, Google Vision API).

6

gender-classification Model 48/100

via “vision transformer-based binary gender classification from images”

image-classification model. 1,195,698 downloads.

Unique: Uses Vision Transformer (ViT) architecture with patch-based tokenization instead of traditional CNN backbones (ResNet, EfficientNet), enabling better capture of global gender-related visual patterns through multi-head self-attention across image regions. Distributed via HuggingFace's safetensors format for faster, safer model loading compared to pickle-based PyTorch checkpoints.

vs others: Faster inference than ensemble CNN models and more interpretable attention patterns than black-box CNNs, though potentially less robust to occlusion than specialized face-detection-first pipelines like MediaPipe + gender classifier combinations.

7

mobilevit-small Model 47/100

image-classification model. 2,781,568 downloads.

Unique: Uses a hybrid local-to-global architecture combining depthwise separable convolutions for local feature extraction with multi-head self-attention for global context, reaching 78.4% ImageNet-1k top-1 with 5.6M parameters. That is far smaller than ViT-Base (86M params) while keeping transformer-style global modeling for mobile deployment.

vs others: More accurate than MobileNetV3-Large (75.2%) at a similar model size, and slightly more accurate than EfficientNet-B0 (77.1%, 5.3M params) at a marginally larger parameter count. The attention layers can make on-device latency higher than for pure-CNN models of the same size, so the accuracy-to-latency tradeoff on ARM processors should be measured per target.
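Part of the reason this entry stays small is the depthwise separable convolution it mentions. The sketch below compares weight counts for a standard convolution and a depthwise separable one; biases are ignored, and the kernel and channel sizes are illustrative assumptions, not values taken from the MobileViT configuration:

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 conv that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 64, 128          # hypothetical layer shape
    dense = standard_conv_params(k, c_in, c_out)
    separable = depthwise_separable_params(k, c_in, c_out)
    print(dense, separable, round(dense / separable, 2))
```

For this hypothetical layer the separable form needs roughly an eighth of the weights, which is the kind of saving mobile architectures rely on.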

8

yolos-small Model 46/100

via “vision transformer-based object detection with patch tokenization”

object-detection model. 735,352 downloads.

Unique: Uses a plain Vision Transformer with patch-based tokenization (no CNN backbone) for object detection. A fixed set of learnable detection tokens is appended to the patch sequence, and predictions are matched to ground truth with a DETR-style bipartite matching loss, so no region proposals or anchors are needed. Attention stays global and quadratic in sequence length; there is no patch merging or other hierarchy.

vs others: Architecturally simpler than Faster R-CNN or Mask R-CNN, but it generally trades accuracy for that simplicity and is not tuned for speed on high-resolution inputs. Its main appeal is that it reuses an essentially unmodified ViT, which makes it easy to swap in differently pre-trained transformer backbones.

9

segformer-b0-finetuned-ade-512-512 Fine-tune 46/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 313,332 downloads.

Unique: SegFormer-B0 pairs a hierarchical Mix Transformer (MiT) encoder, which uses sequence-reduction self-attention and overlapping patch merging rather than shifted windows, with a lightweight all-MLP decoder (no convolutional decoder). At about 3.75M parameters it is far smaller than DeepLabV3+ (59M params) or PSPNet (46M params), and it widens the receptive field through attention instead of dilated convolutions.

vs others: Among the smallest transformer-based semantic segmentation models on Hugging Face with pre-trained ADE20K weights, enabling deployment on mobile/edge devices where DeepLabV3+ and PSPNet are too large.

10

vit_base_patch16_224.augreg2_in21k_ft_in1k Model 45/100

via “vision transformer patch-based image classification with imagenet-1k fine-tuning”

image-classification model. 501,255 downloads.

Unique: Combines ImageNet-21K pre-training (about 14M images across roughly 21K classes) with ImageNet-1K fine-tuning using the AugReg augmentation-and-regularization recipe, which generalizes better than training on ImageNet-1K alone. Patch-based tokenization (16×16) allows a pure transformer architecture without convolutions, with better long-range dependency modeling than CNNs.

vs others: Outperforms ResNet-50 and EfficientNet-B4 on ImageNet-1K accuracy (84.7% vs 76-82%) while maintaining competitive inference speed; superior to ViT-Base trained only on ImageNet-1K due to ImageNet-21K pre-training providing richer feature initialization

11

oneformer_ade20k_swin_tiny Model 45/100

via “lightweight-swin-tiny-backbone-inference”

image-segmentation model. 248,429 downloads.

Unique: The Swin Tiny backbone uses hierarchical window-based self-attention (shifted windows across 4 stages), so attention cost grows linearly with the number of tokens for a fixed window size instead of quadratically, cutting FLOPs by a reported 60% versus ViT-Base while maintaining competitive accuracy. At 28M parameters it is about 3× smaller than Swin Base (87M), which eases deployment to edge devices.

vs others: Smaller than Swin Base/Large while keeping the multi-scale, hierarchical feature maps that dense prediction needs, which plain ViT lacks and CNN backbones such as ResNet-50 provide by design. Relative speed against ResNet-based backbones depends on hardware and kernel support, so benchmark on the target device before choosing it for edge deployment.

12

convnextv2_nano.fcmae_ft_in22k_in1k Model 45/100

via “image classification with convnextv2 architecture”

image-classification model. 1,709,644 downloads.

Unique: Pre-trained with FCMAE (fully convolutional masked autoencoder), a self-supervised masked-image-modeling scheme designed for ConvNets, then fine-tuned on ImageNet-22k and ImageNet-1k as the model name indicates. The masked pre-training is what distinguishes it from purely supervised ConvNeXt checkpoints.

vs others: A compact, pure-convolutional alternative to small vision transformers: the Nano variant targets efficient image classification while benefiting from masked-autoencoder pre-training that most lightweight CNNs do not use.

13

segformer-b0-finetuned-ade-512-512 Fine-tune 44/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 508,692 downloads.

Unique: Lightweight B0 variant (3.7M parameters) with a hierarchical transformer encoder enables efficient client-side inference via ONNX, avoiding cloud API calls. At this size the full-precision (32-bit) weights come to roughly 15MB, and 8-bit quantization cuts that to about a quarter, with ADE20K accuracy reported within 2-3% of the original.

vs others: Smaller and faster than DeepLabV3+ (59M params) for browser deployment, more accurate than FCN-based segmentation on complex indoor scenes due to transformer attention, and open-source unlike proprietary cloud APIs (Google Vision, AWS Rekognition)
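The size figures in this entry can be sanity-checked from the parameter count alone. This is a back-of-the-envelope sketch that counts only raw weight storage; real ONNX files add graph metadata and may keep some tensors at higher precision, so treat the result as a lower bound. The function name is illustrative:

```python
def weight_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bits_per_weight / 8 / 1_000_000


if __name__ == "__main__":
    params = 3_700_000  # parameter count quoted for the B0 variant
    for bits in (32, 16, 8):
        print(f"{bits}-bit: {weight_size_mb(params, bits):.1f} MB")
```

By this estimate, a file of about 15MB corresponds to 32-bit weights, while 8-bit weights alone would occupy under 4MB.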

14

oneformer_ade20k_swin_large Model 44/100

via “swin-transformer-hierarchical-feature-extraction”

image-segmentation model. 90,906 downloads.

Unique: Implements shifted window attention (W-MSA and SW-MSA) that restricts self-attention to local windows of size 7×7, reducing complexity from O(N²) to O(N·w²) where w=7. This enables processing of high-resolution images while maintaining global receptive field through cross-window connections across stages.

vs others: The hierarchical design and local attention make high-resolution dense prediction practical where vanilla ViT's global attention would be prohibitively slow; speedups of 3-5× over ViT-Base on dense tasks are reported, at comparable or better accuracy.
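The O(N²) versus O(N·w²) comparison in this entry can be made concrete by counting query-key pairs. A minimal sketch, assuming a 56×56 token grid (a 224×224 input split into 4×4 patches, as in Swin's first stage) and the 7×7 window quoted above; it counts attention pairs only and ignores projections, heads, and the MLP layers:

```python
def global_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens


def windowed_attention_pairs(num_tokens: int, window: int) -> int:
    """Query-key pairs when each token attends only within its w x w window."""
    return num_tokens * window * window


if __name__ == "__main__":
    grid = 56                      # assumed tokens per side
    n = grid * grid
    full = global_attention_pairs(n)
    local = windowed_attention_pairs(n, window=7)
    print(n, full, local, full // local)
```

The ratio equals N / w², so the advantage of windowed attention grows with resolution while its own cost stays linear in the number of tokens.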

15

segformer-b5-finetuned-ade-640-640 Fine-tune 43/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 61,096 downloads.

Unique: Uses the SegFormer architecture with a hierarchical transformer encoder (B5, the largest variant, at roughly 84M parameters) and a lightweight MLP decoder instead of dense convolutional decoders, enabling efficient multi-scale feature fusion without expensive upsampling operations. Fine-tuned on ADE20K's 150 semantic classes at 640x640 resolution, with strong mIoU on scene parsing benchmarks.

vs others: Outperforms DeepLabV3+ and PSPNet on ADE20K scene parsing (mIoU ~50%), although as the largest SegFormer variant it is not smaller than those baselines. Faster than ViT-based segmentation approaches thanks to the hierarchical design, but slower than lightweight MobileNet-based segmenters for resource-constrained deployment.

16

segformer-b1-finetuned-ade-512-512 Fine-tune 43/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 177,465 downloads.

Unique: Uses hierarchical vision transformer (SegFormer) with all-MLP decoder instead of convolutional decoders, enabling efficient multi-scale feature fusion without expensive upsampling operations. Fine-tuned on ADE20K's 150 semantic classes (vs COCO's 80 or Cityscapes' 19) providing richer scene understanding for indoor/outdoor environments.

vs others: Faster inference and lower memory than DeepLabv3+ (ResNet backbone) while maintaining competitive mIoU; more efficient than ViT-based segmentation due to hierarchical design; outperforms FCN/U-Net on complex scene parsing due to transformer's global receptive field.

17

rorshark-vit-base Model 42/100

via “vision transformer-based image classification with imagenet-21k pretraining”

image-classification model. 653,291 downloads.

Unique: Fine-tuned from Google's ViT-base-patch16-224-in21k (pre-trained on ImageNet-21k: about 14M images, roughly 21k classes) rather than an ImageNet-1k checkpoint, providing stronger initialization for diverse downstream tasks and often better generalization to out-of-distribution images. Uses patch-based tokenization (16×16) instead of CNN feature hierarchies, giving global receptive fields from the first layer.

vs others: Typically transfers better than ResNet-50 or EfficientNet-B4, though at 86M parameters it is larger than both (about 25M and 19M). It is much smaller than the largest CLIP-based classifiers, and it is reported to match or exceed them on domain-specific tasks while being 3-5x faster to fine-tune, helped by the ImageNet-21k initialization.

18

vit-large-patch16-384 Model 42/100

via “imagenet-21k pre-trained image classification with vision transformer architecture”

image-classification model. 474,363 downloads.

Unique: Uses a pure transformer architecture (no convolutional layers) with patch-based tokenization and ImageNet-21k pre-training (14M images, roughly 21k classes) rather than ImageNet-1k only, enabling stronger transfer learning to downstream tasks. Uses standard multi-head self-attention (16 heads), whose cost is quadratic in sequence length; at 384×384 with 16×16 patches that is 577 tokens, so attention is noticeably more expensive than at 224×224.

vs others: Reaches roughly 85% ImageNet-1k top-1, ahead of ResNet-152 and on par with or slightly above EfficientNet-B7 (82-84%), but at a much higher compute cost than either. Transfers well thanks to the global receptive field from the first layer, though fine-tuning on small datasets usually needs strong regularization or more data.
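Because standard self-attention stores one score per token pair per head, moving from 224×224 to 384×384 inputs costs more than the pixel count suggests. A rough sketch of the per-layer attention-score memory for a single image, assuming 16 heads, 16×16 patches, one class token, and 32-bit scores; the function name and the fp32 assumption are illustrative:

```python
def attention_score_bytes(image_size: int, patch_size: int, heads: int,
                          bytes_per_value: int = 4) -> int:
    """Bytes needed for one layer's attention score matrices (one image)."""
    patches_per_side = image_size // patch_size
    tokens = patches_per_side * patches_per_side + 1   # +1 for the class token
    return heads * tokens * tokens * bytes_per_value


if __name__ == "__main__":
    small = attention_score_bytes(224, 16, heads=16)
    large = attention_score_bytes(384, 16, heads=16)
    print(small, large, round(large / small, 2))
```

The pixel count grows by about 2.9× between the two resolutions, while the attention scores grow by roughly 8.6×, which is the quadratic effect in practice.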

19

segformer-b2-finetuned-ade-512-512 Fine-tune 41/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 63,104 downloads.

Unique: Uses SegFormer's efficient hierarchical transformer encoder with an all-MLP decoder instead of dense convolutional decoders, using fewer than half the parameters of DeepLabV3+ while maintaining competitive accuracy. The Mix Transformer backbone progressively fuses multi-scale features without expensive upsampling operations, enabling faster inference on edge hardware.

vs others: Faster inference (2-3x speedup vs DeepLabV3+) with fewer parameters (27M vs 65M) while maintaining comparable mIoU on ADE20K, making it ideal for mobile/edge deployment where DeepLab variants are too heavy.
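The parameter comparison quoted here (27M vs 65M) translates into a relative saving that is easy to compute. A small sketch using only the two figures from this entry; the function name is illustrative:

```python
def param_reduction_percent(smaller: float, larger: float) -> float:
    """Percentage by which `smaller` undercuts `larger`."""
    if larger <= 0:
        raise ValueError("larger must be positive")
    return (1 - smaller / larger) * 100


if __name__ == "__main__":
    # Parameter counts in millions, as quoted in the entry above.
    print(f"{param_reduction_percent(27, 65):.1f}% fewer parameters")
```

With these numbers the saving is a little under 60%, meaning the model has fewer than half the parameters of the baseline.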

20

yolos-tiny Model 40/100

via “vision transformer-based object detection with attention-weighted region proposals”

object-detection model. 83,525 downloads.

Unique: Applies a pure transformer architecture (DETR-style set prediction with learnable detection tokens) to object detection instead of CNN backbones, enabling attention-based spatial reasoning without region proposal networks. The tiny variant keeps the parameter count to about 6.5M by using a DeiT-Tiny-sized backbone rather than by compressing a larger model, while retaining COCO detection capability.

vs others: Simpler architecture than Faster R-CNN (no RPN) and more parameter-efficient than standard ViT detectors, but slower inference than optimized YOLO v5/v8 on edge devices due to transformer computational overhead
