Visual Tokenization with Variable-Resolution VAE Supporting 2^16 to 2^64 Vocabulary Sizes

1

bert-base-multilingual-uncased | Model | 52/100

via “vocabulary-constrained token prediction with a shared WordPiece vocabulary”

fill-mask model. 3,974,711 downloads.

Unique: Uses a single shared WordPiece vocabulary of 105,879 tokens across 102 languages, enabling consistent subword tokenization and vocabulary-constrained predictions without language-specific token sets. The vocabulary covers multilingual character sets and subword units learned during joint pretraining, so tokenization is deterministic and reproducible.

vs others: The shared vocabulary enables cross-lingual consistency and transfer learning. However, monolingual models (e.g., RoBERTa for English) achieve higher vocabulary coverage and prediction accuracy on single-language tasks because their tokenization is optimized for one language.
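As a rough illustration of what "vocabulary-constrained" prediction means, the sketch below restricts a fill-mask distribution to an allowed subset of token ids and renormalizes. The toy vocabulary, the logits, and the helper name `constrained_top_k` are hypothetical; this is not the model's or any library's API.

```python
import math

def constrained_top_k(logits, allowed_ids, k=3):
    """Rank the allowed token ids by probability after renormalizing.

    logits: list of floats, one score per vocabulary entry.
    allowed_ids: vocabulary indices that the prediction may use.
    """
    allowed = sorted(set(allowed_ids))
    if not allowed:
        raise ValueError("allowed_ids must not be empty")
    # Softmax over the allowed subset only (shifted by the max for stability).
    peak = max(logits[i] for i in allowed)
    weights = {i: math.exp(logits[i] - peak) for i in allowed}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical toy data: "banana" has the highest raw score but is not allowed.
toy_vocab = ["[PAD]", "paris", "london", "banana", "berlin"]
toy_logits = [0.1, 4.0, 3.0, 5.0, 2.0]
city_ids = [1, 2, 4]

top = constrained_top_k(toy_logits, city_ids, k=2)
print([(toy_vocab[i], round(p, 3)) for i, p in top])
```

Because the softmax is taken only over the allowed ids, the returned probabilities sum to 1 within that subset regardless of how much mass the unconstrained model put elsewhere.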

2

stable-diffusion-v1-4 | Model | 51/100

via “variable output resolution via latent interpolation”

text-to-image model. 621,488 downloads.

Unique: Supports variable output resolutions without retraining by changing the size of the latent grid. Because the VAE downsamples by a factor of 8, height and width must be multiples of 8 (e.g., 384, 512, 576, 640, 704, 768). Quality degrades gradually as the resolution moves away from the 512x512 training resolution.

vs others: More flexible than fixed-resolution models; comparable to proprietary services' resolution support but with full control and transparency.
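The multiple-of-8 constraint follows from the latent grid size. A minimal sketch, assuming the 8x spatial downsampling and 4 latent channels of the Stable Diffusion v1 VAE; the helper name `latent_shape` is hypothetical:

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return the (channels, rows, cols) latent shape for an output size.

    Raises ValueError when the requested size does not map onto a whole
    number of latent cells.
    """
    if height % downsample or width % downsample:
        raise ValueError(
            f"height and width must be multiples of {downsample}, "
            f"got {height}x{width}"
        )
    return (latent_channels, height // downsample, width // downsample)

print(latent_shape(512, 512))  # training resolution
print(latent_shape(768, 576))  # a non-square request
```

A request such as 500x500 raises an error here because 500 is not divisible by 8.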

3

Infinity | Repository | 45/100

via “visual tokenization with variable-resolution vae supporting 2^16 to 2^64 vocabulary sizes”

[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Unique: Supports variable vocabulary sizes (2^16 to 2^64) through configurable quantization, enabling dynamic quality-latency trade-offs. Unlike fixed-vocabulary tokenizers (e.g., VQ-VAE with 8192 tokens), Infinity's VAE can scale vocabulary exponentially without retraining, adapting to different deployment constraints.

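vs others: Spans code widths from 16 to 64 bits per token (a 4× range in bits, corresponding to vocabularies of 2^16 to 2^64 entries), whereas fixed-vocabulary tokenizers offer a single size. This gives fine-grained control over reconstruction quality and sequence length without model retraining.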
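To make the 2^16 to 2^64 range concrete: with a bitwise code, a token made of d bits addresses an implicit vocabulary of 2^d entries, so no explicit codebook table is needed. The sketch below uses a simple sign-based quantizer as a stand-in; it is an assumption for illustration, not the Infinity repository's implementation, and all function names are hypothetical.

```python
def quantize_bits(features):
    """Map each real-valued channel to one bit by its sign (1 if positive)."""
    return [1 if x > 0 else 0 for x in features]

def bits_to_index(bits):
    """Pack a list of bits (most significant first) into an integer token id."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index

def vocabulary_size(num_bits):
    """Number of distinct tokens a num_bits-wide code can address."""
    return 1 << num_bits

bits = quantize_bits([0.3, -1.2, 0.0, 2.0])
print(bits, bits_to_index(bits))
for d in (16, 32, 64):
    print(d, vocabulary_size(d))
```

Widening the code from 16 to 64 channels changes only the number of bits per token; the quantization rule itself stays the same, which is why the vocabulary can grow exponentially without a larger lookup table.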

4

CogView | Repository | 44/100

via “image super-resolution via autoregressive token upsampling”

Text-to-Image generation. The repo for NeurIPS 2021 paper "CogView: Mastering Text-to-Image Generation via Transformers".

Unique: Performs super-resolution entirely in discrete token space using the same VQ-VAE tokenizer as the base model, enabling semantic-aware upsampling that preserves learned image structure. Reuses the cogview-sr checkpoint trained specifically for token-space upsampling, avoiding pixel-space artifacts.

vs others: Avoids pixel-space upsampling artifacts by operating in the learned token manifold, but requires strict token-distribution compatibility and is slower than single-pass CNN-based upsampling. Semantic preservation is stronger than with GAN-based methods because of transformer attention.

5

Google: Gemma 3 12B (free) | Model | 24/100

via “vision-language understanding with 128k token context”

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities,...

Unique: Combines vision encoding with a 128k token context window in a single unified model, allowing visual reasoning to leverage extended document history without separate retrieval or context management systems. Uses a patch-based vision encoder that integrates directly into the transformer token stream rather than as a separate modality branch.

vs others: Offers free access to multimodal reasoning with a context length equivalent to GPT-4V's 128k window, and is described as having lower latency than Claude 3.5 Vision on document-heavy workloads because of its optimized vision encoder design.

6

Muse: Text-To-Image Generation via Masked Generative Transformers (Muse) | Product | 23/100

via “vq-vae discrete tokenization for image compression and generation”

* ⭐ 02/2023: [Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1)](https://arxiv.org/abs/2302.03011)

Unique: Leverages a learned discrete codebook from VQ-VAE rather than a fixed quantization scheme, allowing the model to learn task-specific token representations that optimize for image generation quality rather than reconstruction fidelity.

vs others: More efficient than pixel-space diffusion models because token sequences are 256x shorter than pixel sequences, which reduces self-attention cost from O(n²) to O(n²/256²) (a sequence of length n/256 costs (n/256)²) while maintaining competitive image quality.
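The 256x figure can be checked with a short calculation. This sketch assumes a 256x256 image and a tokenizer that downsamples by 16 in each spatial dimension; both values and the helper names are illustrative assumptions, and only the quadratic self-attention term is modeled.

```python
def token_count(height, width, downsample):
    """Number of discrete tokens for an image after spatial downsampling."""
    return (height // downsample) * (width // downsample)

def attention_cost_ratio(long_len, short_len):
    """How many times cheaper quadratic attention is on the shorter sequence."""
    return (long_len / short_len) ** 2

height = width = 256      # assumed image size
pixel_len = height * width                      # one position per pixel
token_len = token_count(height, width, 16)      # assumed 16x downsampling

print(pixel_len, token_len, pixel_len // token_len)
print(attention_cost_ratio(pixel_len, token_len))
```

Under these assumptions the image becomes a 16x16 grid of 256 tokens instead of 65,536 pixel positions, so the sequence is 256 times shorter and the quadratic attention term shrinks by a factor of 256².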

7

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT) | Product | 23/100

via “discrete visual tokenization with learned codebook”

* ⭐ 09/2022: [PaLI: A Jointly-Scaled Multilingual Language-Image Model (PaLI)](https://arxiv.org/abs/2209.06794)

Unique: Uses learned discrete codebooks to tokenize images, creating a bridge between continuous vision features and discrete language tokens. This enables applying BERT-style masked language modeling directly to images without pixel-level reconstruction.

vs others: Provides better semantic alignment with language models than continuous feature representations because discrete tokens create a shared vocabulary between modalities, improving joint vision-language learning compared to approaches using separate continuous representations.
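Discrete visual tokenization with a learned codebook usually reduces to a nearest-neighbor lookup: each continuous patch feature is replaced by the index of its closest codebook vector. The following is a generic vector-quantization sketch with a hypothetical 3-entry toy codebook, not BEiT's actual tokenizer.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_code(vector, codebook):
    """Index of the codebook entry closest to the given feature vector."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )

def tokenize(patch_features, codebook):
    """Convert a sequence of continuous patch features into token ids."""
    return [nearest_code(v, codebook) for v in patch_features]

# Hypothetical toy data: 2-D features and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
patches = [[0.9, 0.1], [0.1, 0.8], [0.05, -0.02]]

tokens = tokenize(patches, codebook)
print(tokens)
```

The resulting integer ids play the same role as word-piece ids in text, which is what allows a masked-prediction objective to be applied to image patches.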
