Capability
Multilingual Web Corpus With Consistent Annotation Across 5 Languages
9 artifacts provide this capability.
Top Matches
via “language-aware dataset organization and filtering across 100+ languages”
5.85 billion image-text pairs, foundational for image generation.
Unique: Pre-organized into language clusters (2.3B English pairs; 2.2B multilingual pairs spanning 100+ languages), enabling direct access to language-specific subsets without re-processing. Supports training non-English vision-language models at scale.
vs others: Larger multilingual coverage than most open datasets. However, language assignment is less reliable than in human-curated datasets, and the language distribution is skewed toward English and other high-resource languages.
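Example: a minimal sketch of selecting a language-specific subset from records that already carry a language tag. The record layout (`lang`, `lang_conf`), the helper names, and the sample rows are all hypothetical and are not taken from the dataset's actual schema; the confidence threshold is one possible way to work around the less reliable automatic language assignment noted above.

```python
from collections import Counter
from typing import Iterable, Iterator, TypedDict


class Pair(TypedDict):
    """Hypothetical record layout for one image-text pair."""
    url: str
    caption: str
    lang: str         # assumed field: detected language code
    lang_conf: float  # assumed field: detector confidence in [0, 1]


def select_language(
    pairs: Iterable[Pair], lang: str, min_conf: float = 0.0
) -> Iterator[Pair]:
    """Yield pairs tagged `lang` with detector confidence >= `min_conf`."""
    for pair in pairs:
        if pair["lang"] == lang and pair["lang_conf"] >= min_conf:
            yield pair


def language_counts(pairs: Iterable[Pair]) -> Counter:
    """Count pairs per language tag, e.g. to inspect distribution skew."""
    return Counter(pair["lang"] for pair in pairs)


# Made-up sample rows, for illustration only.
SAMPLE: list[Pair] = [
    {"url": "https://example.com/1.jpg", "caption": "A dog in the park",
     "lang": "en", "lang_conf": 0.99},
    {"url": "https://example.com/2.jpg", "caption": "Ein Hund im Park",
     "lang": "de", "lang_conf": 0.95},
    {"url": "https://example.com/3.jpg", "caption": "IMG 0042",
     "lang": "de", "lang_conf": 0.40},
    {"url": "https://example.com/4.jpg", "caption": "Mbwa katika bustani",
     "lang": "sw", "lang_conf": 0.80},
    {"url": "https://example.com/5.jpg", "caption": "A cat on a sofa",
     "lang": "en", "lang_conf": 0.97},
]

if __name__ == "__main__":
    # Keep only German pairs the detector was reasonably sure about.
    for pair in select_language(SAMPLE, "de", min_conf=0.5):
        print(pair["caption"])
    print(language_counts(SAMPLE))
```

Raising `min_conf` trades subset size for cleaner language labels; the right threshold depends on the detector and on the target language.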