MINT-1T-PDF-CC-2023-14
Free dataset by mlfoundations. 572,108 downloads.
Capabilities: 6 decomposed
large-scale multimodal document-image-text dataset loading
Medium confidence
Provides access to 1 trillion tokens of PDF-derived multimodal data (images + OCR text) from Common Crawl 2023-14, organized in WebDataset format for distributed streaming. Uses tar-based sharding architecture enabling efficient parallel loading across GPUs without requiring full dataset materialization on disk. Integrates with HuggingFace datasets library and MLCroissant metadata standard for reproducible, versioned access to 5.7M+ document samples.
Combines 1T tokens of PDF-derived content from Common Crawl with WebDataset sharding for distributed streaming, enabling sub-second per-sample access without full materialization — unlike static image-text datasets (LAION, CC3M) that require download or local indexing
Offers 10x larger scale than LAION-5B for document-specific content with native OCR alignment, while maintaining streaming efficiency that COCO and Flickr30K lack due to their centralized file structures
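A minimal sketch of how a WebDataset-style tar shard groups files into samples, using only the Python standard library. The file extensions (`json`, `pdf`) and the sample keys are illustrative assumptions, not the dataset's confirmed layout; `make_shard` and `iter_samples` are hypothetical helpers, and real training code would more likely rely on the `webdataset` or `datasets` libraries.

```python
import io
import tarfile


def make_shard(samples):
    """Build an in-memory tar shard; each sample is {extension: bytes} under one key."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, parts in samples.items():
            for ext, payload in parts.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iter_samples(shard_bytes):
    """Yield (key, {ext: bytes}), grouping consecutive members that share a basename."""
    current_key, current = None, {}
    # Mode "r|" reads the archive strictly front to back, as a streaming loader would.
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current


# Hypothetical two-sample shard: one sample with metadata and a PDF, one with metadata only.
shard = make_shard({
    "doc0001": {"json": b'{"texts": ["hello"]}', "pdf": b"%PDF-1.4"},
    "doc0002": {"json": b'{"texts": []}'},
})
for key, parts in iter_samples(shard):
    print(key, sorted(parts))
```

Because samples are recovered by reading the tar front to back, a loader never needs an index of the whole dataset, which is what makes shard-at-a-time streaming possible.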
ocr-aligned image-text pair extraction from pdfs
Medium confidence
Automatically extracts and aligns image renderings of PDF pages with their corresponding OCR text output, preserving spatial relationships and document structure. Uses PDF parsing to generate page images at consistent DPI (72-300) and applies OCR engines (likely Tesseract or similar) to produce character-level text with bounding box metadata. Deduplication via content hashing removes near-duplicate pages across Common Crawl crawls.
Provides 1T-token scale OCR-image pairs with automatic deduplication across Common Crawl snapshots, using content hashing to eliminate redundant pages — most document datasets (DocVQA, RVL-CDIP) manually curate smaller, domain-specific collections without cross-crawl deduplication
Scales to 5.7M documents with automated deduplication, whereas DocVQA (12K docs) and IIT-CDIP (6M pages) require manual curation or are domain-specific; offers broader diversity than academic paper datasets (arXiv, S2-ORC)
streaming-based distributed dataset loading for multi-gpu training
Medium confidence
Implements WebDataset-compatible tar-based sharding that enables efficient parallel loading across distributed training clusters without materializing the full dataset on local storage. Each shard contains ~1000 samples; workers fetch shards on-demand and decompress in-memory, with built-in support for HuggingFace Datasets streaming mode and PyTorch DataLoader integration. Supports deterministic shuffling via seed-based shard ordering for reproducible training runs.
Uses tar-based WebDataset sharding with on-demand decompression and deterministic seed-based shuffling, enabling distributed training without centralized storage — most large datasets (ImageNet, COCO) require pre-download or NAS mounting, adding deployment complexity
Eliminates storage bottleneck compared to LAION-5B (requires 330GB download) and provides native streaming support that static dataset formats (COCO, Flickr30K) lack; comparable to LAION's WebDataset approach but with larger scale and PDF-specific preprocessing
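The seed-based shard ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the dataset's or any library's actual implementation: the shard names are made up, and `shards_for_worker` is a hypothetical helper. Every worker derives the same shuffled order from `(seed, epoch)` and then takes a strided slice, so no coordination between workers is needed.

```python
import random


def shards_for_worker(shard_names, seed, epoch, rank, world_size):
    """Return this worker's shards after a shuffle that every worker can reproduce."""
    order = sorted(shard_names)  # canonical order, independent of how shards were listed
    random.Random(f"{seed}-{epoch}").shuffle(order)  # same seed and epoch give the same order
    return order[rank::world_size]  # strided, non-overlapping partition


# Hypothetical shard names; the real file names may differ.
names = [f"shard-{i:05d}.tar" for i in range(10)]
for rank in range(4):
    print(rank, shards_for_worker(names, seed=0, epoch=0, rank=rank, world_size=4))
```

Changing `epoch` reshuffles the shards while keeping the run reproducible. Shuffling samples within a shard would be a separate step, typically done with a bounded in-memory buffer.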
mlcroissant metadata standard compliance and reproducibility
Medium confidence
Publishes dataset metadata in the Croissant format (an MLCommons metadata standard for machine learning datasets, built on schema.org), enabling automated discovery, versioning, and reproducible access through a standardized schema. Includes structured descriptions of splits, features, licenses, and data provenance (Common Crawl 2023-14 snapshot). Enables tools like HuggingFace Hub and Croissant parsers to automatically validate dataset integrity and generate data cards.
Implements the MLCommons Croissant standard for dataset metadata, enabling automated discovery and validation through a standardized schema; most large datasets (LAION, COCO) publish metadata in ad-hoc formats (JSON, YAML) without formal schema compliance
Provides machine-readable, standardized metadata that enables automated tooling and discovery, whereas LAION and other large datasets rely on unstructured documentation; comparable to Hugging Face's dataset cards but with a formal, machine-validated schema
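To show what machine-readable metadata looks like in practice, here is a heavily simplified, Croissant-style JSON-LD document and a small reader for it. The document is an illustrative assumption: it is not the dataset's published metadata, it has not been validated against the Croissant specification, and the `@id` values and field names are invented. `summarize` is a hypothetical helper; a real pipeline would more likely use the `mlcroissant` library.

```python
import json

# Simplified, hand-written sketch of Croissant-style metadata (NOT the real file).
CROISSANT_SKETCH = """
{
  "@context": {"@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
  "@type": "Dataset",
  "name": "MINT-1T-PDF-CC-2023-14",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {"@type": "cr:FileSet", "@id": "shards",
     "encodingFormat": "application/x-tar", "includes": "*.tar"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "@id": "pages",
     "field": [{"@type": "cr:Field", "@id": "pages/text", "dataType": "Text"}]}
  ]
}
"""


def summarize(metadata_json):
    """Pull out the name, license, file sets, and field ids from the metadata."""
    doc = json.loads(metadata_json)
    return {
        "name": doc["name"],
        "license": doc["license"],
        "files": [d["@id"] for d in doc.get("distribution", [])],
        "fields": [f["@id"] for rs in doc.get("recordSet", []) for f in rs.get("field", [])],
    }


print(summarize(CROISSANT_SKETCH))
```

The point of the format is that a tool can answer "what files exist, under what license, with which fields" without reading any prose documentation.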
common crawl 2023-14 snapshot filtering and deduplication
Medium confidence
Curates and deduplicates content from Common Crawl's 2023-14 snapshot using content hashing (likely SHA-256 or similar) to remove near-duplicate PDF pages across multiple crawl cycles. Applies language detection to filter predominantly English documents and removes known low-quality sources. Preserves document source URLs and metadata for traceability.
Applies cross-crawl deduplication using content hashing to Common Crawl 2023-14 snapshot, eliminating redundant PDFs that appear in multiple crawl cycles — most web-scale datasets (LAION, C4) deduplicate within a single crawl but not across temporal snapshots
Provides cleaner, deduplicated content than raw Common Crawl while maintaining web-scale diversity; more authentic than manually curated datasets (DocVQA, RVL-CDIP) but less curated than academic paper collections (arXiv, S2-ORC)
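A minimal sketch of hash-based deduplication, assuming SHA-256 over raw document bytes (the listing itself only says "likely SHA-256 or similar"). The URLs are placeholders and `dedupe_exact` is a hypothetical helper. Note that a cryptographic hash only catches byte-identical copies: a document that differs by a single byte hashes differently, so true near-duplicate detection would need a fuzzy technique such as MinHash.

```python
import hashlib


def dedupe_exact(documents):
    """Keep the first occurrence of each distinct byte payload, in input order."""
    seen = set()
    kept = []
    for url, payload in documents:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((url, payload))  # source URL is preserved for traceability
    return kept


docs = [
    ("https://a.example/report.pdf", b"same bytes"),
    ("https://b.example/mirror.pdf", b"same bytes"),   # exact copy: dropped
    ("https://c.example/variant.pdf", b"same bytes "),  # one extra byte: kept
]
for url, _ in dedupe_exact(docs):
    print(url)
```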
variable-resolution image rendering with dpi consistency
Medium confidence
Renders PDF pages to images at configurable DPI (72-300 range) to balance visual fidelity with storage efficiency. Uses PDF rendering engines (likely poppler or similar) to convert vector-based PDF content to raster images while preserving text and layout information. Applies consistent DPI across dataset to enable batch processing without resolution normalization.
Applies consistent DPI rendering across 5.7M documents from diverse PDF sources, enabling batch processing without per-sample resolution normalization — most document datasets (DocVQA, RVL-CDIP) use variable resolutions or require downstream normalization
Provides consistent rendering quality that enables efficient batching, whereas raw PDF rendering varies by engine; more scalable than manual curation but less controlled than synthetic document generation
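The fidelity versus storage trade-off in the 72-300 DPI range comes down to simple arithmetic: PDF page sizes are measured in points (1 pt = 1/72 inch), so pixel dimensions scale linearly with DPI and uncompressed size scales with its square. The sketch below works this out for a US Letter page; the page size and the uncompressed 8-bit RGB assumption are illustrative, and both function names are hypothetical.

```python
def rendered_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page whose size is given in PDF points (1 pt = 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def raw_rgb_bytes(width_pt, height_pt, dpi):
    """Uncompressed size of the rendered page at 3 bytes per pixel (8-bit RGB)."""
    width_px, height_px = rendered_size(width_pt, height_pt, dpi)
    return width_px * height_px * 3


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (72, 150, 300):
    print(dpi, rendered_size(612, 792, dpi), raw_rgb_bytes(612, 792, dpi))
```

Going from 72 to 300 DPI multiplies the pixel count by (300/72)^2, roughly 17x, which is why a fixed DPI matters for predictable batch sizes.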
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts sharing capabilities
Artifacts that share capabilities with MINT-1T-PDF-CC-2023-14, ranked by overlap. Discovered automatically through the match graph.
MINT-1T-PDF-CC-2023-23
Dataset by mlfoundations. 633,111 downloads.
MINT-1T-PDF-CC-2024-18
Dataset by mlfoundations. 1,034,415 downloads.
MINT-1T-PDF-CC-2023-06
Dataset by mlfoundations. 539,406 downloads.
nbchr_pdfs
Dataset by daniilakk. 312,297 downloads.
MINT-1T-PDF-CC-2023-40
Dataset by mlfoundations. 857,357 downloads.
PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
Best For
- ✓ ML researchers training large vision-language models (CLIP, LLaVA scale)
- ✓ teams building document AI systems requiring diverse real-world PDF samples
- ✓ organizations needing pre-processed, deduplicated multimodal training corpora
- ✓ document layout analysis researchers
- ✓ teams building document understanding pipelines (form extraction, table recognition)
- ✓ OCR model developers needing diverse, real-world training examples
- ✓ ML teams with distributed training infrastructure (Ray, PyTorch DDP, DeepSpeed)
- ✓ organizations training models on cloud infrastructure with limited persistent storage
Known Limitations
- ⚠ 5.7M samples may be insufficient for training models >10B parameters without augmentation
- ⚠ OCR accuracy varies significantly by document quality, font, and language; no per-sample quality or confidence scores are provided
- ⚠ WebDataset format requires sequential access patterns; random sampling requires full enumeration
- ⚠ CC-BY-4.0 license requires attribution in derivative works; commercial use requires compliance verification
- ⚠ No built-in filtering for sensitive document types (medical, financial, PII); requires downstream curation
About
MINT-1T-PDF-CC-2023-14 is a dataset on Hugging Face with 572,108 downloads.