MINT-1T-PDF-CC-2023-50

multimodal image-text pair extraction from pdf documents at scale; reproducible dataset versioning and metadata discovery via mlcroissant standard

MINT-1T-PDF-CC-2023-23

Dataset by mlfoundations. 633,111 downloads.

large-scale multimodal document-image-text dataset loading; mlcroissant metadata standard compliance and reproducibility

MINT-1T-PDF-CC-2023-14

Dataset by mlfoundations. 572,108 downloads.

multimodal document-to-text extraction at scale; document-domain dataset sampling and filtering

MINT-1T-PDF-CC-2023-40

Dataset by mlfoundations. 857,357 downloads.

large-scale multimodal document-image-text dataset curation and indexing; streaming dataset access with lazy loading and batching

MINT-1T-PDF-CC-2023-06

Dataset by mlfoundations. 539,406 downloads.

dataset schema introspection and metadata extraction; structured text dataset loading with multi-format support

debug

Dataset by rtrm. 415,242 downloads.

Best For

✓ ML researchers training vision-language models (CLIP, LLaVA, etc.)
✓ Teams building document intelligence systems for enterprise use
✓ Researchers studying multimodal learning on real-world data distributions
✓ Teams with distributed training infrastructure (multi-GPU, multi-node setups)
✓ Researchers with limited local storage but good network bandwidth
✓ Production ML pipelines requiring fault-tolerant data loading
✓ ML teams with governance requirements (compliance, licensing tracking)
✓ Researchers publishing models and needing transparent data attribution

Known Limitations

⚠ English-only content; no multilingual document support
⚠ Fixed to 2023 Common Crawl snapshot; no real-time updates or historical versions
⚠ WebDataset format requires a compatible loader (such as the webdataset library); samples are read as an iterable stream, so index-based sampling and shuffling in the standard PyTorch DataLoader are unavailable without adapter code
⚠ Image quality varies by source PDF; no quality filtering is applied, and deduplication is URL-based only
⚠ No built-in train/val/test splits; manual partitioning is required for reproducible experiments
⚠ Sequential access pattern within tar archives; random access requires scanning the archive or building a separate index
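
Because the dataset ships without predefined splits, one common workaround is to derive a split from a stable hash of each shard name or sample key. The following is a minimal sketch under that assumption; the shard names, the 90/5/5 ratio, and the `assign_split` helper are all hypothetical and are not part of the dataset.

```python
import hashlib


def assign_split(key: str, val_pct: int = 5, test_pct: int = 5) -> str:
    """Map a shard name or sample key to a split, stably across runs."""
    # SHA-256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Hypothetical shard names, for illustration only.
shards = [f"shard-{i:05d}.tar" for i in range(1000)]
splits = {name: assign_split(name) for name in shards}
```

Splitting by shard rather than by sample keeps each tar file in a single split, which fits the sequential access pattern noted above.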

Requirements

Python 3.8+

HuggingFace datasets library (>=2.0)

webdataset Python library (>=0.2.0) for tar-based streaming

~500GB+ disk space for full dataset, or streaming access via HuggingFace Hub

mlcroissant library for metadata inspection (optional)

Network bandwidth >=10 Mbps for practical training throughput

PyTorch or TensorFlow with distributed training support

Input / Output

Accepts: PDF documents (raw binary from Common Crawl), Document URLs and metadata, Tar-archived sample collections, Dataset configuration (shard indices, batch size, worker count), MLCroissant JSON metadata file, Common Crawl 2023 WARC index, PDF URLs and metadata, PDF documents with embedded images and text, CC-BY-4.0 license declaration, Source URL metadata for attribution

Produces: Extracted text (UTF-8 strings with layout preservation), Image tensors (PIL Image or numpy arrays), Document metadata (source URL, extraction timestamp, page count), Batched tensors (images, text), Sample metadata dictionaries, Streaming iterators compatible with training loops, Parsed schema (field names, types, cardinality), Provenance metadata (source, timestamp, version), Licensing and attribution information, Deduplicated PDF document collection, Extracted text and images per document, Source URL and crawl metadata, Images with bounding box coordinates, Text with page coordinates and positioning, Document structure metadata (page layout, text flow order), Licensed dataset with clear attribution requirements, Metadata enabling downstream attribution

UnfragileRank

Adoption: 15% (35% weight)

Quality: 14% (25% weight)

Ecosystem: 60% (20% weight)

Match Graph: 10% (15% weight)

Freshness: 75% (5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
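
The page does not state how the five components are combined. As an illustration only, the sketch below assumes a plain weighted average of the listed scores and weights; the `weighted_score` function is hypothetical, and the real formula may differ.

```python
# Component scores and weights (both in percent) as listed above.
components = {
    "adoption": (15, 35),
    "quality": (14, 25),
    "ecosystem": (60, 20),
    "match_graph": (10, 15),
    "freshness": (75, 5),
}


def weighted_score(parts: dict) -> float:
    """Weighted average of (score, weight) pairs; an assumed formula."""
    total_weight = sum(weight for _, weight in parts.values())
    return sum(score * weight for score, weight in parts.values()) / total_weight


score = weighted_score(components)
```

Under that assumption the listed values combine to 26 out of 100, with the Ecosystem component contributing the largest share.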

Type: Dataset

6 capabilities

About

MINT-1T-PDF-CC-2023-50 is a dataset on HuggingFace with 796,577 downloads.

Alternatives to MINT-1T-PDF-CC-2023-50

wink-embeddings-sg-100d (24, Repository)

100-dimensional English word embeddings for wink-nlp

voyage-ai-provider (30, API)

Voyage AI Provider for running Voyage AI models with Vercel AI SDK

@vibe-agent-toolkit/rag-lancedb (27, Agent)

LanceDB implementation of RAG interfaces for vibe-agent-toolkit

vectra (41, Repository)

A lightweight, file-backed vector database for Node.js and browsers with Pinecone-compatible filtering and hybrid BM25 search.

Data Sources

huggingface

Capabilities: 6 decomposed

multimodal PDF-to-text extraction at scale

Medium confidence

Best for

ML researchers training vision-language models (CLIP, LLaVA, etc.)

Teams building document intelligence systems for enterprise use

Researchers studying multimodal learning on real-world data distributions

Requires

Python 3.8+

HuggingFace datasets library (>=2.0)

webdataset library for tar-based streaming

Limitations

English-only content — no multilingual document support

Fixed to 2023 Common Crawl snapshot — no real-time updates or historical versions

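WebDataset format requires a compatible loader (such as the webdataset library); samples are read as an iterable stream, so index-based sampling and shuffling in the standard PyTorch DataLoader are unavailable without adapter code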

streaming dataset access via webdataset protocol

Medium confidence

Best for

Teams with distributed training infrastructure (multi-GPU, multi-node setups)

Researchers with limited local storage but good network bandwidth

Production ML pipelines requiring fault-tolerant data loading

Requires

webdataset Python library (>=0.2.0)

HuggingFace datasets library (>=2.0)

Network bandwidth >=10 Mbps for practical training throughput

Limitations

Sequential access pattern within tar archives; random access requires scanning the archive or building a separate index

Network latency adds ~50-200ms per tar shard fetch depending on cloud region

Requires a training framework that accepts iterable-style datasets (PyTorch Lightning, Hugging Face Transformers); with a raw PyTorch DataLoader, shuffling and per-worker shard splitting need adapter code
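
The sequential access pattern can be illustrated without any network access. The sketch below is not the webdataset library itself: it uses only the standard-library `tarfile` module to read an in-memory shard in stream mode and groups files that share a basename into one sample, which is the convention WebDataset-style shards follow. The file names and the `json`/`txt` extensions are hypothetical.

```python
import io
import tarfile
from typing import Dict, Iterator, Optional, Tuple


def make_demo_shard() -> bytes:
    """Build a tiny in-memory tar holding two samples (hypothetical names)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [
            ("doc0001.json", b'{"url": "https://example.org/a.pdf"}'),
            ("doc0001.txt", b"first document text"),
            ("doc0002.json", b'{"url": "https://example.org/b.pdf"}'),
            ("doc0002.txt", b"second document text"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iter_samples(shard: bytes) -> Iterator[Tuple[str, Dict[str, bytes]]]:
    """Yield (key, {extension: bytes}) pairs in archive order."""
    current_key: Optional[str] = None
    fields: Dict[str, bytes] = {}
    # Mode "r|" reads the tar as a forward-only stream, as with a network fetch.
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if current_key is not None and key != current_key:
                yield current_key, fields
                fields = {}
            current_key = key
            fields[ext] = tar.extractfile(member).read()
    if current_key is not None:
        yield current_key, fields


samples = list(iter_samples(make_demo_shard()))
```

Since each sample is emitted only when the next key appears, reaching the last sample in a shard means reading everything before it, which is why random access is costly.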

mlcroissant metadata schema exposure

Medium confidence

Best for

ML teams with governance requirements (compliance, licensing tracking)

Researchers publishing models and needing transparent data attribution

Automated ML platforms building dataset discovery and validation layers

Requires

mlcroissant library (>=0.3.0) for metadata parsing

JSON schema validation tools (optional, for custom compliance checks)

Limitations

MLCroissant standard is still evolving — schema may change in future versions

Metadata does not include per-sample quality scores or filtering recommendations

No built-in validation of actual extracted content against declared schema

What makes it unique

Implements MLCroissant standard for machine-readable dataset metadata, enabling automated schema validation and licensing compliance checks rather than relying on human-readable documentation alone

vs alternatives

More structured and machine-actionable than HuggingFace dataset cards (which are markdown-based); enables programmatic validation and governance that generic dataset documentation cannot provide
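
Since nothing validates extracted content against the declared schema, a small check can be layered on top of the metadata. The sketch below parses a heavily trimmed, hypothetical Croissant-style JSON-LD document with the standard `json` module rather than the mlcroissant library; the field names `text` and `url` are assumptions, not the dataset's actual schema.

```python
import json

# Hypothetical, trimmed Croissant-style metadata; not the dataset's real file.
CROISSANT_JSON = """
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "default",
      "field": [
        {"@type": "cr:Field", "name": "text", "dataType": "sc:Text"},
        {"@type": "cr:Field", "name": "url", "dataType": "sc:URL"}
      ]
    }
  ]
}
"""


def declared_fields(metadata: dict) -> dict:
    """Return {field name: declared dataType} across all record sets."""
    fields = {}
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            fields[field["name"]] = field.get("dataType")
    return fields


def missing_fields(sample: dict, metadata: dict) -> list:
    """List declared fields that a sample does not provide."""
    return sorted(name for name in declared_fields(metadata) if name not in sample)


metadata = json.loads(CROISSANT_JSON)
fields = declared_fields(metadata)
gaps = missing_fields({"text": "hello"}, metadata)
```

A check like this only confirms that declared fields are present; verifying types or content quality would need additional logic.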

Common Crawl PDF document sourcing and deduplication

Medium confidence

Best for

Researchers building large-scale document understanding models

Teams needing representative samples of real-world PDF distributions

Organizations requiring transparent, reproducible data sourcing

Requires

Access to Common Crawl WARC archives (publicly available via AWS S3)

PDF parsing library (PyPDF2, pdfplumber, or similar)

URL canonicalization and deduplication logic

Limitations

Fixed to 2023 Common Crawl snapshot — no real-time updates or ability to add newer documents

URL-based deduplication may miss semantically duplicate content with different URLs

No filtering for document quality, readability, or relevance — includes low-quality scans and corrupted PDFs

vs alternatives

More cost-effective and reproducible than independent web crawling; larger and more diverse than manually curated document datasets, though with lower average quality due to lack of human filtering
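
The page does not describe the canonicalization rules used for URL-based deduplication. As a sketch of the general idea only, the code below lowercases the scheme and host, drops default ports and fragments, and keeps the first URL seen for each canonical form; the `canonicalize` and `dedupe` helpers and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Drop the fragment; keep the query, which may select another document.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def dedupe(urls):
    """Keep the first URL for each canonical form, preserving order."""
    seen, unique = set(), []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


urls = [
    "https://Example.org/report.pdf",
    "https://example.org:443/report.pdf#page=2",
    "https://example.org/other.pdf",
]
unique = dedupe(urls)
```

As the limitation above notes, this kind of matching cannot detect the same PDF served from two unrelated URLs; that would require content hashing or near-duplicate detection.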

image-text spatial relationship preservation in document extraction

Medium confidence

Best for

Researchers building document intelligence systems (table extraction, form parsing)

Teams training layout-aware vision-language models

Organizations building document classification systems

Requires

PDF parsing library with layout analysis (pdfplumber, pdfminer.six, or similar)

Custom data loaders to handle spatial metadata alongside images and text

Limitations

Spatial metadata increases dataset size by ~20-30% compared to flattened image-text pairs

Coordinate systems vary by PDF library — no standardization across different extraction tools

Complex layouts (multi-column, rotated text) may not preserve correctly depending on PDF structure
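
One way to cope with differing coordinate systems is to convert every box to a single convention before training. Native PDF coordinates place the origin at the bottom-left of the page, while tools such as pdfplumber measure from the top. The sketch below converts a bottom-left-origin box in points to top-left-origin coordinates normalized to [0, 1]; the `Box` type and the example values are hypothetical, and the dataset's own coordinate format is not specified here.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float  # left
    y0: float  # lower edge in the source convention, upper edge after conversion
    x1: float  # right
    y1: float  # upper edge in the source convention, lower edge after conversion


def to_top_left_normalized(box: Box, page_width: float, page_height: float) -> Box:
    """Convert a bottom-left-origin box in points to top-left-origin [0, 1] units."""
    return Box(
        x0=box.x0 / page_width,
        y0=(page_height - box.y1) / page_height,  # top edge after flipping y
        x1=box.x1 / page_width,
        y1=(page_height - box.y0) / page_height,  # bottom edge after flipping y
    )


# US Letter page (612 x 792 pt) with a hypothetical image box in the upper half.
box = to_top_left_normalized(Box(61.2, 396.0, 306.0, 792.0), 612.0, 792.0)
```

Normalizing by page size also makes boxes comparable across documents with different page dimensions.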

CC-BY-4.0 licensed dataset with transparent attribution

Medium confidence

Best for

Researchers and companies requiring open-source training data with commercial use rights

Organizations with strict licensing compliance requirements

Open-source AI projects needing legally clear training data

Requires

Understanding of CC-BY-4.0 license terms and attribution requirements

Mechanism to track and provide attribution to source URLs in downstream models

Limitations

CC-BY-4.0 requires attribution in derived works — no enforcement mechanism in dataset itself

Original PDF creators may not have intended their documents for ML training — ethical concerns despite legal compliance

Some PDFs may contain third-party copyrighted content (images, text) not covered by CC-BY-4.0

What makes it unique

Provides transparent CC-BY-4.0 licensing with source URL metadata enabling proper attribution, rather than generic 'open source' claims without clear provenance tracking

vs alternatives

More legally transparent than proprietary datasets; clearer licensing than some academic datasets that lack explicit license declarations, enabling confident commercial use
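
Tracking attribution downstream can start with a simple manifest built from the source URLs seen during training. The following is a minimal sketch, assuming each sample exposes its source under a `url` key; that key, the `build_attribution` helper, and the example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

LICENSE = "CC-BY-4.0"


def build_attribution(samples) -> dict:
    """Summarize source hosts and document counts for an attribution notice."""
    hosts = Counter(
        urlsplit(sample["url"]).hostname for sample in samples if sample.get("url")
    )
    return {
        "dataset": "MINT-1T-PDF-CC-2023-50",
        "license": LICENSE,
        "sources": [
            {"host": host, "documents": count} for host, count in sorted(hosts.items())
        ],
    }


# Hypothetical samples, for illustration only.
samples = [
    {"url": "https://example.org/a.pdf"},
    {"url": "https://example.org/b.pdf"},
    {"url": "https://example.com/c.pdf"},
    {"url": None},
]
manifest = build_attribution(samples)
```

A per-host summary keeps the manifest small; storing the full URL list would give finer-grained attribution at the cost of size.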

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
