via “multi-domain pretraining corpus assembly”
EleutherAI's 825 GiB training dataset (the Pile), drawn from 22 diverse sources.
Unique: Pioneered multi-domain curation by deliberately combining 22 diverse, high-quality subsets (academic papers, books, code, web text, and specialized sources) rather than scraping a single massive web corpus. This curation choice prioritizes knowledge breadth and domain coverage over raw scale, and it influenced the design of later open datasets such as RedPajama and Dolma.
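To make the assembly idea concrete, here is a minimal sketch of weighted multi-domain interleaving using Hugging Face `datasets`; the subset names, texts, and mixing weights are illustrative placeholders, not the Pile's actual sources or proportions.

```python
# A minimal sketch of multi-domain corpus assembly via weighted
# interleaving. Subsets and weights are hypothetical stand-ins.
from datasets import Dataset, interleave_datasets

# Toy stand-ins for curated domain subsets (academic, code, books, web).
subsets = {
    "academic": ["Abstract: We propose ...", "Theorem 1. Let x ..."],
    "code":     ["def tokenize(text): ...", "class Trainer: ..."],
    "books":    ["Call me Ishmael.", "It was the best of times ..."],
    "web":      ["Welcome to my homepage!", "Top 10 tips for ..."],
}
domains = [
    Dataset.from_dict({"text": texts, "domain": [name] * len(texts)})
    for name, texts in subsets.items()
]

# Hypothetical mixing weights: curated sources are upsampled relative
# to raw web text, mirroring the multi-domain curation idea.
weights = [0.35, 0.25, 0.25, 0.15]

mixed = interleave_datasets(
    domains,
    probabilities=weights,
    seed=0,
    stopping_strategy="all_exhausted",  # sample until every subset is used up
)
for row in mixed:
    print(row["domain"], "|", row["text"])
```

In a real pipeline the per-subset weights matter as much as the subset list: upweighting small, high-quality sources (and effectively repeating them over multiple epochs) is what distinguishes this approach from uniform sampling over a single web crawl.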
vs others: Broader domain coverage than Common Crawl-only datasets such as C4, and higher quality than raw web scrapes thanks to its curated academic, code, and book sources. Smaller than Falcon RefinedWeb (whose public extract alone is roughly 600B tokens) but more deliberately curated, and its held-out splits are widely used as a perplexity benchmark for model evaluation.
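As a sketch of how a held-out split like the Pile's is typically used for evaluation, the snippet below computes causal-LM perplexity on a placeholder string; the model choice and text are assumptions for illustration, and a real benchmark run would stream documents from the Pile test set instead.

```python
# A minimal perplexity-evaluation sketch; model and text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; chosen here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns mean token cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

# Perplexity is the exponential of the mean cross-entropy loss.
print(f"perplexity: {torch.exp(loss).item():.2f}")
```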