RedPajama v2Dataset45/100 via “multi-language web-scale document collection with 40+ quality annotations”
30 trillion token web dataset with 40+ quality signals per document.
Unique: Processes 84 CommonCrawl dumps (claimed as most complete coverage vs. C4, Refinedweb, Dolma, SlimPajama) with 40+ pre-computed quality annotations per document, enabling fine-grained data curation research without requiring users to reprocess raw CommonCrawl. Open-source processing scripts allow reproducibility and custom filtering strategies on a standardized base dataset.
vs others: Larger scale (30 trillion tokens vs. C4's 156B tokens, RedPajama-1T's 1T tokens) with richer quality annotations (40+ signals vs. minimal metadata in competitors) and multilingual coverage, making it superior for comparative curation research and training diverse language models.