Capability
Research Data Aggregation
20 artifacts provide this capability.
Top Matches
via “commoncrawl-scale data aggregation from 84 dumps”
A 30-trillion-token web dataset with 40+ quality signals per document.
Unique: Processes 84 CommonCrawl dumps into a single, consistently annotated dataset, claimed to offer the most complete coverage compared with C4, RefinedWeb, Dolma, and SlimPajama. Users no longer need to manage multiple dumps or implement their own aggregation logic.
vs others: Larger in scale (30 trillion tokens from 84 dumps) than competitors (C4: 156B tokens; RefinedWeb and Dolma: limited dumps). The unified dataset removes the aggregation burden from users, but it inherits the web biases of CommonCrawl.
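
To make the "aggregation burden" concrete, here is a minimal sketch of what merging several dumps under one annotation schema might involve. It is not the dataset's actual pipeline: the function names (`quality_signals`, `aggregate`), the two toy signals, and the exact-duplicate filter are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def quality_signals(text: str) -> dict:
    """Compute two toy per-document signals (hypothetical, not the real 40+)."""
    words = text.split()
    mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"word_count": len(words), "mean_word_length": mean_len}


def aggregate(dumps: Dict[str, Iterable[str]]) -> Iterator[dict]:
    """Merge documents from several dumps into one uniformly annotated stream.

    `dumps` maps a dump identifier to its document texts. Exact duplicates
    (by SHA-1 of the text) are dropped; this filter is an assumption of the
    sketch, not something the listing states.
    """
    seen = set()
    for dump_id in sorted(dumps):  # deterministic order across dumps
        for text in dumps[dump_id]:
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            yield {
                "dump": dump_id,
                "sha1": digest,
                "text": text,
                "signals": quality_signals(text),
            }


if __name__ == "__main__":
    toy = {"2023-06": ["hello world", "foo"], "2022-49": ["hello world"]}
    for record in aggregate(toy):
        print(record["dump"], record["signals"])
```

Every record carries the same `signals` keys regardless of which dump it came from, which is the property a pre-aggregated dataset gives users without their having to write this logic themselves.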