Dataset Filtering And Sampling For Model Training And Evaluation

1

Phi-3.5 Mini | Model | 59/100

via “synthetic and filtered training data quality optimization”

Microsoft's 3.8B model with 128K context for edge deployment.

Unique: Achieves 69% MMLU and competitive reasoning performance in 3.8B parameters through explicit focus on training data quality (synthetic + filtered) rather than scale, demonstrating that data curation can partially offset parameter count disadvantages

vs others: Prioritizes data quality over dataset size (vs. Llama 3.2 trained on broader web data), reducing bias and toxicity at the cost of potentially narrower knowledge coverage; enables stronger performance on benchmark tasks despite smaller size

2

Hugging Face Datasets | Dataset | 27/100

via “dataset filtering and sampling with complex query expressions”

Unique: Uses Arrow's compute kernels for filter expression evaluation, enabling efficient column-based filtering without materializing data. Implements deterministic sampling using seeded hashing to ensure reproducibility across runs.

vs others: More efficient than pandas filtering for large datasets because it uses Arrow's columnar format and lazy evaluation, and more flexible than SQL WHERE clauses because it supports custom Python functions.
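
The seeded-hashing idea mentioned above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `keep_example` and `sample_ids` are hypothetical helper names, and the example IDs are made up. The point is only that hashing a seed together with a stable example ID gives the same keep/drop decision on every run.

```python
import hashlib

def keep_example(example_id: str, seed: int, fraction: float) -> bool:
    """Decide deterministically whether an example is in the sample."""
    digest = hashlib.sha256(f"{seed}:{example_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare
    # it against the threshold that corresponds to the requested fraction.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(fraction * 2**64)

def sample_ids(ids, seed=0, fraction=0.1):
    """Return the IDs whose hash bucket falls below the threshold."""
    return [i for i in ids if keep_example(i, seed, fraction)]

# Hypothetical IDs; any stable string identifier would work.
ids = [f"doc-{n}" for n in range(1000)]
first = sample_ids(ids, seed=42, fraction=0.1)
second = sample_ids(ids, seed=42, fraction=0.1)
print(first == second)  # the same seed yields the same sample
```

Because each decision depends only on the seed and the ID, the sample does not change when the dataset is reordered or processed in shards.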

3

mdm_depth | Dataset | 25/100

via “depth dataset filtering and subset selection by scene attributes”

Dataset by robbyant. 388,267 downloads.

Unique: Uses lazy filtering in Hugging Face Datasets to avoid materializing the full dataset, so subsets can be created without downloading unused samples, which matters for large-scale datasets

vs others: More efficient than downloading full dataset and filtering locally; more flexible than pre-split dataset versions that lock users into fixed train/val/test divisions

4

droid_1.0.1 | Dataset | 25/100

via “task-agnostic trajectory filtering and sampling”

Dataset by cadene. 311,762 downloads.

Unique: Leverages Parquet metadata indexing to filter trajectories without loading full episodes, combined with stratified sampling to balance long-tail task distributions — avoiding the memory overhead and sampling bias of post-load filtering

vs others: Enables efficient task-specific data selection at the dataset level, whereas most robotics datasets require loading full data into memory and filtering in application code, incurring significant memory and I/O overhead
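
Stratified sampling over a long-tailed task distribution, as described above, can be sketched as follows. This is a minimal stdlib illustration, not the dataset's own tooling: the `stratified_sample` helper, the `task` field, and the task names are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_group, seed=0):
    """Draw up to `per_group` records from each group, reproducibly."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    sample = []
    # Iterate over groups in sorted order so the result depends only on the seed.
    for label in sorted(groups):
        members = groups[label]
        k = min(per_group, len(members))  # rare groups contribute all they have
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical long-tailed episode list: one common task, two rare ones.
episodes = (
    [{"task": "pick", "id": i} for i in range(90)]
    + [{"task": "pour", "id": i} for i in range(8)]
    + [{"task": "fold", "id": i} for i in range(2)]
)
balanced = stratified_sample(episodes, key="task", per_group=5, seed=0)
counts = {t: sum(1 for e in balanced if e["task"] == t) for t in ("pick", "pour", "fold")}
print(counts)  # {'pick': 5, 'pour': 5, 'fold': 2}
```

Capping each group rather than sampling proportionally keeps the common task from dominating the subset, at the cost of no longer reflecting the original frequencies.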

5

hellaswag | Dataset | 25/100

via “dataset filtering and subset selection by metadata”

Dataset by Rowan. 302,991 downloads.

Unique: Implements filtering via HuggingFace's columnar operations (Arrow) for efficient predicate pushdown, avoiding full dataset materialization while maintaining lazy evaluation semantics

vs others: More efficient than pandas filtering (columnar operations vs row-wise) and simpler than SQL queries, with native integration to HuggingFace's caching and streaming infrastructure

6

debug | Dataset | 24/100

via “dataset filtering and sampling for model evaluation”

Dataset by rtrm. 331,078 downloads.

Unique: Implements lazy evaluation for filter/map operations, deferring computation until data is accessed, enabling efficient filtering of large datasets without materializing intermediate results in memory

vs others: More memory-efficient than pandas filtering because operations are lazy; more reproducible than manual random sampling because random seeds are built-in and deterministic
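
The lazy-evaluation behavior described above can be illustrated with Python generators. This is a conceptual sketch only, not how any particular library implements it: `lazy_filter`, `is_long`, and the example records are hypothetical. The `calls` list exists purely to show when the predicate actually runs.

```python
from itertools import islice

calls = []  # records which examples the predicate has been evaluated on

def is_long(example):
    calls.append(example["id"])
    return len(example["text"]) > 5

def lazy_filter(examples, predicate):
    """Return a generator; no predicate runs until it is iterated."""
    return (ex for ex in examples if predicate(ex))

# A large source that is itself lazy, so nothing is built up front.
examples = ({"id": i, "text": "x" * i} for i in range(1_000_000))

filtered = lazy_filter(examples, is_long)
calls_before = len(calls)                 # 0: defining the filter did no work
first_three = list(islice(filtered, 3))   # pulls only as many rows as needed
print(calls_before, [ex["id"] for ex in first_three], len(calls))
# 0 [6, 7, 8] 9
```

Only the nine examples needed to find three matches are ever evaluated; the remaining rows of the source are never touched.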

7

FineFineWeb | Dataset | 24/100

via “text classification dataset sampling and filtering”

Dataset by m-a-p. 459,057 downloads.

Unique: Leverages HuggingFace's native filtering and sampling APIs (via .filter() and .select()) to enable in-memory or streaming-based subset extraction without full corpus download; supports seed-based reproducibility for deterministic splits across experiments

vs others: More flexible than static benchmark datasets (ImageNet, MNIST) because filtering is dynamic and user-defined; faster iteration than manual annotation while maintaining reproducibility through versioned dataset snapshots
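
The filter-then-select pattern with a fixed seed, described above, can be approximated in plain Python. This sketch mirrors the idea only and does not call the Hugging Face API: `filter_then_select`, the `label` field, and the `"math"` label are hypothetical stand-ins.

```python
import random

def filter_then_select(records, predicate, n, seed=0):
    """Keep records matching `predicate`, then pick `n` of them reproducibly."""
    matching = [i for i, rec in enumerate(records) if predicate(rec)]
    rng = random.Random(seed)
    rng.shuffle(matching)           # seeded shuffle of the matching indices
    chosen = sorted(matching[:n])   # restore original order for readability
    return [records[i] for i in chosen]

# Hypothetical labelled corpus: every third record carries the "math" label.
records = [{"id": i, "label": "math" if i % 3 == 0 else "other"} for i in range(300)]

subset_a = filter_then_select(records, lambda r: r["label"] == "math", n=10, seed=7)
subset_b = filter_then_select(records, lambda r: r["label"] == "math", n=10, seed=7)
print(subset_a == subset_b)  # the same seed gives the same subset
```

Working on indices rather than copying records keeps the intermediate state small, which is the same reason index-based selection is common in dataset libraries.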

8

MINT-1T-PDF-CC-2023-40 | Dataset | 24/100

via “document-domain dataset sampling and filtering”

Dataset by mlfoundations. 857,357 downloads.

Unique: Provides streaming access with metadata-based filtering on trillion-token dataset without requiring full download, using Hugging Face Datasets infrastructure for efficient subset construction. Enables on-demand domain-specific corpus creation from larger collection.

vs others: More flexible than fixed-size domain datasets (e.g., ArXiv papers, legal documents) by allowing dynamic filtering from larger corpus; more efficient than downloading full dataset for subset access.

9

upload2 | Dataset | 24/100

via “dataset filtering and sampling with predicate-based selection”

Dataset by Maynor996. 662,770 downloads.

Unique: Implements predicate pushdown to Arrow layer, allowing filters to be evaluated on disk before data is loaded into Python memory; supports lazy evaluation so filtered datasets are not materialized until iteration

vs others: More memory-efficient than pandas-based filtering because predicates operate on Arrow columnar format; faster than loading full dataset and filtering in Python because filtering happens at storage layer
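
The benefit of evaluating a predicate on a single column before building rows can be shown with a toy columnar layout. This is a conceptual sketch, not Arrow or Parquet code: the dict-of-lists store, the column names, and the `filter_indices` and `take` helpers are assumptions made for illustration.

```python
# A toy columnar table: one list per column, aligned by position.
columns = {
    "lang": ["en", "fi", "en", "sv"],
    "tokens": [120, 80, 300, 45],
    "text": ["a", "b", "c", "d"],
}

def filter_indices(columns, column, predicate):
    """Scan only the named column and return positions that match."""
    return [i for i, value in enumerate(columns[column]) if predicate(value)]

def take(columns, indices, names):
    """Materialize rows only for the selected positions and columns."""
    return [{name: columns[name][i] for name in names} for i in indices]

idx = filter_indices(columns, "lang", lambda v: v == "en")
rows = take(columns, idx, ["tokens", "text"])
print(idx)   # [0, 2]
print(rows)  # [{'tokens': 120, 'text': 'a'}, {'tokens': 300, 'text': 'c'}]
```

The filter reads one column and touches the others only for matching positions, which is the effect that columnar predicate pushdown aims for at the storage layer.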

10

fineweb-edu-translated | Dataset | 24/100

via “language-specific document filtering and sampling”

Dataset by Helsinki-NLP. 348,667 downloads.

Unique: Uses Hugging Face's columnar Parquet storage and streaming API to filter by language without materializing the full dataset. Most competing datasets require downloading the entire corpus or provide only coarse-grained splits (e.g., by language family rather than individual language codes)

vs others: Faster iteration than downloading full 384K-document corpus; more granular language selection than datasets offering only pre-split language-family buckets

11

MINT-1T-PDF-CC-2024-18 | Dataset | 24/100

via “multimodal dataset sampling and stratification for balanced model training”

Dataset by mlfoundations. 1,034,415 downloads.

Unique: Enables stratified sampling across document types and content properties at scale, allowing researchers to control training data distribution — most large datasets provide raw access without built-in stratification mechanisms

vs others: More flexible than fixed dataset splits; enables targeted evaluation on specific document categories; supports research on dataset bias and distribution effects

12

hd_tmp | Dataset | 22/100

Dataset by ayuo. 1,499,354 downloads.

Unique: Implements lazy filter evaluation using Apache Arrow's predicate pushdown, avoiding full dataset materialization; combines with stratified sampling for balanced subset creation without requiring pre-computed group labels

vs others: More memory-efficient than pandas-style filtering for large datasets, but less expressive than SQL queries for complex multi-condition filtering

13

nbchr_pdfs | Dataset | 22/100

via “document corpus search and sampling for research”

Dataset by daniilakk. 316,648 downloads.

Unique: Leverages HuggingFace's native dataset streaming and sampling APIs, enabling efficient subset creation without full corpus download, with reproducible random seeding for research rigor

vs others: More accessible than building custom search infrastructure over static PDF archives, though lacks domain-specific search capabilities (e.g., document type, layout features) compared to specialized document retrieval systems

14

V7 | Product

via “dataset filtering and sampling”

15

ActiveLoop.ai | Product

via “efficient data sampling and subset creation”

16

Encord | Product

via “data curation and filtering”

17

ChatHub | Product

via “model selection and filtering”

18

Datasaur | Product

via “data sampling for annotation”
