Version Control For Datasets With Branching And Tagging

1

Weights & Biases APIAPI58/100

via “dataset-versioning-and-lineage-tracking”

MLOps API for experiment tracking and model management.

Unique: Datasets are versioned as immutable artifacts (content-addressed) and automatically linked to experiments that use them, creating an auditable lineage chain from raw data → preprocessing → training → model. Aliases enable semantic versioning (e.g., 'production-data' always points to the latest approved dataset) without duplication. Integration with W&B Reports enables visual lineage dashboards.

vs others: Tighter integration with experiment tracking than DVC (no separate setup) and automatic lineage without manual metadata entry; supports self-hosted deployment unlike cloud-only data registries like Hugging Face Datasets.

2

The Stack v2Dataset58/100

via “dataset versioning and reproducibility tracking”

67 TB permissively licensed code dataset across 600+ languages.

Unique: Maintains semantic versioning and detailed changelogs for dataset releases, enabling researchers to cite specific versions and understand dataset evolution — more rigorous than one-off dataset releases without versioning

vs others: More reproducible than academic datasets that are released once without versioning, and more transparent than commercial datasets (Codex) that don't disclose version history or changes

3

EncordDataset57/100

via “dataset-versioning-and-lineage-tracking”

AI annotation platform with medical imaging support.

Unique: Encord's integrated dataset versioning with full lineage tracking enables reproducible model training and compliance documentation by maintaining complete audit trails from raw data through annotation to model deployment

vs others: Encord's unified versioning and lineage tracking is more efficient than competitors requiring separate version control systems (Git) and manual lineage documentation, enabling reproducible ML pipelines with built-in compliance support

4

Weights & BiasesPlatform56/100

via “dataset-versioning-with-artifact-lineage”

ML experiment tracking — logging, sweeps, model registry, dataset versioning, LLM tracing.

Unique: Integrates dataset versioning directly into the experiment tracking workflow — datasets are logged as artifacts within runs, creating automatic lineage between data versions and model versions without separate metadata management.

vs others: Simpler than DVC for teams already using W&B for experiment tracking because datasets are versioned in the same system as models and metrics, avoiding multi-tool coordination and metadata synchronization.

5

StarCoder DataDataset56/100

via “dataset versioning and reproducibility tracking”

783 GB curated code dataset from 86 languages with PII redaction.

Unique: Maintains versioned snapshots with full provenance tracking (processing parameters, deduplication thresholds, opt-outs) enabling reproducible model training and dataset auditing. Treats dataset composition as a first-class artifact requiring version control and documentation.

vs others: More reproducible than static dataset releases because it documents exact processing parameters and enables version-specific citations, allowing researchers to understand how dataset changes affect model behavior and supporting scientific reproducibility.

6

ClearMLRepository55/100

via “dataset versioning and artifact management with content-addressable storage”

Open-source MLOps — experiment tracking, pipelines, data management, auto-logging, self-hosted.

Unique: Implements content-addressable storage with SHA256-based deduplication across datasets, automatically tracking dataset lineage and associating versions with experiments via the Task context, supporting multi-cloud backends (S3, GCS, Azure) with unified API

vs others: Provides tighter integration with experiment tracking than DVC (which is primarily a Git-based versioning tool) and lower operational overhead than Pachyderm (which requires Kubernetes), though lacks DVC's Git-native workflow

7

ArgillaRepository55/100

via “dataset versioning and snapshot management”

Open-source data curation for LLM fine-tuning and RLHF.

Unique: Implements immutable snapshots with delta encoding and version metadata tracking, enabling efficient storage of dataset history while maintaining full audit trails with author attribution and change summaries

vs others: Provides built-in versioning unlike Label Studio (requires external version control), and simpler than DVC-based approaches by storing versions within the platform rather than requiring separate infrastructure

8

DVCRepository55/100

via “content-addressable data versioning with git-tracked metadata”

Git for data and ML — version large files, experiment tracking, pipeline DAGs, remote storage.

Unique: Uses Git as the single source of truth for metadata (.dvc files) while separating data storage, enabling version control without Git's file size limitations. The Output class implements content-addressable storage with automatic deduplication, unlike traditional Git LFS which stores full copies per version.

vs others: Lighter than Git LFS (no full-file copies per version) and more flexible than DVC-less approaches because metadata lives in Git history, enabling reproducible data retrieval across branches and commits.

9

deeplakeMCP Server51/100

Deeplake is AI Data Runtime for Agents. It provides serverless postgres with a multimodal datalake, enabling scalable retrieval and training.

Unique: Applies Git-like version control semantics to datasets rather than code, with commits, branches, and tags stored as delta snapshots rather than full copies. Enables collaborative dataset curation workflows where teams branch independently and merge changes, with conflict detection on overlapping tensor modifications.

vs others: More sophisticated than simple dataset snapshots (like DVC) because it supports branching and merging; more efficient than full-copy versioning because it stores only deltas between versions, reducing storage by 70-90% for typical workflows.

10

DVC (deprecated)Extension42/100

via “data-versioning-with-remote-storage-sync”

Machine learning experiment management with tracking, plots, and data versioning.

Unique: Uses content-addressable storage (SHA256 hashing) to deduplicate data across versions and experiments, reducing storage costs and enabling efficient branching of datasets. Unlike Git LFS (which stores pointers), DVC stores actual file hashes in dvc.lock, enabling deterministic reproduction of data pipelines.

vs others: More flexible than Git LFS for multi-version data management and supports more storage backends, but requires explicit pull/push operations unlike Git's automatic tracking, and lacks the simplicity of Git LFS for small binary files.

11

DVC by lakeFSExtension36/100

via “data versioning and remote storage synchronization”

Machine learning experiment management with tracking, plots, and data versioning.

Unique: Separates data versioning from code versioning by storing only content hashes in Git while maintaining actual data on remote backends, enabling teams to version large datasets without Git repository bloat. Uses content-addressable storage (hash-based deduplication) to avoid storing duplicate data across versions, reducing storage costs and network bandwidth.

vs others: More lightweight than DVC standalone CLI by integrating directly into VS Code UI, and avoids proprietary data platforms (Pachyderm, Delta Lake) by using standard cloud storage backends (S3, Azure, GCS) that teams already operate, reducing vendor lock-in.

12

dvcCLI Tool29/100

via “git-integrated data versioning with content-addressed storage”

Git for data scientists - manage your code and data together

Unique: Implements a two-layer storage model (Git metadata + content-addressed cache) with automatic deduplication via SHA256, allowing teams to version datasets without Git bloat while maintaining full reproducibility through immutable hashes. The Repo class acts as a central coordinator between Git's SCM layer and DVC's FileSystem abstraction, enabling transparent data management.

vs others: More lightweight than DVC alternatives like Pachyderm (no Kubernetes required) and more Git-native than cloud-only solutions like Weights & Biases, but requires explicit remote storage setup unlike some commercial competitors

13

Hugging face datasetsDataset27/100

via “dataset versioning and reproducibility with commit-based tracking”

[Slack](https://camel-kwr1314.slack.com/join/shared_invite/zt-1vy8u9lbo-ZQmhIAyWSEfSwLCl2r2eKA#/shared-invite/email)

Unique: Uses content-addressed storage with commit hashes derived from dataset contents and transformation DAGs, enabling automatic deduplication of identical datasets across versions. Integrates with Hugging Face Hub's Git-based infrastructure for seamless version management without separate tooling.

vs others: More integrated with ML workflows than DVC (Data Version Control) because it's built into the Hugging Face ecosystem and doesn't require separate Git LFS setup, while providing stronger reproducibility guarantees than manual versioning.

14

datasetsDataset26/100

via “dataset versioning and hub repository management with git-based tracking”

HuggingFace community-driven open-source library of datasets

Unique: Integrates Git-based version control with Hugging Face Hub for dataset versioning, using Git LFS for efficient large file storage. The system automatically manages dataset cards and metadata, providing a unified interface for dataset publication and collaboration.

vs others: More integrated than manual Git workflows; provides automatic dataset card generation unlike raw Git repositories; Hub integration enables discoverability unlike private Git repos.

15

documentation-imagesDataset24/100

via “version-control-and-reproducibility”

Dataset by huggingface. 25,31,937 downloads.

Unique: Leverages HuggingFace's git-based versioning infrastructure to provide dataset version control as a first-class feature, eliminating the need for manual snapshot management or external version control systems

vs others: More integrated than external version control (DVC, Pachyderm) because versioning is built into the dataset platform itself, and more transparent than snapshot-based systems because full git history is queryable

16

mdm_depthDataset24/100

via “depth dataset versioning and reproducibility tracking”

Dataset by robbyant. 3,88,267 downloads.

Unique: Integrates with HuggingFace Hub's native Git versioning, allowing researchers to specify exact dataset versions in code (e.g., `revision='v2.1'`) without manual archive management; automatically tracks dataset lineage and preprocessing changes

vs others: More transparent and auditable than proprietary dataset platforms (AWS Open Data, Google Dataset Search) that don't expose version history; simpler than maintaining separate dataset registries or data catalogs

17

hellaswagDataset24/100

via “dataset-versioning-and-reproducible-snapshot-management”

Dataset by Rowan. 3,02,991 downloads.

Unique: Leverages HuggingFace Hub's Git-based versioning to provide immutable dataset snapshots with automatic caching and rollback support, without requiring separate version control infrastructure

vs others: More convenient than manual dataset versioning (Git, DVC) and simpler than data warehouse versioning, with tight integration to HuggingFace's ecosystem and automatic caching

18

medical-qa-shared-task-v1-toyDataset24/100

via “dataset versioning and reproducible snapshot loading”

Dataset by lavita. 5,55,826 downloads.

Unique: Leverages HuggingFace Hub's Git-based versioning infrastructure to provide immutable dataset snapshots with full history tracking. Enables citation-grade reproducibility through semantic versioning and automatic version pinning in code.

vs others: More reproducible than ad-hoc dataset downloads because versions are immutable and citable; better than manual versioning because Git history is automatically maintained and queryable

19

vlm_test_imagesDataset24/100

via “dataset versioning and reproducibility tracking”

Dataset by merve. 2,77,478 downloads.

Unique: Leverages HuggingFace Hub's native versioning with commit-level pinning and MLCroissant metadata integration, enabling reproducible dataset references without external version control

vs others: More reproducible than manual dataset snapshots, with built-in citation generation vs custom versioning scripts

20

img_uploadDataset23/100

via “dataset versioning and reproducibility tracking via huggingface hub”

Dataset by Maynor996. 6,17,655 downloads.

Unique: Uses HuggingFace Hub's Git-based versioning with LFS support for large files, enabling immutable dataset snapshots with commit-level granularity — differentiates from snapshot-based versioning (e.g., S3 versioning) by providing semantic version control with commit messages and author tracking

vs others: More reproducible than datasets without versioning because specific revisions are resolvable and immutable; simpler than maintaining local dataset copies because versioning is managed centrally on Hub with automatic deduplication

Top Matches

Also Known As

Company