img2dataset
Easily turn a set of image URLs into an image dataset
Capabilities (10 decomposed)
multi-format url list parsing and metadata extraction
Medium confidence: The Reader component parses input URL lists from multiple formats (CSV, JSON, JSONL, Parquet) and extracts associated metadata like captions, alt text, and image attributes. It uses temporary feather files for memory-efficient handling of large datasets, sharding the input into work units that can be distributed across workers. This design allows processing of datasets ranging from thousands to billions of images without loading entire datasets into memory.
Uses feather file intermediate format for memory-efficient sharding of billion-scale datasets, avoiding full in-memory loading while maintaining fast random access for distributed workers
More memory-efficient than tools that load entire URL lists into RAM (e.g., basic wget scripts or simple Python loops), enabling processing of datasets larger than available system memory
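The sharding idea above can be sketched with the standard library alone. This is an illustrative stand-in, not img2dataset's actual Reader: the real tool writes feather files via pyarrow, while this sketch streams a CSV of (url, caption) rows into fixed-size shard files so no more than one shard is ever held in memory. The function name and shard naming scheme are illustrative.

```python
import csv
import itertools
from pathlib import Path

def shard_url_list(csv_path: str, out_dir: str, shard_size: int = 10000) -> list:
    """Split a CSV of (url, caption) rows into fixed-size shard files.

    Streams the input row by row, so memory use is bounded by shard_size
    regardless of how large the source list is.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards = []
    with open(csv_path, newline="") as f:
        rows = csv.reader(f)
        header = next(rows)
        for shard_id in itertools.count():
            # Take the next shard_size rows without materializing the rest.
            chunk = list(itertools.islice(rows, shard_size))
            if not chunk:
                break
            shard_path = out / f"{shard_id:05d}.csv"
            with open(shard_path, "w", newline="") as g:
                w = csv.writer(g)
                w.writerow(header)
                w.writerows(chunk)
            shards.append(shard_path)
    return shards
```

Each shard file then becomes one work unit handed to a downloader worker.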
concurrent http image downloading with thread pooling
Medium confidence: The Downloader component creates a thread pool to fetch multiple images concurrently from URLs, integrating HTTP request handling, optional hash verification, robots.txt directive checking, image decoding, and error handling throughout the pipeline. Each worker maintains its own thread pool, allowing fine-grained control over concurrency levels and connection pooling. The architecture supports custom HTTP headers, timeout configuration, and graceful handling of network failures with retry logic.
Integrates robots.txt compliance checking and hash verification directly into the download pipeline, with per-worker thread pools enabling fine-grained concurrency control across distributed workers
More robust than simple wget/curl loops because it handles robots.txt directives, verifies image integrity, and provides granular error reporting; faster than sequential downloads by using thread pools per worker
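A minimal sketch of the per-worker thread pool with per-sample error capture, assuming a pluggable `fetch` function (the names `fetch_image` and `download_shard` are illustrative, not img2dataset's API). Errors are recorded per URL instead of aborting the shard, which is what makes the granular error reporting possible.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

def fetch_image(url: str, timeout: float = 10.0,
                user_agent: str = "img2dataset-example") -> bytes:
    """Fetch one image; the custom User-Agent header mirrors the
    configurable-headers behaviour (the value here is illustrative)."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()

def download_shard(urls, fetch=fetch_image, max_workers=32):
    """Download a shard of URLs concurrently.

    Returns (url, data_or_None, error_or_None) triples in input order;
    a failed URL yields a record rather than raising.
    """
    def one(url):
        try:
            return url, fetch(url), None
        except Exception as exc:  # network errors become per-sample records
            return url, None, repr(exc)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one, urls))
```

Because the work is I/O-bound, threads spend most of their time blocked on the network, so the GIL is not the bottleneck here.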
multi-mode image resizing and normalization
Medium confidence: The Resizer component applies configurable image transformations including multiple resize modes (e.g., center crop, pad, stretch), format conversion, and quality normalization. It supports various resize strategies to handle aspect ratio preservation, enabling datasets with consistent dimensions for model training. The component integrates with the download pipeline to process images immediately after decoding, reducing memory footprint by avoiding storage of full-resolution intermediates.
Integrates resizing directly into the download pipeline as an in-memory transformation, avoiding intermediate storage of full-resolution images and reducing disk I/O overhead
More efficient than post-processing resizing because it reduces memory footprint and disk writes; supports multiple resize modes natively without external image processing tools
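The geometry behind two common resize modes can be shown as pure functions: a centered square crop (which is then scaled to the target size) and an aspect-ratio-preserving resize where the shorter side is scaled to the target. These are one common convention for such modes; img2dataset's exact mode names and behaviour may differ.

```python
def center_crop_box(width: int, height: int):
    """Crop box (left, top, right, bottom) for a center crop:
    the largest centered square, later scaled to size x size."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

def keep_ratio_size(width: int, height: int, size: int):
    """Output dimensions when scaling so the shorter side equals `size`,
    preserving aspect ratio (no cropping)."""
    if width <= height:
        return size, max(1, round(height * size / width))
    return max(1, round(width * size / height)), size
```

A "stretch" mode would simply return (size, size) and distort the aspect ratio; "pad" keeps the keep-ratio dimensions and fills the remaining border.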
distributed dataset writing with multiple output formats
Medium confidence: The SampleWriter component outputs processed images and metadata in multiple formats optimized for different ML frameworks (WebDataset, Parquet, LMDB, TFRecord). It handles sharded output to avoid bottlenecks, writing data in parallel across workers. The component manages file organization, metadata serialization, and format-specific optimizations (e.g., tar-based streaming for WebDataset, columnar storage for Parquet). This architecture enables seamless integration with downstream ML pipelines.
Supports multiple output formats (WebDataset, Parquet, LMDB, TFRecord) with format-specific optimizations, enabling single pipeline to produce datasets compatible with different ML frameworks without post-processing
More flexible than single-format tools because it supports multiple output formats natively; more efficient than converting between formats post-hoc because optimizations are applied during writing
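The WebDataset convention mentioned above is just a tar file where each sample's files share a key prefix. A stdlib sketch of one shard writer (a sketch of the format's convention, not img2dataset's writer itself; the function name is illustrative):

```python
import io
import json
import tarfile

def write_webdataset_shard(path: str, samples) -> int:
    """Write (key, image_bytes, metadata_dict) samples as a WebDataset-style tar.

    Each sample becomes two adjacent entries, <key>.jpg and <key>.json,
    so downstream readers can stream samples sequentially.
    """
    count = 0
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, meta in samples:
            for name, payload in ((f"{key}.jpg", image_bytes),
                                  (f"{key}.json", json.dumps(meta).encode())):
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
            count += 1
    return count
```

Because each worker writes its own shard files, output parallelism needs no coordination between workers.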
multiprocessing-based single-machine distribution
Medium confidence: The multiprocessing distributor allocates work units across multiple CPU cores on a single machine using Python's multiprocessing module. It spawns worker processes that each run independent Downloader instances, coordinating through a shared work queue and logger process. This strategy maximizes hardware utilization for datasets that fit within single-machine resources, avoiding the overhead of distributed computing frameworks.
Uses Python multiprocessing with per-worker thread pools for concurrent HTTP downloads, combining process-level parallelism for CPU work with thread-level parallelism for I/O-bound network requests
Simpler to set up than Spark or Ray for single-machine use cases; lower overhead than distributed frameworks for datasets under 10M images; no external cluster infrastructure required
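The two-level parallelism pattern (processes for CPU work, threads inside each process for network I/O) can be sketched as follows. This is a generic illustration of the distribution pattern, not img2dataset's scheduler; `assign_shards` and `run_local` are hypothetical names.

```python
from concurrent.futures import ProcessPoolExecutor

def assign_shards(shard_ids, processes_count):
    """Round-robin shard assignment: shard i goes to worker i % processes_count."""
    buckets = [[] for _ in range(processes_count)]
    for i, sid in enumerate(shard_ids):
        buckets[i % processes_count].append(sid)
    return buckets

def run_local(worker, shard_ids, processes_count=4):
    """Run `worker(shard_id)` across local processes.

    Each process handles its shards independently (its own Downloader
    and thread pool), so no cross-process coordination is needed
    beyond handing out shard IDs.
    """
    with ProcessPoolExecutor(max_workers=processes_count) as pool:
        return list(pool.map(worker, shard_ids))
```

Inside each process, `worker` would run a threaded downloader over its shard, giving process-level parallelism for decoding/resizing and thread-level parallelism for HTTP requests.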
pyspark-based distributed dataset processing
Medium confidence: The PySpark distributor scales image downloading across a Spark cluster by partitioning work units into RDDs and distributing them to Spark executors. Each executor runs a Downloader instance, with Spark handling fault tolerance, load balancing, and resource management. This strategy enables processing of massive datasets (billions of images) across commodity clusters while providing automatic recovery from node failures.
Integrates with Spark's RDD partitioning and executor model, leveraging Spark's fault tolerance and load balancing for billion-scale image downloads without custom distributed coordination logic
More scalable than multiprocessing for datasets >10M images; provides automatic fault tolerance and recovery unlike Ray; integrates with existing Spark infrastructure in enterprises
ray-based cloud-distributed dataset processing
Medium confidence: The Ray distributor scales image downloading across Ray clusters (on-premises or cloud-based) by creating remote tasks that execute Downloader instances on Ray workers. Ray handles dynamic resource allocation, auto-scaling, and fault recovery. This strategy enables elastic scaling on cloud platforms (AWS, GCP, Azure) with minimal infrastructure management, supporting both on-demand and spot instances.
Uses Ray's task-based execution model with dynamic resource allocation, enabling elastic cloud scaling and spot instance support without explicit cluster management code
More cloud-native than Spark with better auto-scaling support; simpler to set up than Spark for cloud deployments; supports dynamic resource allocation that Spark requires manual configuration for
real-time pipeline monitoring and statistics logging
Medium confidence: The Logger component monitors the entire download pipeline in real-time, collecting statistics on download success rates, processing speed, error types, and resource utilization. It runs as a separate process to avoid blocking worker threads, aggregating metrics from all workers and writing periodic reports. The logger provides visibility into pipeline health, enabling detection of bottlenecks, network issues, or configuration problems.
Runs as separate process to avoid blocking worker threads, aggregating real-time statistics from all workers with minimal performance overhead while providing comprehensive pipeline visibility
More integrated than external monitoring tools because it has direct access to pipeline internals; lower overhead than application-level instrumentation because it runs in separate process
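The aggregation step can be sketched as a small class that merges per-worker status counts and derives success rate and throughput. A sketch only: the class name and stat field names (`success`, `timeout`, ...) are illustrative, and in the real design this would run in the separate logger process, fed by a queue.

```python
import time
from collections import Counter

class StatsAggregator:
    """Merge per-worker status counts and report pipeline-wide metrics."""

    def __init__(self):
        self.counts = Counter()
        self.start = time.monotonic()

    def update(self, worker_stats: dict) -> None:
        # Each worker periodically sends a dict like {"success": 90, "timeout": 7}.
        self.counts.update(worker_stats)

    def report(self) -> dict:
        total = sum(self.counts.values())
        elapsed = max(time.monotonic() - self.start, 1e-9)
        return {
            "total": total,
            "success_rate": (self.counts["success"] / total) if total else 0.0,
            "images_per_sec": total / elapsed,
            "errors": {k: v for k, v in self.counts.items() if k != "success"},
        }
```

Keeping aggregation out of the worker hot path is what keeps the monitoring overhead low.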
incremental download with resume and deduplication
Medium confidence: The pipeline supports resuming interrupted downloads by tracking completed work units and skipping already-processed images. It uses metadata (URLs, hashes) to detect duplicates across runs, avoiding redundant downloads. This capability enables long-running pipelines to recover from failures without reprocessing, and supports incremental dataset growth by appending new images to existing datasets.
Tracks completion state per work unit and uses hash-based deduplication to enable resuming interrupted pipelines and incrementally growing datasets without reprocessing
More efficient than restarting from scratch because it skips completed work; more robust than manual tracking because state is managed automatically by the pipeline
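Both halves of this (resume and dedup) can be sketched with a completion-state file plus a hash set, assuming a JSON state file and md5 URL hashes; all names here are illustrative, not img2dataset's actual state format.

```python
import hashlib
import json
from pathlib import Path

def load_done(state_path: str) -> set:
    """Read the set of completed shard IDs, empty on first run."""
    p = Path(state_path)
    return set(json.loads(p.read_text())) if p.exists() else set()

def mark_done(state_path: str, shard_id: str) -> None:
    """Record a shard as complete (rewrites the whole state file)."""
    done = load_done(state_path)
    done.add(shard_id)
    Path(state_path).write_text(json.dumps(sorted(done)))

def pending_shards(all_shards, state_path: str):
    """Skip shards already recorded as complete, enabling resume."""
    done = load_done(state_path)
    return [s for s in all_shards if s not in done]

def dedup_urls(urls, seen_hashes: set):
    """Keep only URLs whose hash has not been seen across runs."""
    fresh = []
    for url in urls:
        h = hashlib.md5(url.encode()).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(url)
    return fresh
```

Persisting `seen_hashes` alongside the state file is what lets a later run append new images without re-downloading old ones.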
configurable http headers and robots.txt compliance checking
Medium confidence: The downloader supports custom HTTP headers (User-Agent, Authorization, etc.) for accessing protected or restricted image sources. It integrates robots.txt checking to respect website crawling directives, parsing robots.txt files and validating URLs against allow/disallow rules before downloading. This enables ethical dataset creation while supporting authentication-protected image sources.
Integrates robots.txt parsing and validation directly into download pipeline, checking compliance before HTTP requests and supporting custom headers for authentication-protected sources
More ethical than tools ignoring robots.txt; supports authentication unlike basic wget; integrated into pipeline rather than requiring separate compliance checking step
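Python's standard library already provides the allow/disallow logic via `urllib.robotparser`; a sketch of the pre-download check (in production the robots.txt would be fetched once per host and cached, and the user-agent string here is illustrative):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str,
                      user_agent: str = "img2dataset") -> bool:
    """Check a URL against already-fetched robots.txt text.

    Returns False when a Disallow rule matching `user_agent` covers the URL.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Running this check before issuing the HTTP request means disallowed URLs are skipped without ever contacting the server for the image itself.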
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with img2dataset, ranked by overlap. Discovered automatically through the match graph.
CLIP-Interrogator
CLIP-Interrogator — AI demo on HuggingFace
Cre8tiveAI
Cre8tiveAI is an AI-based SaaS platform that offers a wide range of creative tools for photo, illustration, and video editing...
Icecream Apps Ltd
Versatile suite of user-friendly digital tools for everyday...
Creatie
Revolutionize design with AI, automation, and collaborative...
Ad Morph AI
Enhances the quality and appeal of ad images with a single...
Imagen AI
Revolutionize content with AI-driven image, video...
Best For
- ✓ ML engineers building large-scale vision datasets
- ✓ researchers working with web-scraped image collections
- ✓ teams migrating from manual dataset curation to automated pipelines
- ✓ teams building large-scale web-scraped datasets
- ✓ researchers downloading public image collections
- ✓ ML practitioners creating training datasets from URL lists
- ✓ ML engineers preparing datasets for specific model architectures
- ✓ teams standardizing image dimensions across heterogeneous sources
Known Limitations
- ⚠ Requires input URLs to be in supported formats; custom formats need preprocessing
- ⚠ Metadata extraction is limited to fields present in the input file; cannot infer missing metadata
- ⚠ Feather file intermediate storage adds disk I/O overhead for very small datasets (<1000 images)
- ⚠ Thread pool concurrency is limited by the GIL in CPython; actual parallelism depends on I/O blocking
- ⚠ No built-in rate limiting per domain; aggressive downloading may trigger IP bans
- ⚠ robots.txt checking is advisory only; it does not enforce legal compliance