CLI tool for creating large image datasets from URLs
img2dataset downloads, resizes, and packages large sets of images from URL lists, primarily for machine learning applications. It targets researchers and engineers working with very large image collections, and it offers high download throughput and a choice of output formats.
How It Works
img2dataset employs a multi-process, multi-threaded architecture to maximize download and processing throughput. It splits URL lists into shards, distributing them across multiple processes, each with numerous threads for concurrent downloading. Resizing is handled by parent processes to balance CPU utilization. This design prioritizes network bandwidth and CPU efficiency, with optional optimizations for DNS resolution to prevent bottlenecks.
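The shard, process, and thread layering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not img2dataset's actual implementation: `make_shards`, `fake_download`, `process_shard`, and `run` are hypothetical names, the download is replaced by a stand-in that touches no network, and the CPU-bound step is placed in the process that owns the threads, which is only one reading of the description.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool


def make_shards(urls, shard_size):
    """Split the URL list into consecutive shards of at most shard_size items."""
    return [urls[i:i + shard_size] for i in range(0, len(urls), shard_size)]


def fake_download(url):
    """Hypothetical stand-in for an HTTP GET; returns bytes without any network I/O."""
    return url.encode("utf-8")


def process_shard(job):
    """Handle one shard: threads for the I/O-bound part, then a CPU-bound pass."""
    shard_id, shard, thread_count = job
    # Many threads per process keep the network busy while requests are pending.
    with ThreadPoolExecutor(max_workers=thread_count) as threads:
        payloads = list(threads.map(fake_download, shard))
    # Stand-in for resizing: CPU-bound work done once the payloads have arrived.
    sizes = [len(payload) for payload in payloads]
    return shard_id, sizes


def run(urls, shard_size=4, processes_count=2, thread_count=8):
    """Distribute shards over worker processes and collect results per shard id."""
    jobs = [(i, shard, thread_count) for i, shard in enumerate(make_shards(urls, shard_size))]
    with Pool(processes_count) as pool:
        return dict(pool.imap_unordered(process_shard, jobs))
```

When calling `run(...)` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The design point is that threads hide network latency, while separate processes spread the CPU-bound work over several cores.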
Quick Start & Requirements
pip install img2dataset
img2dataset --url_list=myimglist.txt --output_folder=output_folder --thread_count=64 --image_size=256
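If the command needs to be launched from a script, one option is to assemble the argument list in Python and hand it to `subprocess`. The sketch below is an assumption about how one might wrap the CLI; `build_command` is a hypothetical helper, not part of img2dataset. It uses only the flags shown above plus `--output_format`, which selects the output format (for example `webdataset`).

```python
import shlex


def build_command(url_list, output_folder, thread_count=64, image_size=256,
                  output_format=None):
    """Build the img2dataset argument list (hypothetical helper for illustration)."""
    argv = [
        "img2dataset",
        f"--url_list={url_list}",
        f"--output_folder={output_folder}",
        f"--thread_count={thread_count}",
        f"--image_size={image_size}",
    ]
    # Only add the format flag when the caller asks for a specific format.
    if output_format is not None:
        argv.append(f"--output_format={output_format}")
    return argv


# Same command as the quick-start line, rendered back to a shell string.
command_line = shlex.join(build_command("myimglist.txt", "output_folder"))
```

To actually run it, pass the list to `subprocess.run(argv, check=True)` in an environment where img2dataset is installed.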
Highlighted Details
Maintenance & Community
The project is actively maintained by Romain Beaumont. A community chat is available via DataToML for contributions and discussions.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.
Limitations & Caveats
The README mentions that standard file systems can experience performance issues with over 1 million files; the webdataset format is recommended for larger datasets. While performance is high, achieving optimal speeds may require advanced DNS resolver configuration.
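The reason a tar-based format such as webdataset helps with the file-count problem can be illustrated with the standard library alone: many samples are packed into a small number of tar shards instead of one file each. This is a rough sketch of the idea, not img2dataset's writer; `write_tar_shards`, the `.jpg` suffix, and the five-digit shard names are assumptions chosen for illustration.

```python
import io
import tarfile
from pathlib import Path


def write_tar_shards(samples, out_dir, samples_per_shard):
    """Pack (key, bytes) samples into numbered tar shards; return the shard paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for index, (key, data) in enumerate(samples):
        # Start a new shard every samples_per_shard items.
        if index % samples_per_shard == 0:
            if tar is not None:
                tar.close()
            path = out_dir / f"{index // samples_per_shard:05d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        # Each sample becomes one tar member named after its key.
        info = tarfile.TarInfo(name=f"{key}.jpg")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    if tar is not None:
        tar.close()
    return shard_paths
```

With 10,000 samples per shard, for example, one million images become 100 files on disk, which keeps directory listings and metadata operations cheap.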