img2dataset by rom1504

CLI tool for creating large image datasets from URLs

Created 4 years ago

4,340 stars

Top 11.2% on SourcePulse

17 Experts Love This Project

Wing Lian

Founder of Axolotl AI

Anastasios Angelopoulos

Cofounder of LMArena

Phil Wang

Prolific Research Paper Implementer

Jesse Clark

Cofounder of Marqo

and 13 more!

Project Summary

This tool downloads, resizes, and packages large image datasets from lists of URLs, primarily for machine learning applications. It targets researchers and engineers working with massive image collections, and it emphasizes high throughput on a single machine and a choice of output formats.

How It Works

img2dataset uses a multi-process, multi-threaded architecture to maximize download and processing throughput. It splits the URL list into shards and distributes the shards across multiple processes, each of which runs many threads for concurrent downloading. Resizing is handled at the process level, by the process that owns each group of download threads, to balance CPU utilization. The design aims to keep both network bandwidth and CPU busy, and DNS resolution can optionally be tuned so that it does not become the bottleneck.
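The shard, process, and thread layering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not img2dataset's actual code: the function names (`split_into_shards`, `process_shard`, `run_all`) are hypothetical, `fake_fetch` and `fake_resize` are stand-ins that touch neither the network nor real images, and running the resize step in the worker process after its threads return is only one reading of the description.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool


def split_into_shards(urls, shard_size):
    """Cut the URL list into consecutive shards of at most shard_size items."""
    return [urls[i:i + shard_size] for i in range(0, len(urls), shard_size)]


def fake_fetch(url):
    """Stand-in for an HTTP download; returns (url, payload) with no network I/O."""
    return url, f"bytes-of-{url}"


def fake_resize(payload):
    """Stand-in for CPU-bound image resizing."""
    return payload.upper()


def process_shard(shard, thread_count=4):
    """Download one shard with a thread pool, then post-process in this process."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        downloaded = list(pool.map(fake_fetch, shard))  # I/O-bound part
    # CPU-bound part runs in the worker process itself (an assumption).
    return [(url, fake_resize(payload)) for url, payload in downloaded]


def run_all(urls, shard_size=2, processes_count=2):
    """Fan shards out to worker processes; each worker runs its own threads."""
    shards = split_into_shards(urls, shard_size)
    with Pool(processes_count) as pool:
        return pool.map(process_shard, shards)
```

Calling `run_all(urls)` from a script's `if __name__ == "__main__":` block returns one result list per shard. Threads suit the download step because it mostly waits on the network, while separate processes let the CPU-bound work use several cores.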

Quick Start & Requirements

Install: pip install img2dataset
Requirements: Python. A fast DNS resolver is optional but helps reach peak throughput.
Usage: img2dataset --url_list=myimglist.txt --output_folder=output_folder --thread_count=64 --image_size=256
More examples and detailed API documentation are available in the repository.
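For scripted runs, the usage line above can be assembled from a dictionary of options. The sketch below uses only the four flags shown in that usage line; `build_command` and `run` are hypothetical helper names, and `run` assumes the `img2dataset` executable is on the PATH.

```python
import subprocess

# Options copied from the usage line above.
OPTIONS = {
    "url_list": "myimglist.txt",
    "output_folder": "output_folder",
    "thread_count": 64,
    "image_size": 256,
}


def build_command(options):
    """Turn {"name": value} pairs into ["img2dataset", "--name=value", ...]."""
    return ["img2dataset"] + [f"--{name}={value}" for name, value in options.items()]


def run(options=OPTIONS):
    """Launch the CLI; raises CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.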

Highlighted Details

Can download, resize, and package 100M URLs in 20 hours on a single machine.
Supports multiple output formats: files, webdataset, parquet, and tfrecord.
Includes filtering options for image size, area, and aspect ratio.
Offers an incremental mode for resuming interrupted downloads.
Integrates with Weights & Biases for performance monitoring and logging.
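To make the size, area, and aspect-ratio filters concrete, here is a small predicate that applies all three checks to an image's dimensions. It is an illustrative sketch rather than the tool's implementation: the function name and the parameter names `min_side`, `max_area`, and `max_aspect_ratio` are hypothetical, and the exact semantics of the real options should be taken from the repository's documentation.

```python
def passes_filters(width, height, min_side=0,
                   max_area=float("inf"), max_aspect_ratio=float("inf")):
    """Return True if an image of the given dimensions survives all filters."""
    if width <= 0 or height <= 0:
        return False  # degenerate image, nothing to keep
    if min(width, height) < min_side:
        return False  # too small on its shorter side
    if width * height > max_area:
        return False  # too many pixels
    if max(width, height) / min(width, height) > max_aspect_ratio:
        return False  # too elongated
    return True
```

With `min_side=256`, a 640x480 image passes and a 100x480 image is rejected; with `max_aspect_ratio=3`, a 3000x100 banner is rejected.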

Maintenance & Community

The project is actively maintained by Romain Beaumont. A community chat is available via DataToML for contributions and discussions.

Licensing & Compatibility

The README does not state a license explicitly. Before commercial use or closed-source linking, check the licensing terms in the repository itself.

Limitations & Caveats

The README notes that standard file systems can perform poorly once a dataset exceeds about 1 million files, so the webdataset format is recommended for larger datasets. Reaching peak download speed may also require configuring a fast DNS resolver.
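The file-count problem is easier to see with a sketch of the alternative: packing many samples into a small number of tar archives, which is the general idea behind the webdataset format. The code below uses only the standard library and is not how img2dataset writes its output; `write_tar_shards`, the five-digit shard names, and the `.jpg` member suffix are assumptions made for illustration.

```python
import io
import os
import tarfile
import tempfile


def write_tar_shards(samples, out_dir, samples_per_shard):
    """Pack (key, data) pairs into numbered tar files; return the shard paths."""
    paths = []
    for start in range(0, len(samples), samples_per_shard):
        path = os.path.join(out_dir, f"{start // samples_per_shard:05d}.tar")
        with tarfile.open(path, "w") as tar:
            for key, data in samples[start:start + samples_per_shard]:
                info = tarfile.TarInfo(name=f"{key}.jpg")
                info.size = len(data)  # required so the payload is written
                tar.addfile(info, io.BytesIO(data))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        samples = [(f"{i:04d}", b"placeholder") for i in range(5)]
        for shard_path in write_tar_shards(samples, out_dir, 2):
            print(os.path.basename(shard_path))
```

At 10,000 samples per archive, a million images become 100 files on disk instead of a million, which keeps directory listings and copies manageable.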

Health Check

Last Commit

2 months ago

Responsiveness

1 day

Star History

103 stars in the last 30 days