img2dataset by rom1504

CLI tool for creating large image datasets from URLs

created 4 years ago
4,113 stars

Top 12.2% on sourcepulse

Project Summary

This tool addresses the challenge of efficiently downloading, resizing, and packaging large datasets of images from URLs, primarily for machine learning applications. It targets researchers and engineers working with massive image collections, offering significant speedups and flexible output formats.

How It Works

img2dataset employs a multi-process, multi-threaded architecture to maximize download and processing throughput. It splits URL lists into shards, distributing them across multiple processes, each with numerous threads for concurrent downloading. Resizing is handled by parent processes to balance CPU utilization. This design prioritizes network bandwidth and CPU efficiency, with optional optimizations for DNS resolution to prevent bottlenecks.
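The shard, process, and thread layering described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not img2dataset's actual code: the function names (`split_into_shards`, `fetch_url`, `download_shard`, `download_all`) and the default sizes are hypothetical, and resizing and DNS tuning are left out.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool


def split_into_shards(urls, shard_size):
    """Cut the URL list into consecutive shards of at most shard_size URLs."""
    return [urls[i:i + shard_size] for i in range(0, len(urls), shard_size)]


def fetch_url(url, timeout=10):
    """Download one URL; return its bytes, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except Exception:
        # A real downloader would record the error; this sketch just skips it.
        return None


def download_shard(shard, thread_count=8, fetch=fetch_url):
    """Download every URL of one shard concurrently with a thread pool."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # map() preserves input order, so results line up with the shard.
        return list(pool.map(fetch, shard))


def download_all(urls, shard_size=1000, processes_count=4):
    """Spread shards over worker processes; each process runs its own threads."""
    shards = split_into_shards(urls, shard_size)
    with Pool(processes=processes_count) as pool:
        return pool.map(download_shard, shards)
```

Threads suit the download step because it is mostly waiting on the network, while separate processes let CPU-bound work run on several cores. On platforms that start workers by spawning, `download_all` should be called from under an `if __name__ == "__main__":` guard.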

Quick Start & Requirements

  • Install: pip install img2dataset
  • Requirements: Python, optional fast DNS resolver for optimal performance.
  • Usage: img2dataset --url_list=myimglist.txt --output_folder=output_folder --thread_count=64 --image_size=256
  • More examples and detailed API documentation are available in the repository.
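The Usage command above can also be driven from a script. The sketch below only uses the four flags shown in that command; the helper names (`write_url_list`, `build_command`, `run_if_installed`) are hypothetical, and the one-URL-per-line layout of the text file is an assumption about the plain-text input format.

```python
import shutil
import subprocess
from pathlib import Path


def write_url_list(urls, path):
    """Write one URL per line (assumed layout of a plain-text URL list)."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")


def build_command(url_list, output_folder, thread_count=64, image_size=256):
    """Assemble the argv for the command shown in the Usage bullet."""
    return [
        "img2dataset",
        f"--url_list={url_list}",
        f"--output_folder={output_folder}",
        f"--thread_count={thread_count}",
        f"--image_size={image_size}",
    ]


def run_if_installed(command):
    """Run the command only when the executable is on PATH; else return None."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=False).returncode
```

A typical call would be `write_url_list(urls, "myimglist.txt")` followed by `run_if_installed(build_command("myimglist.txt", "output_folder"))`, where `urls` is a list of image URLs you supply.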

Highlighted Details

  • Capable of processing 100M URLs in 20 hours on a single machine.
  • Supports multiple output formats: files, webdataset, parquet, and tfrecord.
  • Includes robust filtering options for image size, area, and aspect ratio.
  • Offers incremental download capabilities for interrupted processes.
  • Integrates with Weights & Biases for performance monitoring and logging.
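To make the size, area, and aspect-ratio filters concrete, here is a small sketch of such a predicate. It is an illustration only: the function `keep_image`, its parameter names, and the choice to measure "size" as the shorter side are assumptions, not a confirmed description of how img2dataset applies its filters.

```python
import math


def keep_image(width, height, min_image_size=0,
               max_image_area=math.inf, max_aspect_ratio=math.inf):
    """Return True when an image of the given dimensions passes all filters."""
    if width <= 0 or height <= 0:
        return False
    # Size filter: the shorter side must reach the minimum (assumed definition).
    if min(width, height) < min_image_size:
        return False
    # Area filter: reject images with too many pixels.
    if width * height > max_image_area:
        return False
    # Aspect-ratio filter: longer side divided by shorter side.
    aspect_ratio = max(width, height) / min(width, height)
    return aspect_ratio <= max_aspect_ratio
```

For example, `keep_image(1000, 100, max_aspect_ratio=3)` rejects a 10:1 banner, while `keep_image(640, 480, min_image_size=256)` keeps an ordinary photo.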

Maintenance & Community

The project is maintained by Romain Beaumont. A community chat for contributions and discussion is available via DataToML.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.

Limitations & Caveats

The README mentions that standard file systems can experience performance issues with over 1 million files; the webdataset format is recommended for larger datasets. While performance is high, achieving optimal speeds may require advanced DNS resolver configuration.
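The file-count problem is why packing samples into tar archives helps: a million images become a few thousand shard files. The sketch below shows the general idea with the standard `tarfile` module, following the webdataset convention that files sharing a basename form one sample. The function name `write_tar_shards`, the shard naming, and the `.jpg`/`.txt` pairing are assumptions for illustration, not img2dataset's writer.

```python
import io
import tarfile
from pathlib import Path


def write_tar_shards(samples, output_folder, samples_per_shard=1000):
    """Pack (key, image_bytes, caption) samples into numbered tar shards.

    Returns the list of shard paths written.
    """
    output = Path(output_folder)
    output.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    for start in range(0, len(samples), samples_per_shard):
        shard_path = output / f"{start // samples_per_shard:05d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for key, image_bytes, caption in samples[start:start + samples_per_shard]:
                # Files sharing a basename form one sample inside the shard.
                members = ((".jpg", image_bytes), (".txt", caption.encode("utf-8")))
                for suffix, payload in members:
                    info = tarfile.TarInfo(name=key + suffix)
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))
        shard_paths.append(shard_path)
    return shard_paths
```

With `samples_per_shard=1000`, one million samples (two million individual files) end up in 1,000 tar files, which keeps directory listings and copies fast.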

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

108 stars in the last 90 days

Explore Similar Projects


clip-retrieval by rom1504

0.3%
3k
CLIP retrieval system for semantic search
created 4 years ago
updated 1 year ago