datachain by iterative

AI-data warehouse for unstructured data transformation and analysis

Created 1 year ago
2,694 stars

Top 17.5% on SourcePulse

Project Summary

DataChain is a Python-based AI data warehouse designed for transforming and analyzing unstructured data like images, audio, video, text, and PDFs. It targets data scientists and engineers working with large, multimodal datasets, offering efficient ETL, analytics, and versioning without data duplication by referencing external storage.

How It Works

DataChain operates by creating a columnar dataset that unifies file references and metadata from external storage (S3, GCP, Azure, local). It uses a Pythonic API for defining data transformations, including applying AI models and LLMs, and performs vectorized operations directly on these Python objects. This approach avoids data movement and duplication, enabling scalable, memory-efficient processing without requiring SQL or Spark.
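The idea of a columnar dataset built from file references can be illustrated with a minimal sketch. This is not DataChain's API: `FileRef`, `RefDataset`, and the `example-bucket` URIs are hypothetical names invented for illustration, and the sketch only models references and metadata in memory without touching any storage backend.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class FileRef:
    """Hypothetical pointer to an object in external storage; bytes are never copied."""
    uri: str
    size: int


@dataclass
class RefDataset:
    """Hypothetical columnar dataset: one list per column, aligned by row index."""
    columns: dict[str, list]

    @classmethod
    def from_refs(cls, refs: list[FileRef]) -> RefDataset:
        return cls({"file": list(refs)})

    def rows(self) -> list[dict[str, Any]]:
        """Materialize rows as dicts keyed by column name."""
        names = list(self.columns)
        return [dict(zip(names, values)) for values in zip(*self.columns.values())]

    def map(self, name: str, fn: Callable[[FileRef], Any]) -> RefDataset:
        """Return a new dataset with an extra column computed from each reference."""
        new_columns = dict(self.columns)
        new_columns[name] = [fn(ref) for ref in self.columns["file"]]
        return RefDataset(new_columns)

    def filter(self, predicate: Callable[[dict[str, Any]], bool]) -> RefDataset:
        """Return a new dataset keeping only the rows that satisfy the predicate."""
        kept = [row for row in self.rows() if predicate(row)]
        return RefDataset({n: [row[n] for row in kept] for n in self.columns})


refs = [
    FileRef("s3://example-bucket/cat.jpg", 120_000),
    FileRef("s3://example-bucket/notes.pdf", 45_000),
    FileRef("s3://example-bucket/dog.jpg", 300_000),
]

large_jpgs = (
    RefDataset.from_refs(refs)
    .map("ext", lambda f: f.uri.rsplit(".", 1)[-1])
    .filter(lambda row: row["ext"] == "jpg" and row["file"].size > 200_000)
)
selected_uris = [row["file"].uri for row in large_jpgs.rows()]
print(selected_uris)
```

Each step returns a new dataset that holds the same `FileRef` objects, so adding a column or filtering rows changes only the metadata, not the files themselves.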

Quick Start & Requirements

Highlighted Details

  • Multimodal dataset versioning without data copies, supporting various file types and cloud storage.
  • Python-friendly API for operating on data objects and running high-scale computations.
  • Data enrichment via local AI models and LLM APIs, with filtering, joining, and vector search capabilities.
  • High-performance vectorized operations and integration with PyTorch/TensorFlow.
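Versioning without data copies can be pictured as storing, for each version, only a snapshot of references. The sketch below assumes each file is identified by a `(uri, etag)` pair; `RefVersionStore` and the bucket paths are hypothetical and do not reflect how DataChain implements versioning internally.

```python
from __future__ import annotations

Ref = tuple[str, str]  # (uri, etag): assumed identity of a file in external storage


class RefVersionStore:
    """Hypothetical registry where each dataset version is an immutable tuple of references."""

    def __init__(self) -> None:
        self._versions: dict[str, list[tuple[Ref, ...]]] = {}

    def save(self, name: str, refs: list[Ref]) -> int:
        """Record a new version and return its 1-based version number."""
        history = self._versions.setdefault(name, [])
        history.append(tuple(sorted(refs)))
        return len(history)

    def load(self, name: str, version: int | None = None) -> tuple[Ref, ...]:
        """Return the requested version, or the latest one when no version is given."""
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]

    def diff(self, name: str, old: int, new: int) -> dict[str, list[Ref]]:
        """List references added and removed between two versions."""
        before, after = set(self.load(name, old)), set(self.load(name, new))
        return {"added": sorted(after - before), "removed": sorted(before - after)}


store = RefVersionStore()
v1 = store.save("images", [
    ("s3://example-bucket/a.jpg", "etag-1"),
    ("s3://example-bucket/b.jpg", "etag-2"),
])
v2 = store.save("images", [
    ("s3://example-bucket/a.jpg", "etag-1"),
    ("s3://example-bucket/c.jpg", "etag-3"),
])
changes = store.diff("images", v1, v2)
print(changes)
```

Both versions stay loadable, and the unchanged file `a.jpg` appears in each snapshot as a reference only, so no file content is duplicated between versions.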

Maintenance & Community

Licensing & Compatibility

  • License not explicitly stated in the README.
  • Commercial use and closed-source linking are possible only if the license turns out to be permissive; verify the license in the repository before relying on this.

Limitations & Caveats

  • The README points teams that need centralized registries, data lineage, a UI for multimodal data, or scalable compute for 100M+ files to the proprietary DataChain Studio platform.
  • The specific open-source license is not detailed in the provided README.

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 57
  • Issues (30d): 15

Star History

  • 23 stars in the last 30 days

Explore Similar Projects

Starred by Dominik Moritz (Research Scientist at Apple; Professor at CMU) and Casey Caruso (Managing Partner of Topology Ventures).

latent-scope by enjalot

0.1%
734
Scientific tool for latent space investigation
Created 2 years ago
Updated 5 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Alexander Wettig (Coauthor of SWE-bench, SWE-agent), and 5 more.

data-juicer by modelscope

1.0%
5k
Data-Juicer: Data processing system for foundation models
Created 2 years ago
Updated 11 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Anton Troynikov (Cofounder of Chroma), and 47 more.

llama_index by run-llama

0.3%
45k
Data framework for building LLM-powered agents
Created 3 years ago
Updated 21 hours ago