datachain  by iterative

AI-data warehouse for unstructured data transformation and analysis

Created 1 year ago
2,664 stars

Top 17.7% on SourcePulse

Project Summary

DataChain is a Python-based AI data warehouse designed for transforming and analyzing unstructured data like images, audio, video, text, and PDFs. It targets data scientists and engineers working with large, multimodal datasets, offering efficient ETL, analytics, and versioning without data duplication by referencing external storage.

How It Works

DataChain operates by creating a columnar dataset that unifies file references and metadata from external storage (S3, GCP, Azure, local). It uses a Pythonic API for defining data transformations, including applying AI models and LLMs, and performs vectorized operations directly on these Python objects. This approach avoids data movement and duplication, enabling scalable, memory-efficient processing without requiring SQL or Spark.
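The reference-based idea above can be sketched in plain Python. This is not DataChain's API: `FileRef`, `index_files`, and `add_column` are hypothetical names used for illustration, and a temporary local directory stands in for external storage. The sketch records a pointer and metadata for each file, attaches a computed column lazily, and filters on metadata without copying any file.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class FileRef:
    """A pointer to a file plus lightweight metadata (hypothetical type)."""
    path: str
    size: int


def index_files(root: Path) -> list[FileRef]:
    """Record a reference for every file under root without reading contents."""
    return [
        FileRef(path=str(p), size=p.stat().st_size)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]


def add_column(refs, name, fn):
    """Lazily attach a computed column to each row; fn receives a FileRef."""
    for ref in refs:
        yield {"file": ref, name: fn(ref)}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two small stand-in files; real datasets would live in cloud storage.
    (root / "cat.txt").write_text("a small cat")
    (root / "dog.txt").write_text("a very large dog indeed")

    refs = index_files(root)
    rows = add_column(refs, "suffix", lambda r: Path(r.path).suffix)
    # Filter on metadata only; file contents are never copied or loaded.
    big = [row for row in rows if row["file"].size > 15]
    big_names = [Path(row["file"].path).name for row in big]
    print(big_names)
```

The rows hold only paths and metadata, so memory use depends on the number of files rather than their size.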

Quick Start & Requirements

Highlighted Details

  • Multimodal dataset versioning without data copies, supporting various file types and cloud storage.
  • Python-friendly API for operating on data objects and running high-scale computations.
  • Data enrichment via local AI models and LLM APIs, with filtering, joining, and vector search capabilities.
  • High-performance vectorized operations and integration with PyTorch/TensorFlow.
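Versioning without data copies can be pictured as snapshotting references rather than files. The sketch below is an assumption about the general technique, not DataChain's implementation: `RefRegistry` and the `s3://example-bucket/...` URIs are hypothetical.

```python
class RefRegistry:
    """Stores named dataset versions as immutable tuples of file URIs."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of snapshots

    def save(self, name, uris):
        """Snapshot the given references and return the new version number."""
        snapshots = self._versions.setdefault(name, [])
        snapshots.append(tuple(uris))
        return len(snapshots)

    def load(self, name, version=None):
        """Return a snapshot; the latest one when version is omitted."""
        snapshots = self._versions[name]
        index = len(snapshots) - 1 if version is None else version - 1
        return snapshots[index]


registry = RefRegistry()
# Hypothetical URIs; only these strings are stored, never the image bytes.
v1 = registry.save("pets", ["s3://example-bucket/cat.jpg"])
v2 = registry.save(
    "pets",
    ["s3://example-bucket/cat.jpg", "s3://example-bucket/dog.jpg"],
)
```

Each version costs only the size of its reference list, and earlier versions stay readable after new ones are saved.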

Maintenance & Community

Licensing & Compatibility

  • License not explicitly stated in the README.
  • Commercial use and closed-source linking would be permitted only if the license turns out to be permissive; this is an assumption, not something the README confirms.

Limitations & Caveats

  • Teams needing centralized registries, data lineage, a UI for multimodal data, or scalable compute for 100M+ files are pointed to the proprietary DataChain Studio platform.
  • The specific open-source license is not detailed in the provided README.

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 51
  • Issues (30d): 16
  • Star history: 51 stars in the last 30 days
