datachain by iterative

AI-data warehouse for unstructured data transformation and analysis

created 1 year ago
2,616 stars

Top 18.4% on sourcepulse

1 Expert Loves This Project
Project Summary

DataChain is a Python-based AI data warehouse designed for transforming and analyzing unstructured data like images, audio, video, text, and PDFs. It targets data scientists and engineers working with large, multimodal datasets, offering efficient ETL, analytics, and versioning without data duplication by referencing external storage.

How It Works

DataChain operates by creating a columnar dataset that unifies file references and metadata from external storage (S3, GCP, Azure, local). It uses a Pythonic API for defining data transformations, including applying AI models and LLMs, and performs vectorized operations directly on these Python objects. This approach avoids data movement and duplication, enabling scalable, memory-efficient processing without requiring SQL or Spark.
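The reference-based design described above can be illustrated with a small, library-free sketch. This is not DataChain's API: `FileRef`, `Row`, `RefDataset`, and `toy_model` are hypothetical names invented for illustration, and the `s3://bucket/...` URIs are placeholders. It only shows the general idea of keeping file pointers plus metadata in rows and applying Python functions to them without moving the underlying files.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class FileRef:
    """Hypothetical pointer to a file in external storage; bytes are never copied."""
    uri: str
    size: int


@dataclass(frozen=True)
class Row:
    """One dataset row: a file reference plus metadata columns."""
    file: FileRef
    meta: dict = field(default_factory=dict)


class RefDataset:
    """Toy dataset of file references (an assumption, not the real library class)."""

    def __init__(self, rows: Iterable[Row]):
        self.rows = list(rows)

    def map(self, name: str, fn: Callable[[Row], Any]) -> "RefDataset":
        # Add a new metadata column computed by fn; the file reference is reused as-is.
        return RefDataset(Row(r.file, {**r.meta, name: fn(r)}) for r in self.rows)

    def filter(self, pred: Callable[[Row], bool]) -> "RefDataset":
        # Keep only rows matching the predicate; no file data is touched.
        return RefDataset(r for r in self.rows if pred(r))

    def column(self, name: str) -> list:
        return [r.meta[name] for r in self.rows]


def toy_model(row: Row) -> str:
    """Stand-in for a model or LLM call: labels a file by its extension."""
    return "image" if row.file.uri.endswith((".jpg", ".png")) else "other"


ds = RefDataset([
    Row(FileRef("s3://bucket/cat.jpg", 1200)),
    Row(FileRef("s3://bucket/notes.txt", 300)),
    Row(FileRef("s3://bucket/dog.png", 2048)),
])

# Enrich with a model-derived column, then filter on it.
images = ds.map("kind", toy_model).filter(lambda r: r.meta["kind"] == "image")
```

Each transformation returns a new dataset that shares the same `FileRef` objects, so the original `ds` is unchanged and nothing is duplicated in storage.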

Quick Start & Requirements

Highlighted Details

  • Multimodal dataset versioning without data copies, supporting various file types and cloud storage.
  • Python-friendly API for operating on data objects and running high-scale computations.
  • Data enrichment via local AI models and LLM APIs, with filtering, joining, and vector search capabilities.
  • High-performance vectorized operations and integration with PyTorch/TensorFlow.
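As a rough illustration of versioning without data copies, the sketch below stores each dataset version as a snapshot of (URI, etag) pairs rather than file contents. `VersionedRegistry` and its methods are hypothetical names for this example only and do not reflect how DataChain implements versioning; the URIs and etags are made up.

```python
class VersionedRegistry:
    """Toy registry: each version is a tuple of (uri, etag) references."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of snapshots

    def save(self, name, records):
        # Store only references; the files themselves stay in external storage.
        snapshot = tuple(sorted((r["uri"], r["etag"]) for r in records))
        self._versions.setdefault(name, []).append(snapshot)
        return len(self._versions[name])  # 1-based version number

    def load(self, name, version=None):
        snapshots = self._versions[name]
        index = -1 if version is None else version - 1
        return list(snapshots[index])

    def diff(self, name, old, new):
        # References added and removed between two versions.
        a, b = set(self.load(name, old)), set(self.load(name, new))
        return sorted(b - a), sorted(a - b)


registry = VersionedRegistry()
v1 = registry.save("audio", [
    {"uri": "s3://bucket/a.wav", "etag": "e1"},
    {"uri": "s3://bucket/b.wav", "etag": "e2"},
])
v2 = registry.save("audio", [
    {"uri": "s3://bucket/a.wav", "etag": "e1"},
    {"uri": "s3://bucket/b.wav", "etag": "e3"},  # file changed upstream
    {"uri": "s3://bucket/c.wav", "etag": "e4"},  # new file
])
added, removed = registry.diff("audio", v1, v2)
```

Because a version is just a list of references, creating a new one costs a few bytes per file regardless of how large the files are.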

Maintenance & Community

Licensing & Compatibility

  • License not explicitly stated in the README.
  • Compatibility with commercial use and closed-source linking is unconfirmed; it holds only if the project uses a permissive license.

Limitations & Caveats

  • The README points teams that need centralized registries, data lineage, a UI for multimodal data, and scalable compute for 100M+ files to DataChain Studio, a proprietary platform.
  • The specific open-source license is not detailed in the provided README.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 57
  • Issues (30d): 26

Star History

  • 89 stars in the last 90 days

Explore Similar Projects

Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

NeumAI by NeumTry

0%
858
Data platform for retrieval-augmented generation (RAG)
created 1 year ago
updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alexander Wettig (Author of SWE-bench, SWE-agent), and 2 more.

data-juicer by modelscope

0.7%
5k
Data-Juicer: Data processing system for foundation models
created 2 years ago
updated 1 day ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

argilla by argilla-io

0.4%
5k
Collaboration tool for building high-quality AI datasets
created 4 years ago
updated 5 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Anton Troynikov (Cofounder of Chroma), and 20 more.

llama_index by run-llama

0.3%
43k
Data framework for building LLM-powered agents
created 2 years ago
updated 19 hours ago