DataFlow by OpenDCAI

Data preparation and LLM training system

Created 11 months ago
1,303 stars

Top 30.6% on SourcePulse

1 Expert Loves This Project
Project Summary

DataFlow is a data-centric AI system designed for comprehensive data preparation and LLM training. It targets researchers and developers working with large language models, offering a modular framework to improve LLM performance in specific domains through targeted data processing and pipeline assembly.

How It Works

DataFlow employs a modular operator design, allowing users to build flexible data processing pipelines by combining various operators. These operators, categorized into Generic, Domain-Specific, and Evaluation types, handle tasks from text processing to domain-specific data manipulation. An intelligent DataFlow-agent can dynamically assemble new pipelines by recombining existing operators, enabling automated data workflow orchestration.
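The operator-and-pipeline idea described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not DataFlow's actual API: the Operator and Pipeline classes, the REGISTRY dictionary, the assemble helper, and the three example operators are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable

# NOTE: every name below is hypothetical and does not come from DataFlow.

@dataclass
class Operator:
    name: str
    kind: str  # "generic", "domain", or "evaluation"
    fn: Callable[[list[dict]], list[dict]]

    def run(self, records: list[dict]) -> list[dict]:
        return self.fn(records)


class Pipeline:
    """Runs operators in order, feeding each one the previous output."""

    def __init__(self, operators: list[Operator]):
        self.operators = list(operators)

    def run(self, records: list[dict]) -> list[dict]:
        for op in self.operators:
            records = op.run(records)
        return records


def strip_whitespace(records: list[dict]) -> list[dict]:
    # Generic text cleanup: trim leading and trailing whitespace.
    return [{**r, "text": r["text"].strip()} for r in records]


def drop_empty(records: list[dict]) -> list[dict]:
    # Generic filter: discard records whose text is empty.
    return [r for r in records if r["text"]]


def score_length(records: list[dict]) -> list[dict]:
    # Evaluation-style operator: attach a word count to each record.
    return [{**r, "length_score": len(r["text"].split())} for r in records]


# A registry lets new pipelines be assembled from existing operators by name.
REGISTRY = {
    "strip": Operator("strip", "generic", strip_whitespace),
    "drop_empty": Operator("drop_empty", "generic", drop_empty),
    "score_length": Operator("score_length", "evaluation", score_length),
}


def assemble(names: list[str]) -> Pipeline:
    """Build a pipeline by recombining registered operators."""
    return Pipeline([REGISTRY[n] for n in names])


samples = [{"text": "  What is a JOIN in SQL?  "}, {"text": "   "}]
result = assemble(["strip", "drop_empty", "score_length"]).run(samples)
print(result)
```

Because every operator takes and returns a list of records, any subset can be reordered or reused in a new pipeline, which is the property an agent would rely on when recombining operators automatically.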

Quick Start & Requirements

  • Install via pip: pip install open-dataflow
  • For GPU inference: pip install open-dataflow[vllm]
  • Requires Python >= 3.10.
  • Offers Gradio web interfaces, launched with "dataflow webui" (operators) and "dataflow webui agent" (agents).
  • Documentation available at: https://OpenDCAI.github.io/DataFlow-Doc/

Highlighted Details

  • Empirically validated to improve domain-oriented LLM performance in healthcare, finance, and law.
  • Includes over 140 operators (80+ Generic, 40+ Domain-Specific, 20+ Evaluation).
  • Offers pre-built pipelines for Text, Reasoning, Text2SQL, Knowledge Base Cleaning, and Agentic RAG.
  • Features an agent capable of writing custom operators and orchestrating pipelines.

Maintenance & Community

  • Active development with recent release announcements.
  • Community support via GitHub Issues and Pull Requests.
  • Several of the system's components are backed by published research papers.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is newly released (June 2025) and may still be undergoing rapid development. Specific performance benchmarks are detailed in the documentation, but real-world performance may vary.

Health Check

  • Last commit: 16 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 33
  • Issues (30d): 8
  • Star history: 163 stars in the last 30 days
