Data preparation and LLM training system
Top 30.6% on SourcePulse
DataFlow is a data-centric AI system designed for comprehensive data preparation and LLM training. It targets researchers and developers working with large language models, offering a modular framework to improve LLM performance in specific domains through targeted data processing and pipeline assembly.
How It Works
DataFlow employs a modular operator design, allowing users to build flexible data processing pipelines by combining various operators. These operators, categorized into Generic, Domain-Specific, and Evaluation types, handle tasks from text processing to domain-specific data manipulation. An intelligent DataFlow-agent can dynamically assemble new pipelines by recombining existing operators, enabling automated data workflow orchestration.
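The operator-and-pipeline idea described above can be illustrated with a minimal sketch. This is not DataFlow's actual API: `Operator`, `Pipeline`, `REGISTRY`, `assemble`, and the three example operator functions are hypothetical names chosen for illustration, and the real library's classes and signatures may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Operator:
    """Hypothetical operator: a named, categorized function over records."""
    name: str
    kind: str  # "generic", "domain", or "evaluation" (categories from the text)
    fn: Callable[[List[Record]], List[Record]]

    def run(self, records: List[Record]) -> List[Record]:
        return self.fn(records)


class Pipeline:
    """Hypothetical pipeline: applies operators in order."""

    def __init__(self, operators: List[Operator]) -> None:
        self.operators = list(operators)

    def run(self, records: List[Record]) -> List[Record]:
        for op in self.operators:
            records = op.run(records)
        return records


def strip_whitespace(records: List[Record]) -> List[Record]:
    # Generic text cleanup: trim leading and trailing whitespace.
    return [{**r, "text": r["text"].strip()} for r in records]


def drop_short(records: List[Record]) -> List[Record]:
    # Filter step: discard records shorter than 10 characters.
    return [r for r in records if len(r["text"]) >= 10]


def score_length(records: List[Record]) -> List[Record]:
    # Evaluation step: attach a simple length-based score.
    return [{**r, "score": len(r["text"])} for r in records]


# Registry of available operators, keyed by name (assumed design).
REGISTRY: Dict[str, Operator] = {
    op.name: op
    for op in [
        Operator("strip_whitespace", "generic", strip_whitespace),
        Operator("drop_short", "generic", drop_short),
        Operator("score_length", "evaluation", score_length),
    ]
}


def assemble(names: List[str]) -> Pipeline:
    """Build a new pipeline by recombining registered operators by name."""
    return Pipeline([REGISTRY[n] for n in names])


pipeline = assemble(["strip_whitespace", "drop_short", "score_length"])
result = pipeline.run([
    {"text": "  hello world, this is data  "},
    {"text": " hi "},
])
```

In this sketch, an agent-style orchestrator would only need to choose a list of operator names and pass it to `assemble`; the same registered operators can be reordered or reused to form a different pipeline without modifying the operators themselves.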
Quick Start & Requirements
Install: pip install open-dataflow
Install with the optional vllm extra: pip install open-dataflow[vllm]
Web UI: dataflow webui and dataflow webui agent
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is newly released (June 2025) and may still be undergoing rapid development. Specific performance benchmarks are detailed in the documentation, but real-world performance may vary.