OpenDCAI DataFlex: enhance LLM training with dynamic data scheduling
DataFlex is a dynamic training framework for Large Language Models (LLMs) that improves model performance and experimental reproducibility. Aimed at researchers and developers, it intelligently schedules training data through selection, mixture, and reweighting strategies, integrating several complex techniques into a unified, user-friendly system.
How It Works
Built upon LLaMA-Factory, DataFlex dynamically optimizes training data during the LLM training loop. It integrates and provides reproducible implementations for Data Selection (e.g., gradient-based, loss-based, distribution-based methods like LESS, NICE, TSDS), Data Mixture (e.g., DOREMI, ODM for adjusting domain ratios), and Data Reweighting (e.g., loss-based sample emphasis). This unified approach offers more flexible and powerful training control compared to static methods.
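DataFlex's own APIs are not shown on this page, but the loss-based reweighting idea it implements can be sketched in a few lines: samples with higher loss receive proportionally larger training weight. This is a minimal, self-contained illustration of the general technique, not DataFlex code; the function name and softmax-over-losses scheme are assumptions.

```python
import math

def loss_based_weights(losses, temperature=1.0):
    """Illustrative loss-based reweighting: map per-sample losses to
    normalized training weights via a temperature-scaled softmax, so
    harder (higher-loss) samples are emphasized. Not DataFlex's API."""
    scaled = [l / temperature for l in losses]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# The sample with loss 2.0 receives the largest weight.
weights = loss_based_weights([0.5, 2.0, 1.0])
```

A lower temperature sharpens the distribution toward the hardest samples; a higher one flattens it back toward uniform weighting.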
Quick Start & Requirements
Installation is straightforward via pip: pip install dataflex. For development, clone the repository and install from source. Python 3.11+ is recommended; Python 3.10 users may need to install llamafactory manually. Training is initiated with the dataflex-cli command and YAML configuration files, as in LLaMA-Factory, plus additional DataFlex-specific parameters. Official documentation is available at https://OpenDCAI.github.io/DataFlex-Doc/.
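Since DataFlex is described as a drop-in replacement for LLaMA-Factory, a config might follow LLaMA-Factory's YAML conventions. The sketch below is hypothetical: the standard keys are borrowed from LLaMA-Factory, and the DataFlex-specific key names are illustrative placeholders, not taken from the DataFlex documentation.

```yaml
# Hypothetical DataFlex training config, following LLaMA-Factory conventions.
model_name_or_path: meta-llama/Llama-3-8B
stage: sft
dataset: alpaca_en
output_dir: ./saves/dataflex-demo

# DataFlex-specific scheduling options (key names are illustrative only;
# consult https://OpenDCAI.github.io/DataFlex-Doc/ for the real parameters).
dataflex_strategy: selection
dataflex_method: loss_based
```

Training would then be launched with something like `dataflex-cli train my_config.yaml`, where the `train` subcommand is assumed from LLaMA-Factory's CLI rather than confirmed by this page.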
Maintenance & Community
The project welcomes contributions via GitHub Pull Requests and encourages users to report bugs or suggest features via GitHub Issues. Community groups are available for discussion and collaboration. Zhongguancun Academy provides API and GPU support.
Licensing & Compatibility
The project's license is detailed in the LICENSE file within the repository, accessible at https://github.com/OpenDCAI/DataFlex/blob/main/LICENSE. DataFlex is fully compatible with, and acts as a drop-in replacement for, LLaMA-Factory.
Limitations & Caveats
Official repositories for several integrated algorithms are marked as having issues or being unavailable. The project was first announced in late 2025, indicating it is relatively new.