DataFlex by OpenDCAI

Enhance LLM training with dynamic data scheduling

Created 8 months ago
344 stars

Top 80.6% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

DataFlex is a dynamic training framework for Large Language Models (LLMs) that enhances model performance and experimental reproducibility. It targets researchers and developers by intelligently scheduling training data through selection, mixture, and reweighting strategies, integrating several complex techniques into a unified, user-friendly system.

How It Works

Built upon LLaMA-Factory, DataFlex dynamically optimizes training data during the LLM training loop. It integrates and provides reproducible implementations for Data Selection (e.g., gradient-based, loss-based, distribution-based methods like LESS, NICE, TSDS), Data Mixture (e.g., DOREMI, ODM for adjusting domain ratios), and Data Reweighting (e.g., loss-based sample emphasis). This unified approach offers more flexible and powerful training control compared to static methods.
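To make the reweighting idea concrete, here is a minimal, self-contained sketch of loss-based sample reweighting in plain Python. This is an illustration of the general technique, not DataFlex's actual API: the function name, the softmax weighting scheme, and the temperature parameter are all assumptions chosen for clarity.

```python
import math

def reweight_by_loss(per_sample_losses, temperature=1.0):
    """Loss-based reweighting sketch: softmax over per-sample losses,
    so harder (higher-loss) samples contribute more to the batch loss.
    Hypothetical helper -- not the DataFlex implementation."""
    scaled = [loss / temperature for loss in per_sample_losses]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted batch loss: sum_i w_i * loss_i, with weights summing to 1.
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))

losses = [0.5, 2.0, 1.0]
weighted = reweight_by_loss(losses)  # exceeds the unweighted mean,
                                     # since high-loss samples are emphasized
```

A higher temperature flattens the weights toward a uniform average; a lower one concentrates training signal on the hardest samples.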

Quick Start & Requirements

Installation is straightforward via pip (pip install dataflex); for development, clone the repository and install from source. Python 3.11+ is recommended, and on Python 3.10 llamafactory may need to be installed manually. Training is launched with the dataflex-cli command and YAML configuration files, much like LLaMA-Factory, but with additional DataFlex-specific parameters. Official documentation is available at https://OpenDCAI.github.io/DataFlex-Doc/.
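A quick-start sketch of the commands above. The pip install comes from the project's own instructions; the training invocation is an assumption modeled on LLaMA-Factory's "train <config.yaml>" convention, and the config filename is hypothetical, so check the official docs for the exact subcommand and parameters.

```shell
# Confirmed by the project summary:
pip install dataflex

# Hypothetical invocation, assuming dataflex-cli mirrors
# LLaMA-Factory's CLI pattern (verify against the DataFlex docs):
dataflex-cli train my_dataflex_config.yaml
```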

Highlighted Details

  • A technical report achieved the #1 rank on the Hugging Face Daily Papers leaderboard on April 4, 2026.
  • Supports gradient computation under DeepSpeed ZeRO-3, enabling training of larger-scale models.
  • Experimental results demonstrate performance improvements over default LLaMA-Factory training, with data selection/reweighting outperforming random baselines on MMLU benchmarks, and data mixture methods achieving higher MMLU accuracy and lower perplexity.

Maintenance & Community

The project welcomes contributions via GitHub Pull Requests and encourages users to report bugs or suggest features via GitHub Issues. Community groups are available for discussion and collaboration. Zhongguancun Academy provides API and GPU support.

Licensing & Compatibility

The project's license is detailed in the LICENSE file within the repository, accessible via https://github.com/OpenDCAI/DataFlex/blob/main/LICENSE. It offers full compatibility and acts as a drop-in replacement for LLaMA-Factory.

Limitations & Caveats

Official repositories for several integrated algorithms are marked as having issues or being unavailable. The project was first announced in late 2025, indicating it is relatively new.

Health Check

Last Commit: 5 days ago
Responsiveness: Inactive
Pull Requests (30d): 5
Issues (30d): 6
Star History: 237 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 7 more.

lingua by facebookresearch

Top 0.1% · 5k stars
LLM research codebase for training and inference
Created 1 year ago · Updated 9 months ago