DataFlex by OpenDCAI

Enhance LLM training with dynamic data scheduling

Created 8 months ago
344 stars

Top 80.6% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

DataFlex is a dynamic training framework for Large Language Models (LLMs) that enhances model performance and experimental reproducibility. It targets researchers and developers by intelligently scheduling training data through selection, mixture, and reweighting strategies, integrating several complex techniques into a unified, user-friendly system.

How It Works

Built upon LLaMA-Factory, DataFlex dynamically optimizes training data during the LLM training loop. It integrates and provides reproducible implementations for Data Selection (e.g., gradient-based, loss-based, distribution-based methods like LESS, NICE, TSDS), Data Mixture (e.g., DOREMI, ODM for adjusting domain ratios), and Data Reweighting (e.g., loss-based sample emphasis). This unified approach offers more flexible and powerful training control compared to static methods.
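To make the reweighting idea concrete, here is a minimal, self-contained sketch of loss-based sample reweighting in plain Python. This is an illustration of the general technique, not DataFlex's actual API: the function name, the softmax weighting scheme, and the temperature parameter are all assumptions chosen for clarity.

```python
import math

def reweight_by_loss(per_sample_losses, temperature=1.0):
    """Loss-based reweighting sketch: softmax over per-sample losses,
    so harder (higher-loss) samples contribute more to the batch loss.
    Hypothetical helper -- not the DataFlex implementation."""
    scaled = [loss / temperature for loss in per_sample_losses]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted batch loss: sum_i w_i * loss_i, with weights summing to 1.
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))

losses = [0.5, 2.0, 1.0]
weighted = reweight_by_loss(losses)  # exceeds the unweighted mean,
                                     # since high-loss samples are emphasized
```

A higher temperature flattens the weights toward a uniform average; a lower one concentrates training signal on the hardest samples.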

Quick Start & Requirements

Installation is straightforward via pip (pip install dataflex); for development, clone the repository and install from source. Python 3.11+ is recommended, and on Python 3.10 llamafactory may need to be installed manually. Training is launched with the dataflex-cli command and YAML configuration files, much like LLaMA-Factory, but with additional DataFlex-specific parameters. Official documentation is available at https://OpenDCAI.github.io/DataFlex-Doc/.
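A quick-start sketch of the commands above. The pip install comes from the project's own instructions; the training invocation is an assumption modeled on LLaMA-Factory's "train <config.yaml>" convention, and the config filename is hypothetical, so check the official docs for the exact subcommand and parameters.

```shell
# Confirmed by the project summary:
pip install dataflex

# Hypothetical invocation, assuming dataflex-cli mirrors
# LLaMA-Factory's CLI pattern (verify against the DataFlex docs):
dataflex-cli train my_dataflex_config.yaml
```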

Highlighted Details

  • A technical report achieved the #1 rank on the Hugging Face Daily Papers leaderboard on April 4, 2026.
  • Supports gradient computation under DeepSpeed ZeRO-3, enabling training of larger-scale models.
  • Experimental results demonstrate performance improvements over default LLaMA-Factory training, with data selection/reweighting outperforming random baselines on MMLU benchmarks, and data mixture methods achieving higher MMLU accuracy and lower perplexity.

Maintenance & Community

The project welcomes contributions via GitHub Pull Requests and encourages users to report bugs or suggest features via GitHub Issues. Community groups are available for discussion and collaboration. Zhongguancun Academy provides API and GPU support.

Licensing & Compatibility

The project's license is detailed in the LICENSE file within the repository, accessible via https://github.com/OpenDCAI/DataFlex/blob/main/LICENSE. It offers full compatibility and acts as a drop-in replacement for LLaMA-Factory.

Limitations & Caveats

Official repositories for several integrated algorithms are marked as having issues or being unavailable. The project was first announced in late 2025, indicating it is relatively new.

Health Check

Last Commit: 5 days ago
Responsiveness: Inactive
Pull Requests (30d): 5
Issues (30d): 6
Star History: 237 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 7 more.

lingua by facebookresearch

Top 0.1% · 5k stars
LLM research codebase for training and inference
Created 1 year ago · Updated 9 months ago