FlagData by FlagOpen

Data processing toolkit for AI model training and deployment

Created 2 years ago
353 stars

Top 79.0% on SourcePulse

Project Summary

FlagData is an open-source toolkit designed for efficient and high-quality data processing in AI development, particularly for Natural Language Processing (NLP) and Computer Vision. It caters to both novice and advanced users by offering a flexible, one-stop solution for data acquisition, preparation, preprocessing, and analysis, aiming to reduce data processing costs and improve model training outcomes.

How It Works

FlagData provides a modular pipeline with a diverse operator pool for custom data construction. It leverages techniques like fastText for language identification, BERT and fastText for quality assessment, and MinHashLSH with Spark for distributed deduplication. The toolkit supports various data types and offers pre-built cleaning tasks for formats like HTML, Text, Books, and Arxiv papers, enabling users to build custom LLM pre-training data pipelines.
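
The MinHash-plus-LSH idea behind near-duplicate detection can be illustrated with a small, single-machine sketch. This is not FlagData's implementation, which the summary describes as running on Spark; the function names (shingles, minhash, candidate_pairs, estimated_jaccard) and the constants NUM_PERM and BANDS are assumptions chosen for illustration only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64   # hash functions per signature (illustrative value)
BANDS = 16      # LSH bands; each band covers NUM_PERM // BANDS rows


def shingles(text, n=3):
    """Return the set of word n-grams in the text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """Signature: for each seed, the minimum hash over all shingles."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def candidate_pairs(docs):
    """Bucket signatures band by band; documents sharing a bucket are candidates."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(BANDS):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Calling candidate_pairs on a dict that maps document IDs to text returns the ID pairs worth comparing in full. More bands with fewer rows each catch lower-similarity pairs at the cost of more false candidates.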

Quick Start & Requirements

  • Clone the main branch with git clone https://github.com/FlagOpen/FlagData.git, then install the dependencies from the repository root with pip install -r requirements.txt.
  • Dependencies are listed in requirements.txt.
  • Official documentation and examples are available via links within the README.

Highlighted Details

  • Supports multiple data types including HTML, Web, Wiki, Book, Paper, QA, and Redpajama.
  • Offers dozens of customizable operators for DIY data construction processes.
  • Features one-click high-quality data generation for various formats.
  • Integrates with Spark for distributed data processing capabilities.
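
As a rough illustration of what composing operators into a custom cleaning pipeline can look like, here is a minimal sketch. It does not use FlagData's operator API; strip_html, normalize_whitespace, min_words, and run_pipeline are hypothetical names, and the convention that an operator returns None to drop a record is an assumption made for this example.

```python
import re
from typing import Callable, Iterable, Iterator, Optional

# An operator takes a text record and returns cleaned text, or None to drop it.
Operator = Callable[[str], Optional[str]]


def strip_html(text: str) -> Optional[str]:
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def normalize_whitespace(text: str) -> Optional[str]:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def min_words(n: int) -> Operator:
    """Build a filter that drops records shorter than n words."""
    def op(text: str) -> Optional[str]:
        return text if len(text.split()) >= n else None
    return op


def run_pipeline(records: Iterable[str], operators: Iterable[Operator]) -> Iterator[str]:
    """Apply each operator in order; skip a record as soon as one returns None."""
    ops = list(operators)
    for text in records:
        current: Optional[str] = text
        for op in ops:
            current = op(current)
            if current is None:
                break
        if current is not None:
            yield current
```

Because every operator shares one signature, a step can be reordered, removed, or swapped without touching the others.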

Maintenance & Community

The project is actively maintained, with recent updates including v3.0.0. Community support is encouraged through GitHub Issues, Discussions, and direct email contact (data@baai.ac.cn). Regular online/offline exchanges with experts are planned.

Licensing & Compatibility

FlagData is licensed under the Apache 2.0 license, which permits commercial use and linking with closed-source projects.

Limitations & Caveats

While Spark integration is supported, users need to ensure a Spark cluster is available for distributed tasks. Some functions may require optimization for efficient execution in a distributed Spark environment.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

2 stars in the last 30 days

Explore Similar Projects

Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 11 more.

datatrove by huggingface

0.9%
3k
Data processing library for large-scale text data
Created 2 years ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Alexander Wettig (Coauthor of SWE-bench, SWE-agent), and 5 more.

data-juicer by modelscope

0.7%
5k
Data-Juicer: Data processing system for foundation models
Created 2 years ago
Updated 1 day ago