FlagData by FlagOpen

Data processing toolkit for AI model training and deployment

Created 3 years ago

363 stars

Top 77.8% on SourcePulse

Project Summary

FlagData is an open-source toolkit designed for efficient and high-quality data processing in AI development, particularly for Natural Language Processing (NLP) and Computer Vision. It caters to both novice and advanced users by offering a flexible, one-stop solution for data acquisition, preparation, preprocessing, and analysis, aiming to reduce data processing costs and improve model training outcomes.

How It Works

FlagData provides a modular pipeline with a diverse operator pool for custom data construction. It leverages techniques like fastText for language identification, BERT and fastText for quality assessment, and MinHashLSH with Spark for distributed deduplication. The toolkit supports various data types and offers pre-built cleaning tasks for formats like HTML, Text, Books, and Arxiv papers, enabling users to build custom LLM pre-training data pipelines.
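
To illustrate the deduplication idea mentioned above, here is a minimal single-process sketch of MinHash with LSH banding, using only the Python standard library. It is not FlagData's implementation, which the summary says runs on Spark. The function names (shingles, minhash_signature, deduplicate) and the parameters (k, num_perm, bands, threshold) are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(shingle_set, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def deduplicate(docs, num_perm=64, bands=16, threshold=0.7):
    """Keep the first document of every near-duplicate group (illustrative only)."""
    rows = num_perm // bands
    buckets = defaultdict(list)  # (band index, band values) -> indices of kept docs
    shingle_sets = []
    kept = []
    for idx, text in enumerate(docs):
        sh = shingles(text)
        shingle_sets.append(sh)
        sig = minhash_signature(sh, num_perm)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        # LSH step: only documents sharing at least one band bucket are compared.
        candidates = {j for key in keys for j in buckets.get(key, ())}
        # Verification step: confirm candidates with exact Jaccard similarity.
        if any(jaccard(sh, shingle_sets[j]) >= threshold for j in candidates):
            continue
        kept.append(idx)
        for key in keys:
            buckets[key].append(idx)
    return [docs[i] for i in kept]


if __name__ == "__main__":
    corpus = [
        "Data cleaning matters.",
        "data   cleaning MATTERS.",  # same text after normalization
        "Spark handles scale.",
    ]
    print(deduplicate(corpus))
```

Banding trades recall for speed: more bands with fewer rows each catch lower-similarity pairs at the cost of more candidate comparisons. A distributed version would shard the bucket table across workers rather than hold it in one dictionary.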

Quick Start & Requirements

Clone the main branch with git clone https://github.com/FlagOpen/FlagData.git, then install the dependencies from the repository root with pip install -r requirements.txt.
Dependencies are listed in requirements.txt.
Official documentation and examples are available via links within the README.

Highlighted Details

Supports multiple data types, including HTML, Web, Wiki, Book, Paper, QA, and RedPajama.
Offers dozens of customizable operators for DIY data construction processes.
Features one-click high-quality data generation for various formats.
Integrates with Spark for distributed data processing capabilities.
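
The "customizable operators" idea can be pictured as a chain of small functions applied to each record. The sketch below is a hypothetical illustration in plain Python, not FlagData's operator API: the names strip_html, normalize_whitespace, min_length, and run_pipeline are assumptions, as is the convention that an operator returns None to drop a record.

```python
import html
import re
from typing import Callable, Iterable, List, Optional

# An operator maps a text record to cleaned text, or to None to drop the record.
Operator = Callable[[str], Optional[str]]


def strip_html(text: str) -> Optional[str]:
    """Remove script/style blocks and tags, then decode HTML entities."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(text)


def normalize_whitespace(text: str) -> Optional[str]:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


def min_length(n: int) -> Operator:
    """Build a filter operator that drops records shorter than n characters."""
    def op(text: str) -> Optional[str]:
        return text if len(text) >= n else None
    return op


def run_pipeline(records: Iterable[str], operators: List[Operator]) -> List[str]:
    """Apply operators in order; a record is dropped as soon as one returns None."""
    cleaned = []
    for text in records:
        for op in operators:
            text = op(text)
            if text is None:
                break
        else:
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = ["<p>Hello &amp; welcome to   the corpus</p>", "<b>hi</b>"]
    pipeline = [strip_html, normalize_whitespace, min_length(10)]
    print(run_pipeline(raw, pipeline))
```

Because every operator shares one signature, reordering, removing, or adding a step only changes the list passed to run_pipeline.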

Maintenance & Community

The README describes the project as actively maintained and lists v3.0.0 among its recent updates, although the most recent commit is about a year old (see Health Check). Community support is offered through GitHub Issues, Discussions, and direct email (data@baai.ac.cn). The maintainers also plan regular online and offline exchanges with experts.

Licensing & Compatibility

FlagData is licensed under the Apache 2.0 license, which permits commercial use and linking with closed-source projects.

Limitations & Caveats

While Spark integration is supported, users need to ensure a Spark cluster is available for distributed tasks. Some functions may require optimization for efficient execution in a distributed Spark environment.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days