FlagData by FlagOpen

Data processing toolkit for AI model training and deployment

created 2 years ago
348 stars

Top 80.9% on sourcepulse

Project Summary

FlagData is an open-source toolkit designed for efficient and high-quality data processing in AI development, particularly for Natural Language Processing (NLP) and Computer Vision. It caters to both novice and advanced users by offering a flexible, one-stop solution for data acquisition, preparation, preprocessing, and analysis, aiming to reduce data processing costs and improve model training outcomes.

How It Works

FlagData provides a modular pipeline with a diverse operator pool for custom data construction. It leverages techniques like fastText for language identification, BERT and fastText for quality assessment, and MinHashLSH with Spark for distributed deduplication. The toolkit supports various data types and offers pre-built cleaning tasks for formats like HTML, Text, Books, and Arxiv papers, enabling users to build custom LLM pre-training data pipelines.
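
To make the deduplication idea concrete, here is a minimal single-machine sketch of MinHash signatures with LSH banding, written in plain Python. It is not FlagData's implementation, which the summary describes as running on Spark. The function names (`shingles`, `minhash_signature`, `candidate_pairs`, `near_duplicates`), the 5-character shingles, the 64 hash functions, the 16 bands, and the 0.7 threshold are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of k-character shingles of lowercased, whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """Keep the minimum hash per seeded hash function; each seed stands in for one permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents whose signatures agree on every row of some band become candidates."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs, threshold=0.7, num_perm=64, bands=16):
    """Return sorted (id, id) pairs whose estimated similarity reaches the threshold."""
    signatures = {
        doc_id: minhash_signature(shingles(text), num_perm)
        for doc_id, text in docs.items()
    }
    return sorted(
        (a, b) for a, b in candidate_pairs(signatures, bands)
        if estimated_jaccard(signatures[a], signatures[b]) >= threshold
    )


if __name__ == "__main__":
    docs = {
        "a": "The quick brown fox jumps over the lazy dog near the river bank.",
        "b": "The quick brown fox jumps over the lazy dog near the river bank!",
        "c": "Completely unrelated sentence about distributed data processing.",
    }
    print(near_duplicates(docs))
```

Banding keeps the comparison step cheap: only documents that collide in at least one band are compared in full, which is the property that makes the approach practical to distribute across a cluster.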

Quick Start & Requirements

  • Clone the main branch with git clone https://github.com/FlagOpen/FlagData.git, then install the dependencies with pip install -r requirements.txt.
  • Dependencies are listed in requirements.txt.
  • Official documentation and examples are available via links within the README.

Highlighted Details

  • Supports multiple data types including HTML, Web, Wiki, Book, Paper, QA, and Redpajama.
  • Offers dozens of customizable operators for DIY data construction processes.
  • Features one-click high-quality data generation for various formats.
  • Integrates with Spark for distributed data processing capabilities.
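
The operator-pool idea can be sketched as a chain of small functions, each of which either transforms a record or drops it. The following is a hypothetical illustration, not FlagData's API: the `Pipeline` class and the operators `strip_html_tags`, `normalize_whitespace`, and `min_length` are names invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional

# An operator takes one text record and returns the cleaned text,
# or None to drop the record from the output.
Operator = Callable[[str], Optional[str]]


def strip_html_tags(text: str) -> Optional[str]:
    """Replace anything that looks like an HTML tag with a space (crude regex, not a parser)."""
    return re.sub(r"<[^>]+>", " ", text)


def normalize_whitespace(text: str) -> Optional[str]:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def min_length(n: int) -> Operator:
    """Build an operator that drops records shorter than n characters."""
    def op(text: str) -> Optional[str]:
        return text if len(text) >= n else None
    return op


@dataclass
class Pipeline:
    """Apply operators in order; a record is dropped as soon as one returns None."""
    operators: List[Operator]

    def run(self, records: Iterable[str]) -> Iterator[str]:
        for record in records:
            current: Optional[str] = record
            for op in self.operators:
                current = op(current)
                if current is None:
                    break
            if current is not None:
                yield current


if __name__ == "__main__":
    pipeline = Pipeline([strip_html_tags, normalize_whitespace, min_length(10)])
    print(list(pipeline.run(["<p>Hello,   world</p>", "<br>hi"])))
    # ['Hello, world']
```

Because each operator shares the same signature, steps can be reordered, removed, or swapped for custom ones without touching the pipeline itself.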

Maintenance & Community

Updates to the project include v3.0.0. Support is available through GitHub Issues, Discussions, and email (data@baai.ac.cn). The maintainers also plan regular online and offline exchanges with experts.

Licensing & Compatibility

FlagData is licensed under the Apache 2.0 license, which permits commercial use and linking with closed-source projects.

Limitations & Caveats

While Spark integration is supported, users need to ensure a Spark cluster is available for distributed tasks. Some functions may require optimization for efficient execution in a distributed Spark environment.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days

Explore Similar Projects

data-juicer by modelscope
Data-Juicer: Data processing system for foundation models
Top 0.7%, 5k stars, created 2 years ago, updated 1 day ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Alexander Wettig (author of SWE-bench, SWE-agent), and 2 more.

argilla by argilla-io
Collaboration tool for building high-quality AI datasets
Top 0.4%, 5k stars, created 4 years ago, updated 5 days ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), and 4 more.

LightRAG by HKUDS
RAG framework for fast, simple retrieval-augmented generation
Top 1.0%, 19k stars, created 10 months ago, updated 21 hours ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Elie Bursztein (Cybersecurity Lead at Google DeepMind).