data_engineering_book by datascale-ai

LLM Data Engineering Resource

Created 5 months ago

1,245 stars

Top 31.0% on SourcePulse

Project Summary

This repository hosts a book that addresses the lack of systematic resources on data engineering for Large Language Models (LLMs). It targets LLM R&D engineers, data engineers, MLOps professionals, and technical AI product managers who need to manage the data lifecycle for LLMs. The book pairs theoretical foundations with practical application, so readers can apply its data-refinement practices directly.

How It Works

The book systematically covers the entire LLM data engineering pipeline, from raw data acquisition and pre-training corpus cleaning to multi-modal data processing, alignment data construction (SFT, RLHF, CoT), and Retrieval Augmented Generation (RAG) data pipelines. It integrates the Data-Centric AI philosophy, emphasizing data quality's impact on model performance. The approach is grounded in modern, scalable technologies and includes five end-to-end practical projects with runnable code and detailed architecture designs.
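To make the "pre-training corpus cleaning" stage concrete, here is a minimal sketch of what such a step might look like. It is not taken from the book: the function names (`normalize`, `passes_quality`, `clean_corpus`) and the thresholds are hypothetical, chosen only to illustrate heuristic filtering followed by exact deduplication.

```python
import hashlib
import re


def normalize(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def passes_quality(text, min_words=5, max_symbol_ratio=0.3):
    """Hypothetical heuristic: reject very short or symbol-heavy documents."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def clean_corpus(docs):
    """Normalize, filter, and drop exact (case-insensitive) duplicates."""
    seen = set()
    kept = []
    for raw in docs:
        text = normalize(raw)
        if not text or not passes_quality(text):
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical document was already kept
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    "Data quality  strongly shapes\n model performance.",
    "data quality strongly shapes model performance.",
    "too short",
    "@@@ ### $$$ %%% ^^^ &&&",
]
cleaned = clean_corpus(docs)
print(cleaned)
```

Real pipelines usually replace these toy rules with language identification, perplexity filtering, and near-duplicate detection, and run them on a distributed engine rather than in a single loop.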

Quick Start & Requirements

Prerequisites: Python 3.8+
Installation: Clone the repository and install dependencies using pip install mkdocs-material mkdocs-glightbox pymdown-extensions "mkdocs-static-i18n[material]".
Local Preview: Run mkdocs serve to preview the book at http://127.0.0.1:8000.
Build Static Site: Execute mkdocs build.
Online Reading: Available at https://datascale-ai.github.io/data_engineering_book/

Highlighted Details

Comprehensive Scope: Covers LLM data lifecycle from pre-training to fine-tuning, RLHF, and RAG, including advanced topics like Scaling Laws and multi-modal alignment.
Modern Tech Stack: Utilizes distributed computing (Ray Data, Spark), data storage (Parquet, WebDataset, Vector DBs), text processing (Trafilatura, KenLM, MinHash LSH), multi-modal tools (CLIP, ColPali, img2dataset), and data versioning (DVC, LakeFS).
Practical Projects: Features five end-to-end projects, including building a "Mini-C4" dataset, legal expert SFT, LLaVA multi-modal dataset, synthetic math/code textbooks, and a multi-modal RAG financial report assistant.
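As an illustration of the MinHash LSH technique named in the tech stack above, the following is a minimal pure-Python sketch of near-duplicate detection. It is an assumption-laden toy, not the book's implementation: the helper names, the 3-word shingle size, the 128 hash functions, and the 32-band split are all hypothetical choices, and seeded BLAKE2b hashes stand in for random permutations.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=128, k=3):
    """Signature: the minimum hash of the shingle set under each seed."""
    toks = shingles(text, k)
    if not toks:
        return [0] * num_perm
    return [min(_hash(seed, t) for t in toks) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Documents sharing any full band of their signature become candidates."""
    rows = len(signatures[0]) // bands
    buckets = {}
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs


docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank today!",
    "The quick brown fox jumps over the lazy dog near the river bank tonight",
    "Parquet files store columnar data efficiently for large scale analytics workloads",
]
sigs = [minhash_signature(d) for d in docs]
candidates = lsh_candidate_pairs(sigs)
```

Banding trades recall for speed: more bands with fewer rows each surface lower-similarity pairs as candidates, which are then verified with the estimated (or exact) Jaccard score. At corpus scale, a maintained library and a distributed engine would normally replace this in-memory loop.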

Maintenance & Community

Contributions are welcome via Issues and Pull Requests. Specific questions can also be raised through GitHub Issues.

Licensing & Compatibility

The project is licensed under the MIT License, which permits broad use, including commercial applications and integration into closed-source projects.

Limitations & Caveats

No specific limitations, alpha status, or known bugs are mentioned in the provided README. The content appears to be a stable, comprehensive resource.

Explore Similar Projects

flock by dais-polymtl

deltacat by ray-project

beyondllm by aiplanethub

ezdata by xuwei95

awesome-llm-and-aigc by coderonion

cube-studio by data-infra

awesome-ml by underlines

data-prep-kit by data-prep-kit

lingua by facebookresearch

data-juicer by datajuicer

Daft by Eventual-Inc

pathway by pathwaycom