data_engineering_book  by datascale-ai

LLM Data Engineering Resource

Created 3 weeks ago

693 stars

Top 49.1% on SourcePulse

Project Summary

This repository hosts a book that addresses the lack of systematic resources on data engineering for Large Language Models (LLMs). It targets LLM R&D engineers, data engineers, MLOps professionals, and technical AI product managers who need to manage the data lifecycle for these models. The book pairs theoretical foundations with practical application so that readers can apply established practices in LLM data refinement.

How It Works

The book systematically covers the entire LLM data engineering pipeline, from raw data acquisition and pre-training corpus cleaning to multi-modal data processing, alignment data construction (SFT, RLHF, CoT), and Retrieval Augmented Generation (RAG) data pipelines. It integrates the Data-Centric AI philosophy, emphasizing data quality's impact on model performance. The approach is grounded in modern, scalable technologies and includes five end-to-end practical projects with runnable code and detailed architecture designs.

Quick Start & Requirements

  • Prerequisites: Python 3.8+
  • Installation: Clone the repository and install dependencies using pip install mkdocs-material mkdocs-glightbox pymdown-extensions "mkdocs-static-i18n[material]".
  • Local Preview: Run mkdocs serve to preview the book at http://127.0.0.1:8000.
  • Build Static Site: Execute mkdocs build.
  • Online Reading: Available at https://datascale-ai.github.io/data_engineering_book/

Highlighted Details

  • Comprehensive Scope: Covers LLM data lifecycle from pre-training to fine-tuning, RLHF, and RAG, including advanced topics like Scaling Laws and multi-modal alignment.
  • Modern Tech Stack: Utilizes distributed computing (Ray Data, Spark), data storage (Parquet, WebDataset, Vector DBs), text processing (Trafilatura, KenLM, MinHash LSH), multi-modal tools (CLIP, ColPali, img2dataset), and data versioning (DVC, LakeFS).
  • Practical Projects: Features five end-to-end projects, including building a "Mini-C4" dataset, legal expert SFT, LLaVA multi-modal dataset, synthetic math/code textbooks, and a multi-modal RAG financial report assistant.
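
To make the MinHash LSH entry in the tech stack concrete, here is a minimal pure-Python sketch of near-duplicate detection, the kind of step a "Mini-C4"-style cleaning pipeline typically needs. It is not taken from the book: the function names (shingles, minhash_signature, estimated_jaccard, lsh_candidates) and the parameters (64 hash functions, 16 bands, 3-word shingles) are assumptions chosen for illustration, and a production pipeline would more likely use a dedicated library on top of Ray Data or Spark.

```python
import hashlib
import re

NUM_PERM = 64  # number of hash functions (illustrative choice, not from the book)


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with an integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=NUM_PERM):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text)
    if not tokens:
        return [0] * num_perm
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures, bands=16):
    """Return groups of document ids whose signatures agree on at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog",
        "c": "parquet files and vector databases store very different data",
    }
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    print(estimated_jaccard(sigs["a"], sigs["b"]))
    print(lsh_candidates(sigs))
```

Banding trades recall for precision: more bands with fewer rows each flag more candidate pairs, which can then be confirmed with the signature-based similarity estimate before any document is dropped.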

Maintenance & Community

Contributions are welcome via Issues and Pull Requests. Specific questions can be raised through GitHub Issues.

Licensing & Compatibility

The project is licensed under the MIT License, which permits broad use, including commercial applications and integration into closed-source projects.

Limitations & Caveats

The provided README mentions no specific limitations, alpha status, or known bugs. Based on the README alone, the content appears to be a stable, comprehensive resource.

Health Check

  • Last Commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 23
  • Issues (30d): 4
  • Star History: 700 stars in the last 25 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 7 more.

lingua by facebookresearch

0.0%
5k
LLM research codebase for training and inference
Created 1 year ago
Updated 7 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Alexander Wettig (Coauthor of SWE-bench, SWE-agent), and 5 more.

data-juicer by datajuicer

0.5%
6k
Data-Juicer: Data processing system for foundation models
Created 2 years ago
Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elie Bursztein (Cybersecurity Lead at Google DeepMind).

pathway by pathwaycom

0.1%
60k
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG
Created 3 years ago
Updated 22 hours ago