awesome-llm-pretraining by RUCAIBox

Curated resources for LLM pre-training

Created 8 months ago
307 stars

Top 87.5% on SourcePulse

View on GitHub

Project Summary

This repository serves as a comprehensive, curated guide to resources for Large Language Model (LLM) pre-training. It targets developers, researchers, and practitioners in the open-source LLM community, offering a structured overview of essential data, frameworks, and methodologies. The primary benefit is enabling users to quickly access and understand the landscape of LLM pre-training, from foundational technical reports to cutting-edge techniques and datasets.

How It Works

The project functions as a living document that organizes and links to a wide range of LLM pre-training resources. It groups them into key areas: Technical Reports (covering model architectures such as dense, MoE, and hybrid designs), Training Strategies (frameworks, parallelism, optimizers, and FP8 training), Open-source Datasets (web pages, math, code, general-purpose), and Data Methods (tokenizers, data mixing, data synthesis). This structure is meant to help users navigate the complex landscape of LLM pre-training efficiently.
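
To make the Data Methods category concrete, below is a minimal sketch of temperature-based data mixing, one technique commonly covered by the resources linked there. The corpora, token counts, and temperature are illustrative assumptions, not values taken from the list.

```python
# Minimal sketch of temperature-based data mixing (hypothetical corpora and values).

def mixing_weights(token_counts: dict[str, float], temperature: float = 0.7) -> dict[str, float]:
    """Compute sampling weights proportional to (corpus size) ** temperature.

    temperature < 1 upsamples small corpora relative to their raw share;
    temperature = 1 reproduces proportional-to-size sampling.
    """
    scaled = {name: count ** temperature for name, count in token_counts.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

if __name__ == "__main__":
    corpora = {"web": 5e12, "code": 8e11, "math": 1e11}  # tokens per source (hypothetical)
    for name, weight in mixing_weights(corpora).items():
        print(f"{name}: {weight:.3f}")
```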

Quick Start & Requirements

This repository is a curated collection of links to papers, code, and datasets; there is nothing to install or run. Users follow the provided links to access and use the individual resources.

Highlighted Details

  • Extensive Model Coverage: Features technical reports and papers for major LLM series including LLaMA, Qwen, DeepSeek, Gemma, Mistral, Phi, and many others, covering diverse architectures like Mixture-of-Experts (MoE).
  • End-to-End Training Resources: Encompasses critical training components such as frameworks (Megatron-LM, DeepSpeed), advanced training strategies (scaling laws, FP8, parallelism), and interpretability methods.
  • Rich Data Landscape: Provides access to a wide range of open-source datasets, categorized by source (web pages, code, mathematics) and purpose, alongside detailed data synthesis and tokenization techniques (a minimal tokenizer-training sketch follows this list).
  • Focus on Advancements: Continuously tracks cutting-edge developments and commonly used resources to keep the community informed about the latest in LLM pre-training.
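
As a concrete illustration of the tokenization resources mentioned above, here is a minimal byte-level BPE training sketch using the Hugging Face `tokenizers` library. The library choice, corpus path, vocabulary size, and special tokens are assumptions for illustration; the repository only links to tokenization resources and does not prescribe a specific toolchain.

```python
# Minimal byte-level BPE tokenizer training sketch (hypothetical corpus and settings).
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("tokenizer.json")

print(tokenizer.encode("Curated resources for LLM pre-training").tokens)
```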

Maintenance & Community

The project actively encourages community contributions through GitHub Issues and Pull Requests to expand and update its resource listings, fostering collaborative development in the LLM pre-training space. Specific community channels or active maintainer details are not provided in the excerpt.

Licensing & Compatibility

The repository itself is likely under a permissive open-source license (common for "awesome" lists), though not explicitly stated. However, it aggregates links to numerous external projects, each with its own distinct licensing terms that users must independently review for compatibility, especially for commercial use.

Limitations & Caveats

As a curated list, its value is directly tied to the diligence of its maintainers and community contributions in keeping resources current and comprehensive. Users must independently vet the quality and applicability of linked resources and manage the setup and dependencies of each external project.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Rodrigo Nader (Cofounder of Langflow), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 11 more.

Awesome-LLM by Hannibal046

0.2%
26k
Curated list of Large Language Model resources
Created 2 years ago
Updated 5 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Michael Han (Cofounder of Unsloth), and 18 more.

llm-course by mlabonne

0.8%
73k
LLM course with roadmaps and notebooks
Created 2 years ago
Updated 3 weeks ago