Zero-Chatgpt by AI-Study-Han

Replicate ChatGPT's technical pipeline from scratch

Created 1 year ago
257 stars

Top 98.4% on SourcePulse

Project Summary

This project aims to replicate the technical pipeline of ChatGPT from scratch, targeting researchers and developers interested in understanding and reproducing large language model training. It provides a complete workflow from data collection and cleaning to pre-training, instruction fine-tuning, and RLHF, enabling users to experiment with scaling and optimization.

How It Works

The project follows a standard LLM training methodology, referencing Llama's architecture and Hugging Face's transformers library. It pre-trains on 10B tokens, fine-tunes on 300k instruction-following examples (SFT), and uses 100k examples for RLHF. The chosen model size is 0.1B parameters, with the emphasis on making the code and workflow functional end to end; users can scale up with more data and larger models for improved performance.
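To make the 0.1B figure concrete, here is a back-of-the-envelope parameter count for a Llama-style decoder. The specific hyperparameters (hidden size 768, 12 layers, SwiGLU intermediate size 2048) are illustrative assumptions, not the repo's exact configuration; only the 32,000-token vocabulary comes from the project itself.

```python
# Rough parameter count for a Llama-style decoder-only transformer.
# Hyperparameters below are assumptions for illustration, not the repo's config.

def llama_param_count(vocab_size, hidden, intermediate, n_layers,
                      tied_embeddings=True):
    """Approximate parameter count (ignores biases, which Llama omits)."""
    embed = vocab_size * hidden                 # token embedding table
    attn = 4 * hidden * hidden                  # q, k, v, o projections
    mlp = 3 * hidden * intermediate             # gate, up, down (SwiGLU)
    norms = 2 * hidden                          # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    lm_head = 0 if tied_embeddings else vocab_size * hidden
    return embed + n_layers * per_layer + hidden + lm_head  # +hidden: final norm

total = llama_param_count(vocab_size=32_000, hidden=768,
                          intermediate=2048, n_layers=12)
print(f"{total / 1e9:.2f}B parameters")  # lands in the ~0.1B range
```

Running a quick estimate like this before training helps verify that a chosen configuration actually hits the intended parameter budget.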

Quick Start & Requirements

  • Installation: Requires CUDA 12.1, PyTorch, Transformers, and DeepSpeed. A `requirements.txt` file is provided.
  • Resources: Training was conducted on 2x A40 GPUs, with pre-training taking approximately 2 days.
  • Data & Weights: Pre-trained weights, SFT weights, RLHF weights, and training/fine-tuning/RLHF environment images are available.

Highlighted Details

  • Comprehensive pipeline: Data collection, cleaning, tokenizer training, pre-training, SFT, and RLHF.
  • Data sources: Includes Chinese Wikipedia, Baidu Baike, and SkyPile-150B, with specific cleaning and deduplication steps.
  • Tokenizer: A 32,000-token vocabulary was trained, referencing Llama and Qwen conventions.
  • RLHF implementation: Based on DeepSpeed-Chat, with a Reward Model achieving 92% accuracy.
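The DeepSpeed-Chat approach the project references trains its Reward Model on preference pairs with a pairwise ranking loss: the model scores a chosen and a rejected response, and training minimizes `-log(sigmoid(r_chosen - r_rejected))`. The sketch below is an illustrative, dependency-free version of that loss, not code from this repository:

```python
import math

# Pairwise ranking loss used by DeepSpeed-Chat-style reward models
# (illustrative sketch, not the repo's implementation).

def reward_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed stably as softplus(-x)."""
    x = r_chosen - r_rejected
    # Branch avoids overflow in exp() for large |x|.
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))

# Correct ranking (chosen scored higher) yields a small loss;
# an inverted ranking yields a large one.
print(reward_pair_loss(2.0, 0.5))   # small
print(reward_pair_loss(0.5, 2.0))   # large
```

The reported 92% accuracy would correspond to how often `r_chosen > r_rejected` holds on held-out preference pairs.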

Maintenance & Community

The project is a personal endeavor, with no specific community channels or roadmap mentioned.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The project notes that the 0.1B model size and limited SFT data resulted in suboptimal dialogue capabilities. RLHF results were also unsatisfactory, with model performance degrading over training steps, potentially due to learning rate or data quality issues.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days
