Zero-Chatgpt by AI-Study-Han

Replicate ChatGPT's technical pipeline from scratch

Created 1 year ago
257 stars

Top 98.4% on SourcePulse

Project Summary

This project aims to replicate the technical pipeline of ChatGPT from scratch, targeting researchers and developers interested in understanding and reproducing large language model training. It provides a complete workflow from data collection and cleaning to pre-training, instruction fine-tuning, and RLHF, enabling users to experiment with scaling and optimization.

How It Works

The project follows a standard LLM training methodology, referencing Llama's architecture and Hugging Face's transformers library. It pre-trains on 10B tokens, fine-tunes on 300k instruction-following examples (SFT), and uses 100k examples for RLHF. The chosen model size is 0.1B parameters, with the emphasis on making the code and workflow functional end to end; users can scale up with more data and larger models for improved performance.
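To make the 0.1B figure concrete, here is a back-of-the-envelope parameter count for a Llama-style decoder. The specific hyperparameters (hidden size 768, 12 layers, SwiGLU intermediate size 2048) are illustrative assumptions, not the repo's exact configuration; only the 32,000-token vocabulary comes from the project itself.

```python
# Rough parameter count for a Llama-style decoder-only transformer.
# Hyperparameters below are assumptions for illustration, not the repo's config.

def llama_param_count(vocab_size, hidden, intermediate, n_layers,
                      tied_embeddings=True):
    """Approximate parameter count (ignores biases, which Llama omits)."""
    embed = vocab_size * hidden                 # token embedding table
    attn = 4 * hidden * hidden                  # q, k, v, o projections
    mlp = 3 * hidden * intermediate             # gate, up, down (SwiGLU)
    norms = 2 * hidden                          # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    lm_head = 0 if tied_embeddings else vocab_size * hidden
    return embed + n_layers * per_layer + hidden + lm_head  # +hidden: final norm

total = llama_param_count(vocab_size=32_000, hidden=768,
                          intermediate=2048, n_layers=12)
print(f"{total / 1e9:.2f}B parameters")  # lands in the ~0.1B range
```

Running a quick estimate like this before training helps verify that a chosen configuration actually hits the intended parameter budget.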

Quick Start & Requirements

  • Installation: Requires CUDA 12.1, PyTorch, Transformers, and DeepSpeed. A `requirements.txt` file is provided.
  • Resources: Training was conducted on 2x A40 GPUs, with pre-training taking approximately 2 days.
  • Data & Weights: Pre-trained weights, SFT weights, RLHF weights, and training/fine-tuning/RLHF environment images are available.

Highlighted Details

  • Comprehensive pipeline: Data collection, cleaning, tokenizer training, pre-training, SFT, and RLHF.
  • Data sources: Includes Chinese Wikipedia, Baidu Baike, and SkyPile-150B, with specific cleaning and deduplication steps.
  • Tokenizer: A 32,000-token vocabulary was trained, referencing Llama and Qwen conventions.
  • RLHF implementation: Based on DeepSpeed-Chat, with a Reward Model achieving 92% accuracy.
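The DeepSpeed-Chat approach the project references trains its Reward Model on preference pairs with a pairwise ranking loss: the model scores a chosen and a rejected response, and training minimizes `-log(sigmoid(r_chosen - r_rejected))`. The sketch below is an illustrative, dependency-free version of that loss, not code from this repository:

```python
import math

# Pairwise ranking loss used by DeepSpeed-Chat-style reward models
# (illustrative sketch, not the repo's implementation).

def reward_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed stably as softplus(-x)."""
    x = r_chosen - r_rejected
    # Branch avoids overflow in exp() for large |x|.
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))

# Correct ranking (chosen scored higher) yields a small loss;
# an inverted ranking yields a large one.
print(reward_pair_loss(2.0, 0.5))   # small
print(reward_pair_loss(0.5, 2.0))   # large
```

The reported 92% accuracy would correspond to how often `r_chosen > r_rejected` holds on held-out preference pairs.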

Maintenance & Community

The project is a personal endeavor, with no specific community channels or roadmap mentioned.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The project notes that the 0.1B model size and limited SFT data resulted in suboptimal dialogue capabilities. RLHF results were also unsatisfactory, with model performance degrading over training steps, potentially due to learning rate or data quality issues.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days
