Replicate ChatGPT's technical pipeline from scratch
Top 98.4% on SourcePulse
This project aims to replicate the technical pipeline of ChatGPT from scratch, targeting researchers and developers interested in understanding and reproducing large language model training. It provides a complete workflow from data collection and cleaning to pre-training, instruction fine-tuning, and RLHF, enabling users to experiment with scaling and optimization.
How It Works
The project follows a standard LLM training methodology, referencing Llama's architecture and Hugging Face's transformers library. It processes 10B tokens for pre-training, uses 300k instruction-following examples for SFT, and 100k examples for RLHF. The chosen model size is 0.1B parameters, with an emphasis on making the code and workflow functional, allowing users to scale up with more data and larger models for improved performance.
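As a rough illustration of the model scale involved, the sketch below builds a Llama-style model of roughly 0.1B parameters with Hugging Face's transformers library. The specific hyperparameters (hidden size, layer count, vocabulary size) are assumptions for the example, not the project's actual configuration.

```python
# Minimal sketch of a ~0.1B-parameter Llama-style model using transformers.
# Hyperparameters are illustrative assumptions, not the project's real config.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,           # assumed tokenizer vocabulary size
    hidden_size=768,
    intermediate_size=2_048,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=12,
    max_position_embeddings=1_024,
)
model = LlamaForCausalLM(config)

# Count parameters to confirm the model lands near the 0.1B target.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e9:.2f}B")
```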
Quick Start & Requirements
Requirements include CUDA 12.1, pytorch, transformers, and deepspeed. A requirements.txt file is provided.
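Before running anything, it can help to confirm that the listed packages import and that a CUDA runtime is visible to PyTorch. The snippet below is only an illustrative sanity check, not part of the project's setup scripts.

```python
# Illustrative environment check (not from the project); verifies the packages
# named in the requirements and reports the CUDA version PyTorch was built with.
import torch
import transformers
import deepspeed

print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)
```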
Maintenance & Community
The project is a personal endeavor, with no specific community channels or roadmap mentioned. The repository was last updated about a year ago and is currently inactive.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.
Limitations & Caveats
The project notes that the 0.1B model size and limited SFT data resulted in suboptimal dialogue capabilities. RLHF results were also unsatisfactory, with model performance degrading over training steps, potentially due to learning rate or data quality issues.