AI-Study-Han: Replicate ChatGPT's technical pipeline from scratch
Top 96.7% on SourcePulse
This project aims to replicate the technical pipeline of ChatGPT from scratch, targeting researchers and developers interested in understanding and reproducing large language model training. It provides a complete workflow from data collection and cleaning to pre-training, instruction fine-tuning, and RLHF, enabling users to experiment with scaling and optimization.
How It Works
The project follows a standard LLM training methodology, referencing Llama's architecture and Hugging Face's transformers library. It pre-trains on 10B tokens, runs SFT on 300k instruction-following examples, and RLHF on 100k examples. The chosen model size is 0.1B parameters, with the emphasis on a working end-to-end codebase; users can scale up with more data and larger models for better performance.
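As a rough sanity check on the 0.1B figure, a Llama-style decoder's parameter count can be estimated from plausible hyperparameters. All values below are assumptions for illustration, not the repo's actual configuration:

```python
# Back-of-the-envelope parameter count for a Llama-style decoder.
# vocab size, hidden dim, MLP dim, and layer count are all assumed.
vocab, d, d_ff, layers = 32000, 768, 2048, 12

embed = vocab * d          # token embeddings (output head assumed tied)
attn = 4 * d * d           # Q, K, V, O projections per layer
mlp = 3 * d * d_ff         # gate, up, and down projections per layer
norms = 2 * d              # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + d  # plus the final RMSNorm
print(f"~{total / 1e9:.2f}B parameters")  # → ~0.11B parameters
```

With these assumed dimensions the count lands near 0.11B, consistent with the project's stated 0.1B scale.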
Quick Start & Requirements
Requires CUDA 12.1, PyTorch, Transformers, and DeepSpeed; a requirements.txt file is provided.
Maintenance & Community
The project is a personal endeavor, with no community channels or roadmap mentioned; the repository was last updated about a year ago and is marked inactive.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.
Limitations & Caveats
The project notes that the 0.1B model size and limited SFT data resulted in suboptimal dialogue capabilities. RLHF results were also unsatisfactory, with model performance degrading over training steps, potentially due to learning rate or data quality issues.
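The learning-rate hypothesis above is commonly addressed with a more conservative schedule during RLHF. A minimal sketch of linear warmup followed by cosine decay; the peak learning rate and step counts are illustrative assumptions, not the repo's actual settings:

```python
import math

def cosine_with_warmup(step, warmup=100, total=1000, peak_lr=1e-5):
    """Learning rate at `step`: linear warmup to peak_lr, then cosine
    decay to zero. All hyperparameters here are assumed for illustration."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

A schedule like this caps the peak rate and tapers updates late in training, which is one standard way to reduce the kind of step-by-step degradation described above.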