Chinese-centric LLM trained from scratch on 1T tokens
Steel-LLM is a personal project focused on training a 1-billion-parameter, Chinese-centric Large Language Model (LLM) from scratch on 1 trillion tokens. It aims to democratize LLM training by providing a detailed, open-source account of the entire process, so that individuals with moderate GPU resources (from 8 up to a few dozen cards) can replicate the work. The project offers a fully trained model and comprehensive documentation covering data collection, data processing, and the training framework.
How It Works
The model is built on the Qwen1.5 architecture, with modifications including a softmax ("soft") MoE in the FFN layers, which trains faster at a similar parameter count, and dual SwiGLU layers. The training framework is an enhanced version of TinyLlama's, adding HuggingFace model compatibility, robust checkpointing, data-consistency checks, and the ability to append new data without disrupting existing training progress. The training corpus is diverse and over 80% Chinese, which drives the model's strong performance on Chinese benchmarks.
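To make the FFN modification concrete, below is a minimal PyTorch sketch of a softmax-weighted ("soft") MoE whose experts are SwiGLU blocks; the class names, dimensions, expert count, and routing details are illustrative assumptions, not Steel-LLM's actual implementation.

```python
# Illustrative sketch only: a soft (softmax-weighted) MoE FFN with SwiGLU experts.
# Names, dimensions, expert count, and routing are assumptions, not Steel-LLM's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU-gated projection followed by a down projection.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class SoftMoEFFN(nn.Module):
    """Every expert is evaluated and mixed with softmax weights (no hard top-k routing)."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLU(hidden_size, intermediate_size) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.router(x), dim=-1)                     # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, T, E, H)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)        # (B, T, H)


# Quick shape check.
ffn = SoftMoEFFN(hidden_size=512, intermediate_size=1408, num_experts=4)
print(ffn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```

Because every expert runs on every token, a soft MoE keeps the compute profile dense while letting capacity grow with the number of experts.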
Quick Start & Requirements
Inference uses the modelscope library; required dependencies include modelscope and torch.
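A minimal inference sketch with the modelscope library is shown below; the zhanshijin/Steel-LLM repo id mirrors the HuggingFace repository named under Licensing, and whether the same id (or the chat variant) is the correct one on ModelScope is an assumption.

```python
# Minimal inference sketch with modelscope; the repo id mirrors the HuggingFace
# repository named below, and the exact ModelScope id and prompt format are assumptions.
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_id = "zhanshijin/Steel-LLM"  # assumed id; swap in the chat variant if preferred

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "请简单介绍一下大语言模型。"  # "Briefly introduce large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For conversational use, load the chat variant and apply its chat prompt format instead of the raw prompt above.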
Highlighted Details
The chat-tuned model is released as Steel-LLM-chat-v2.
Maintenance & Community
The project is actively updated, with recent work focusing on reinforcement learning and model fine-tuning. A WeChat group is available for community discussion.
Licensing & Compatibility
The model is available on HuggingFace under the zhanshijin/Steel-LLM
repository. Licensing details for commercial use are not explicitly stated in the README.
Limitations & Caveats
The project primarily focuses on Chinese language performance, with less emphasis on English benchmarks. While the training process is detailed, replicating the full 1T token training requires significant computational resources and time.