Steel-LLM by zhanshijinwat

Chinese-centric LLM trained from scratch on 1T tokens, with an accompanying research paper

created 1 year ago
703 stars

Top 49.6% on sourcepulse

Project Summary

Steel-LLM is a personal project focused on training a 1-billion-parameter, Chinese-centric Large Language Model (LLM) from scratch on 1 trillion tokens. It aims to democratize LLM training by providing a detailed, open-source account of the entire process, so that individuals with moderate GPU resources (from 8 up to a few dozen cards) can replicate the work. The project offers a fully trained model and comprehensive documentation covering data collection, data processing, and the training framework.

How It Works

The model is built on the Qwen1.5 architecture, modified to use a softmax MoE in the FFN layers (for faster training at a similar parameter count) and dual SwiGLU layers. The training framework is an enhanced version of TinyLlama's, adding HuggingFace model compatibility, robust checkpointing, data consistency checks, and the ability to append new data without disrupting existing training progress. The model is trained on a diverse dataset that is over 80% Chinese, yielding strong performance on Chinese benchmarks.
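The README summary above does not include reference code for this FFN design, so the following is a minimal PyTorch sketch of the general idea, assuming per-token softmax routing over a small set of SwiGLU experts. The class names, expert count, and dimension split are illustrative assumptions, not Steel-LLM's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN unit: SiLU(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SoftmaxMoEFFN(nn.Module):
    """Softmax-weighted mixture of small SwiGLU experts (illustrative).

    Every expert processes every token; outputs are blended with softmax
    gate weights, so the layer stays dense and fully differentiable.
    Splitting hidden_dim across experts keeps the total parameter count
    comparable to a single large FFN.
    """
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLU(dim, hidden_dim // num_experts) for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> weights: (batch, seq, num_experts)
        weights = F.softmax(self.router(x), dim=-1)
        # Stack expert outputs: (batch, seq, dim, num_experts)
        outs = torch.stack([expert(x) for expert in self.experts], dim=-1)
        # Blend experts per token and return (batch, seq, dim)
        return (outs * weights.unsqueeze(-2)).sum(-1)
```

Unlike sparse top-k MoE routing, this dense ("soft") formulation needs no load-balancing loss or token dropping, at the cost of running every expert on every token.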

Quick Start & Requirements

  • Install/Run: Use the modelscope library for inference (a minimal sketch follows this list).
  • Prerequisites: Python, modelscope, torch.
  • Hardware: Training utilized 8x H800 80GB GPUs for ~30 days or 8x A100 80GB GPUs for ~60 days.
  • Links: Modelscope Inference, Technical Report
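A minimal inference sketch using modelscope, assuming the ModelScope model id mirrors the HuggingFace repo name (zhanshijin/Steel-LLM) and that the Qwen1.5-derived tokenizer ships a chat template; verify both against the Modelscope Inference link above.

```python
# Hedged sketch: the model id and chat-template support are assumptions,
# not confirmed by the README -- check the project's ModelScope page.
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_id = "zhanshijin/Steel-LLM"  # assumed to mirror the HuggingFace repo

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello, please introduce yourself."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated reply.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```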

Highlighted Details

  • Steel-LLM-chat-v2 scores 41.9 on CEVAL and 36.08 on CMMLU.
  • Model architecture based on Qwen1.5 with softmax MoE and dual SwiGLU.
  • Training process detailed in a series of blog posts and a technical report.
  • Accepted as an ICLR 2025 workshop paper.

Maintenance & Community

The project is actively updated, with recent work focusing on reinforcement learning and model fine-tuning. A WeChat group is available for community discussion.

Licensing & Compatibility

The model is available on HuggingFace under the zhanshijin/Steel-LLM repository. Licensing details for commercial use are not explicitly stated in the README.

Limitations & Caveats

The project primarily targets Chinese-language performance, with less emphasis on English benchmarks. While the training process is thoroughly documented, replicating the full 1T-token run requires significant compute and time.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 78 stars in the last 90 days
