Steel-LLM by zhanshijinwat

Chinese-centric LLM trained from scratch on 1T tokens, with an accompanying research paper

created 1 year ago
703 stars

Top 49.6% on sourcepulse

Project Summary

Steel-LLM is a personal project focused on training a 1-billion-parameter, Chinese-centric Large Language Model (LLM) from scratch on 1 trillion tokens. It aims to democratize LLM training by providing a detailed, open-source account of the entire process, so that individuals with moderate GPU resources (from 8 up to a few dozen cards) can replicate the work. The project offers a fully trained model and comprehensive documentation covering data collection, data processing, and the training framework.

How It Works

The model is built on the Qwen1.5 architecture, modified to use a softmax MoE in the FFN layers (for faster training at a similar parameter count) and dual SwiGLU layers. The training framework is an enhanced version of TinyLlama's, adding HuggingFace model compatibility, robust checkpointing, data consistency checks, and the ability to append new data without disrupting existing training progress. The model is trained on a diverse dataset that is over 80% Chinese, yielding strong performance on Chinese benchmarks.
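The README summary above does not include reference code for this FFN design, so the following is a minimal PyTorch sketch of the general idea, assuming per-token softmax routing over a small set of SwiGLU experts. The class names, expert count, and dimension split are illustrative assumptions, not Steel-LLM's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN unit: SiLU(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SoftmaxMoEFFN(nn.Module):
    """Softmax-weighted mixture of small SwiGLU experts (illustrative).

    Every expert processes every token; outputs are blended with softmax
    gate weights, so the layer stays dense and fully differentiable.
    Splitting hidden_dim across experts keeps the total parameter count
    comparable to a single large FFN.
    """
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLU(dim, hidden_dim // num_experts) for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> weights: (batch, seq, num_experts)
        weights = F.softmax(self.router(x), dim=-1)
        # Stack expert outputs: (batch, seq, dim, num_experts)
        outs = torch.stack([expert(x) for expert in self.experts], dim=-1)
        # Blend experts per token and return (batch, seq, dim)
        return (outs * weights.unsqueeze(-2)).sum(-1)
```

Unlike sparse top-k MoE routing, this dense ("soft") formulation needs no load-balancing loss or token dropping, at the cost of running every expert on every token.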

Quick Start & Requirements

  • Install/Run: Use the modelscope library for inference (a minimal sketch follows this list).
  • Prerequisites: Python, modelscope, torch.
  • Hardware: Training utilized 8x H800 80GB GPUs for ~30 days or 8x A100 80GB GPUs for ~60 days.
  • Links: Modelscope Inference, Technical Report
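A minimal inference sketch using modelscope, assuming the ModelScope model id mirrors the HuggingFace repo name (zhanshijin/Steel-LLM) and that the Qwen1.5-derived tokenizer ships a chat template; verify both against the Modelscope Inference link above.

```python
# Hedged sketch: the model id and chat-template support are assumptions,
# not confirmed by the README -- check the project's ModelScope page.
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_id = "zhanshijin/Steel-LLM"  # assumed to mirror the HuggingFace repo

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello, please introduce yourself."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated reply.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```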

Highlighted Details

  • Steel-LLM-chat-v2 scores 41.9 on CEVAL and 36.08 on CMMLU.
  • Model architecture based on Qwen1.5 with softmax MoE and dual SwiGLU.
  • Training process detailed in a series of blog posts and a technical report.
  • Accepted as an ICLR 2025 workshop paper.

Maintenance & Community

The project is actively updated, with recent work focusing on reinforcement learning and model fine-tuning. A WeChat group is available for community discussion.

Licensing & Compatibility

The model is available on HuggingFace under the zhanshijin/Steel-LLM repository. Licensing details for commercial use are not explicitly stated in the README.

Limitations & Caveats

The project primarily targets Chinese-language performance, with less emphasis on English benchmarks. While the training process is thoroughly documented, replicating the full 1T-token run requires significant compute and time.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 78 stars in the last 90 days
