baby-llama2-chinese by DLLXW

LLM pretraining/SFT repo for small Chinese Llama2 models

Created 2 years ago
2,852 stars

Top 16.7% on SourcePulse

View on GitHub
Project Summary

This repository provides a framework for pre-training and fine-tuning small-parameter Chinese Llama 2 models, targeting LLM beginners. It offers a complete pipeline from data processing to model evaluation, enabling users to train a functional Chinese chatbot with as little as 24GB of VRAM.

How It Works

The project uses the ChatGLM2-6B tokenizer for its compact ~64k vocabulary, which is well suited to Chinese text. It supports pre-training on large Chinese corpora (up to 63.4 billion tokens) and fine-tuning on instruction datasets such as Alpaca-Zh and medical-domain data. Because the models are small, the project relies on full-parameter fine-tuning, with plans to add parameter-efficient methods for larger models.
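
As a quick sanity check on this tokenizer choice, the sketch below loads the ChatGLM2-6B tokenizer through Hugging Face transformers and round-trips a short Chinese sentence. This is a minimal illustration, assuming the transformers package and access to the THUDM/chatglm2-6b repository; the project bundles its own tokenizer files, so the loading path it actually uses may differ.

    # Minimal sketch: inspect the ChatGLM2-6B tokenizer the project builds on.
    # Assumes the Hugging Face transformers package and network access to the
    # THUDM/chatglm2-6b repo (the project ships its own tokenizer files, so
    # its loading path differs).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    print(tokenizer.vocab_size)              # roughly 64k entries, compact for Chinese

    ids = tokenizer.encode("今天天气不错")     # tokenize a short Chinese sentence
    print(ids)
    print(tokenizer.decode(ids))             # round-trip back to text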

Quick Start & Requirements

  1. Download pre-processed corpus from Baidu Netdisk (63.4B tokens, 118GB).
  2. Place data in ./data/, modify data_process.py, and run python data_process.py to create pretrain_data.bin.
  3. Adjust model parameters in pretrain.py to match the available hardware (e.g., 4x RTX 3090); a rough parameter-count sketch follows this list.
  4. Run pre-training: torchrun --standalone --nproc_per_node=4 pretrain.py.
  5. Process SFT data: python sft_data_process.py.
  6. Run SFT fine-tuning: python sft.py.
  7. Evaluate: python eval.py.
    • Requires Python and PyTorch; screen is optional for running long jobs in the background.
    • Official corpus: Baidu Netdisk (extraction code: 6unr).
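
Before launching step 4, it can help to confirm that the configuration chosen in step 3 actually lands in the intended size class. The following is a back-of-the-envelope parameter estimate for a Llama-style decoder with untied input and output embeddings; the variable names and the example configuration are illustrative assumptions, not the actual knobs in pretrain.py.

    # Rough parameter estimate for a small Llama-style decoder.
    # The names below and the ~64k default vocabulary (ChatGLM2 tokenizer) are
    # assumptions for illustration; the real settings live in pretrain.py.

    def estimate_params(dim: int, n_layers: int, vocab_size: int = 64793) -> int:
        hidden = int(8 / 3 * dim)                      # approximate SwiGLU intermediate size
        per_layer = 4 * dim * dim + 3 * dim * hidden   # attention + MLP weight matrices
        embeddings = 2 * vocab_size * dim              # token embedding + untied output head
        return n_layers * per_layer + embeddings       # norm weights are negligible

    # An assumed configuration in the ~92M range:
    print(f"{estimate_params(dim=512, n_layers=8) / 1e6:.0f}M parameters")

If the estimated model plus optimizer state does not fit on the available GPUs, shrinking dim or n_layers, lowering the batch size, or raising gradient accumulation are the usual levers.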

Highlighted Details

  • Offers pre-trained models ranging from 92M to 218M parameters, trained on up to 63.4B tokens.
  • Includes specific fine-tuned models for medical domain Q&A (e.g., Llama2-Chinese-218M-v3-MedicalChat).
  • Provides data cleaning scripts for filtering short texts and deduplicating with MinHash/SimHash (see the sketch after this list).
  • Demonstrates model performance through example outputs for continuation and Q&A tasks.
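
To make the deduplication bullet concrete, here is an illustrative MinHash near-duplicate filter built on the third-party datasketch package. It is a sketch in the spirit of the repo's cleaning scripts rather than their actual implementation; the character 3-gram shingling, the 0.7 similarity threshold, and the short-text cutoff are all assumptions.

    # Illustrative near-duplicate filter with MinHash + LSH (datasketch package).
    # Not the repo's actual script; shingle size, threshold, and length cutoff
    # are assumptions chosen for this example.
    from datasketch import MinHash, MinHashLSH

    def minhash_of(text: str, num_perm: int = 128) -> MinHash:
        m = MinHash(num_perm=num_perm)
        for i in range(len(text) - 2):              # character 3-gram shingles
            m.update(text[i:i + 3].encode("utf-8"))
        return m

    docs = {
        "a": "今天天气不错，适合出门散步。",
        "b": "今天天气不错，适合出门散步！",          # near-duplicate of "a"
        "c": "大模型预训练需要大量高质量中文语料。",
    }

    lsh = MinHashLSH(threshold=0.7, num_perm=128)
    kept = []
    for key, text in docs.items():
        if len(text) < 10:                          # drop very short texts
            continue
        mh = minhash_of(text)
        if lsh.query(mh):                           # a near-duplicate was already kept
            continue
        lsh.insert(key, mh)
        kept.append(key)

    print(kept)                                     # typically ["a", "c"]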

Maintenance & Community

  • Active development with recent updates in Jan-May 2024, including new models and data cleaning features.
  • QQ Group: 716455397 for community engagement.

Licensing & Compatibility

  • The repository does not explicitly state a license, so reuse terms are undefined rather than permissive. Model weights are distributed via Baidu Netdisk links with extraction codes and appear intended for research use, though no formal terms are given.

Limitations & Caveats

  • The project is aimed primarily at LLM beginners and does not incorporate advanced large-scale training frameworks (DeepSpeed and Megatron are noted as not used), so the current training setup is not heavily optimized.
  • SFT model evaluation is based on limited examples, with noted performance degradation on general tasks after domain-specific fine-tuning.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 16 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp and comma.ai), and 20 more.

TinyLlama by jzhang38 (Top 0.1%, 9k stars)
Tiny pretraining project for a 1.1B Llama model
Created 2 years ago, updated 1 year ago