baby-llama2-chinese by DLLXW

LLM pretraining/SFT repo for small Chinese Llama2 models

created 2 years ago
2,827 stars

Top 17.2% on sourcepulse

Project Summary

This repository provides a framework for pre-training and fine-tuning small Chinese Llama 2 models, aimed at LLM beginners. It covers the complete pipeline from data processing to model evaluation, enabling users to train a functional Chinese chatbot with as little as 24GB of VRAM.

How It Works

The project uses the ChatGLM2-6B tokenizer, whose 64k vocabulary is well suited to Chinese text. It supports pre-training on large Chinese corpora (up to 63.4 billion tokens) and fine-tuning on instruction datasets such as Alpaca-Zh as well as medical-domain data. Because the models have relatively few parameters, the project relies on full-parameter fine-tuning, with plans to add parameter-efficient methods for larger models.
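
A minimal sketch of this tokenizer choice, assuming the Hugging Face transformers library and the THUDM/chatglm2-6b checkpoint (the repository itself may load or cache the tokenizer differently):

    # Sketch only: load the ChatGLM2-6B tokenizer and encode a short Chinese sentence.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

    text = "今天天气不错，适合训练一个小模型。"  # "Nice weather today, good for training a small model."
    ids = tokenizer.encode(text, add_special_tokens=False)

    print(len(tokenizer))  # vocabulary size, roughly 64k entries
    print(ids)             # compact token ids thanks to the Chinese-oriented vocabulary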

Quick Start & Requirements

  1. Download pre-processed corpus from Baidu Netdisk (63.4B tokens, 118GB).
  2. Place data in ./data/, modify data_process.py as needed, and run python data_process.py to create pretrain_data.bin (the token-packing idea is sketched after this list).
  3. Adjust model parameters in pretrain.py based on available hardware (e.g., 4x 3090).
  4. Run pre-training: torchrun --standalone --nproc_per_node=4 pretrain.py.
  5. Process SFT data: python sft_data_process.py.
  6. Run SFT fine-tuning: python sft.py.
  7. Evaluate: python eval.py.
    • Requires Python and PyTorch; screen is optional for keeping training running in the background.
    • Official corpus: Baidu Netdisk (extraction code: 6unr).
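
As referenced in step 2, the sketch below illustrates the general token-packing idea behind a pretrain_data.bin-style file. It is not the repository's actual data_process.py; the input path, the one-document-per-line format, and the uint16 dtype (valid because the 64k vocabulary fits in 16 bits) are assumptions.

    # Illustrative token packing for pre-training; not the repo's actual script.
    import numpy as np
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    EOS_ID = tokenizer.eos_token_id  # assumed document separator

    def pack_corpus(input_path: str, output_path: str) -> None:
        """Tokenize a plain-text corpus (one document per line) into a flat uint16 array."""
        all_ids = []
        with open(input_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                all_ids.extend(tokenizer.encode(line, add_special_tokens=False))
                if EOS_ID is not None:
                    all_ids.append(EOS_ID)
        np.array(all_ids, dtype=np.uint16).tofile(output_path)

    pack_corpus("./data/corpus.txt", "./data/pretrain_data.bin")  # hypothetical paths

At training time, a file packed this way can be memory-mapped and sliced into fixed-length sequences, which keeps RAM usage low even for multi-billion-token corpora.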

Highlighted Details

  • Offers pre-trained models ranging from 92M to 218M parameters, trained on up to 63.4B tokens.
  • Includes specific fine-tuned models for medical domain Q&A (e.g., Llama2-Chinese-218M-v3-MedicalChat).
  • Provides data-cleaning scripts that filter short texts and deduplicate with MinHash/SimHash (see the sketch after this list).
  • Demonstrates model performance through example outputs for continuation and Q&A tasks.
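
For the deduplication mentioned above, here is a hedged sketch using the third-party datasketch library on character 3-grams; the repository's own cleaning scripts may be implemented quite differently.

    # Near-duplicate filtering with MinHash + LSH (datasketch); illustrative only.
    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 128

    def signature(text: str) -> MinHash:
        """MinHash signature over character 3-grams (a common shingling choice for Chinese)."""
        m = MinHash(num_perm=NUM_PERM)
        for i in range(max(len(text) - 2, 1)):
            m.update(text[i:i + 3].encode("utf-8"))
        return m

    def dedup(docs, threshold=0.8):
        """Keep a document only if no previously kept document is near-identical."""
        lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
        kept = []
        for idx, doc in enumerate(docs):
            sig = signature(doc)
            if lsh.query(sig):  # a similar document was already kept
                continue
            lsh.insert(str(idx), sig)
            kept.append(doc)
        return kept

    cleaned = dedup(["同一句话。", "同一句话。", "完全不同的另一句话。"])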

Maintenance & Community

  • Active development with recent updates in Jan-May 2024, including new models and data cleaning features.
  • QQ Group: 716455397 for community engagement.

Licensing & Compatibility

  • The repository does not state an explicit license, so usage and redistribution terms are unclear. Model weights are distributed via Baidu Netdisk links with extraction codes.

Limitations & Caveats

  • The project is aimed primarily at LLM beginners and may not incorporate the latest large-scale training techniques; DeepSpeed and Megatron are noted as not used in the current setup.
  • SFT model evaluation is based on limited examples, with noted performance degradation on general tasks after domain-specific fine-tuning.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 68 stars in the last 90 days
