baby-llama2-chinese by DLLXW

LLM pretraining/SFT repo for small Chinese Llama2 models

created 2 years ago
2,827 stars

Top 17.2% on sourcepulse

Project Summary

This repository provides a framework for pre-training and fine-tuning small Chinese Llama 2 models, aimed at LLM beginners. It covers the complete pipeline from data processing to model evaluation, enabling users to train a functional Chinese chatbot with as little as 24GB of VRAM.

How It Works

The project uses the ChatGLM2-6B tokenizer, whose 64k vocabulary is well suited to Chinese text. It supports pre-training on large Chinese corpora (up to 63.4 billion tokens) and fine-tuning on instruction datasets such as Alpaca-Zh as well as medical-domain data. Because the models have relatively few parameters, the project relies on full-parameter fine-tuning, with plans to add parameter-efficient methods for larger models.
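
A minimal sketch of this tokenizer choice, assuming the Hugging Face transformers library and the THUDM/chatglm2-6b checkpoint (the repository itself may load or cache the tokenizer differently):

    # Sketch only: load the ChatGLM2-6B tokenizer and encode a short Chinese sentence.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

    text = "今天天气不错，适合训练一个小模型。"  # "Nice weather today, good for training a small model."
    ids = tokenizer.encode(text, add_special_tokens=False)

    print(len(tokenizer))  # vocabulary size, roughly 64k entries
    print(ids)             # compact token ids thanks to the Chinese-oriented vocabulary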

Quick Start & Requirements

  1. Download pre-processed corpus from Baidu Netdisk (63.4B tokens, 118GB).
  2. Place data in ./data/, modify data_process.py as needed, and run python data_process.py to create pretrain_data.bin (the token-packing idea is sketched after this list).
  3. Adjust model parameters in pretrain.py based on available hardware (e.g., 4x 3090).
  4. Run pre-training: torchrun --standalone --nproc_per_node=4 pretrain.py.
  5. Process SFT data: python sft_data_process.py.
  6. Run SFT fine-tuning: python sft.py.
  7. Evaluate: python eval.py.
    • Requires Python and PyTorch; screen is optional for keeping training running in the background.
    • Official corpus: Baidu Netdisk (extraction code: 6unr).
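
As referenced in step 2, the sketch below illustrates the general token-packing idea behind a pretrain_data.bin-style file. It is not the repository's actual data_process.py; the input path, the one-document-per-line format, and the uint16 dtype (valid because the 64k vocabulary fits in 16 bits) are assumptions.

    # Illustrative token packing for pre-training; not the repo's actual script.
    import numpy as np
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    EOS_ID = tokenizer.eos_token_id  # assumed document separator

    def pack_corpus(input_path: str, output_path: str) -> None:
        """Tokenize a plain-text corpus (one document per line) into a flat uint16 array."""
        all_ids = []
        with open(input_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                all_ids.extend(tokenizer.encode(line, add_special_tokens=False))
                if EOS_ID is not None:
                    all_ids.append(EOS_ID)
        np.array(all_ids, dtype=np.uint16).tofile(output_path)

    pack_corpus("./data/corpus.txt", "./data/pretrain_data.bin")  # hypothetical paths

At training time, a file packed this way can be memory-mapped and sliced into fixed-length sequences, which keeps RAM usage low even for multi-billion-token corpora.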

Highlighted Details

  • Offers pre-trained models ranging from 92M to 218M parameters, trained on up to 63.4B tokens.
  • Includes specific fine-tuned models for medical domain Q&A (e.g., Llama2-Chinese-218M-v3-MedicalChat).
  • Provides data-cleaning scripts that filter short texts and deduplicate with MinHash/SimHash (see the sketch after this list).
  • Demonstrates model performance through example outputs for continuation and Q&A tasks.
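
For the deduplication mentioned above, here is a hedged sketch using the third-party datasketch library on character 3-grams; the repository's own cleaning scripts may be implemented quite differently.

    # Near-duplicate filtering with MinHash + LSH (datasketch); illustrative only.
    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 128

    def signature(text: str) -> MinHash:
        """MinHash signature over character 3-grams (a common shingling choice for Chinese)."""
        m = MinHash(num_perm=NUM_PERM)
        for i in range(max(len(text) - 2, 1)):
            m.update(text[i:i + 3].encode("utf-8"))
        return m

    def dedup(docs, threshold=0.8):
        """Keep a document only if no previously kept document is near-identical."""
        lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
        kept = []
        for idx, doc in enumerate(docs):
            sig = signature(doc)
            if lsh.query(sig):  # a similar document was already kept
                continue
            lsh.insert(str(idx), sig)
            kept.append(doc)
        return kept

    cleaned = dedup(["同一句话。", "同一句话。", "完全不同的另一句话。"])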

Maintenance & Community

  • Active development with recent updates in Jan-May 2024, including new models and data cleaning features.
  • QQ Group: 716455397 for community engagement.

Licensing & Compatibility

  • The repository does not state an explicit license, so usage and redistribution terms are unclear. Model weights are distributed via Baidu Netdisk links with extraction codes.

Limitations & Caveats

  • The project is aimed primarily at LLM beginners and may not incorporate the latest large-scale training techniques; DeepSpeed and Megatron are noted as not used in the current setup.
  • SFT model evaluation is based on limited examples, with noted performance degradation on general tasks after domain-specific fine-tuning.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 68 stars in the last 90 days
