Chinese-Mixtral-8x7B by HIT-SCIR

Chinese-Mixtral-8x7B: a base model for the Chinese language

created 1 year ago
650 stars

Top 52.3% on sourcepulse

View on GitHub
Project Summary

This project provides Chinese-Mixtral-8x7B, an open-source Mixture-of-Experts (MoE) large language model enhanced for Chinese language processing. It offers a base model and fine-tuned versions for researchers and developers who want to apply MoE architectures to Chinese NLP tasks; key benefits include improved Chinese tokenization efficiency and strong performance on both Chinese and English benchmarks.

How It Works

The project builds on Mistral AI's Mixtral-8x7B architecture by expanding its vocabulary with a custom Chinese token set trained with SentencePiece on Chinese datasets; the expanded vocabulary significantly improves Chinese tokenization efficiency. Incremental pre-training is then performed on the modified model with a large corpus of Chinese and English data, including SkyPile and SlimPajama, to give it strong Chinese understanding and generation capabilities. Training uses QLoRA for efficiency, combining 4-bit quantization with other memory-saving techniques.
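
As a rough illustration of the QLoRA recipe described above, the sketch below loads a 4-bit NF4-quantized base model and attaches LoRA adapters with peft. The repo ID, target modules, and hyperparameters are illustrative assumptions, not the project's published training configuration.

```python
# Minimal QLoRA sketch: 4-bit NF4 base model + LoRA adapters via peft.
# Repo ID and hyperparameters are illustrative assumptions, not the
# project's exact training setup.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "HIT-SCIR/Chinese-Mixtral-8x7B"  # assumed HuggingFace repo ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only, for brevity
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```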

Quick Start & Requirements

  • Install: pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 transformers==4.36.2 datasets evaluate peft accelerate gradio optimum sentencepiece trl jupyterlab scikit-learn pandas matplotlib tensorboard nltk rouge bitsandbytes fire
  • Prerequisites: Python 3.10+, CUDA Toolkit (e.g., 11.7.1), DeepSpeed, Flash Attention.
  • Model Download: Full model (88GB) or LoRA adapter (2.7GB) available on HuggingFace/ModelScope.
  • Resources: Inference requires significant GPU memory; 4-bit quantization via bitsandbytes reduces the footprint (see the sketch after this list). Training requires substantial resources and a multi-GPU setup.
  • Docs: HuggingFace, ModelScope
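
A minimal load-and-generate sketch, assuming the base model is published under the HuggingFace repo ID used below and that 4-bit quantization is used to fit the weights into available GPU memory; check the official model cards for exact names and hardware requirements.

```python
# Minimal inference sketch with 4-bit quantization (bitsandbytes).
# The repo ID is an assumption; consult the project's HuggingFace/ModelScope
# pages for the exact model names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "HIT-SCIR/Chinese-Mixtral-8x7B"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

# Base (non-instruction-tuned) model: prompt it as a text-completion model.
prompt = "人工智能的发展历史可以追溯到"  # "The history of AI can be traced back to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```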

Highlighted Details

  • Achieves state-of-the-art English performance among comparable Chinese-enhanced models, with strong Chinese understanding and generation capabilities.
  • Demonstrates a 41.5% improvement in Chinese tokenization efficiency compared to the original Mixtral-8x7B (a measurement sketch follows this list).
  • Training code for both incremental pre-training and instruction fine-tuning is open-sourced.
  • Supports acceleration via vLLM and Flash Attention 2, and quantization with bitsandbytes.
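
The tokenization-efficiency figure above can be sanity-checked by counting tokens produced for the same Chinese text under both tokenizers. The sketch below assumes the expanded tokenizer ships with the (assumed) HuggingFace repo; the exact reduction depends on the evaluation corpus.

```python
# Rough sanity check of Chinese tokenization efficiency: compare token counts
# for the same text under the original and expanded tokenizers.
# Fewer tokens for the same text means higher efficiency. Repo ID for the
# expanded tokenizer is an assumption.
from transformers import AutoTokenizer

text = "哈尔滨工业大学位于中国黑龙江省哈尔滨市。"  # "Harbin Institute of Technology is located in Harbin, Heilongjiang, China."

base = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
expanded = AutoTokenizer.from_pretrained("HIT-SCIR/Chinese-Mixtral-8x7B")

n_base = len(base.tokenize(text))
n_expanded = len(expanded.tokenize(text))
print(f"original Mixtral tokenizer: {n_base} tokens")
print(f"expanded Chinese tokenizer: {n_expanded} tokens")
print(f"reduction: {1 - n_expanded / n_base:.1%}")
```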

Maintenance & Community

  • The project is maintained by HIT-SCIR.
  • Recent releases include instruction-tuned models ("Huozi 3.0") and open-sourced fine-tuning code.

Licensing & Compatibility

  • The README does not explicitly state a license for the model weights; the upstream Mixtral-8x7B base is released under Apache 2.0. Verify the license on the project's HuggingFace/ModelScope model pages before commercial use or integration into closed-source applications.

Limitations & Caveats

  • The base model is not instruction-tuned and has limited instruction-following capabilities.
  • The model may still generate factually incorrect, misleading, biased, or harmful content, requiring careful user discretion.
Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 4 stars in the last 90 days
