Chinese-Mixtral-8x7B by HIT-SCIR

Chinese-Mixtral-8x7B: a base model for the Chinese language

created 1 year ago
650 stars

Top 52.3% on sourcepulse

View on GitHub
Project Summary

This project provides Chinese-Mixtral-8x7B, an open-source Mixture-of-Experts (MoE) large language model enhanced for Chinese language processing. It offers a base model and fine-tuned versions for researchers and developers who want to apply MoE architectures to Chinese NLP tasks; key benefits include improved Chinese tokenization efficiency and strong performance on both Chinese and English benchmarks.

How It Works

The project builds on Mistral AI's Mixtral-8x7B architecture by expanding its vocabulary with a custom Chinese token set trained with SentencePiece on Chinese datasets; the expanded vocabulary significantly improves Chinese tokenization efficiency. Incremental pre-training is then performed on the modified model with a large corpus of Chinese and English data, including SkyPile and SlimPajama, to give it strong Chinese understanding and generation capabilities. Training uses QLoRA for efficiency, combining 4-bit quantization with other memory-saving techniques.
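
As a rough illustration of the QLoRA recipe described above, the sketch below loads a 4-bit NF4-quantized base model and attaches LoRA adapters with peft. The repo ID, target modules, and hyperparameters are illustrative assumptions, not the project's published training configuration.

```python
# Minimal QLoRA sketch: 4-bit NF4 base model + LoRA adapters via peft.
# Repo ID and hyperparameters are illustrative assumptions, not the
# project's exact training setup.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "HIT-SCIR/Chinese-Mixtral-8x7B"  # assumed HuggingFace repo ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only, for brevity
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```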

Quick Start & Requirements

  • Install: pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 transformers==4.36.2 datasets evaluate peft accelerate gradio optimum sentencepiece trl jupyterlab scikit-learn pandas matplotlib tensorboard nltk rouge bitsandbytes fire
  • Prerequisites: Python 3.10+, CUDA Toolkit (e.g., 11.7.1), DeepSpeed, Flash Attention.
  • Model Download: Full model (88GB) or LoRA adapter (2.7GB) available on HuggingFace/ModelScope.
  • Resources: Inference requires significant GPU memory; 4-bit quantization via bitsandbytes reduces the footprint (see the sketch after this list). Training requires substantial resources and a multi-GPU setup.
  • Docs: HuggingFace, ModelScope
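
A minimal load-and-generate sketch, assuming the base model is published under the HuggingFace repo ID used below and that 4-bit quantization is used to fit the weights into available GPU memory; check the official model cards for exact names and hardware requirements.

```python
# Minimal inference sketch with 4-bit quantization (bitsandbytes).
# The repo ID is an assumption; consult the project's HuggingFace/ModelScope
# pages for the exact model names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "HIT-SCIR/Chinese-Mixtral-8x7B"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

# Base (non-instruction-tuned) model: prompt it as a text-completion model.
prompt = "人工智能的发展历史可以追溯到"  # "The history of AI can be traced back to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```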

Highlighted Details

  • Achieves state-of-the-art English performance among comparable Chinese-enhanced models, with strong Chinese understanding and generation capabilities.
  • Demonstrates a 41.5% improvement in Chinese tokenization efficiency compared to the original Mixtral-8x7B (a measurement sketch follows this list).
  • Training code for both incremental pre-training and instruction fine-tuning is open-sourced.
  • Supports acceleration via vLLM and Flash Attention 2, and quantization with bitsandbytes.
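
The tokenization-efficiency figure above can be sanity-checked by counting tokens produced for the same Chinese text under both tokenizers. The sketch below assumes the expanded tokenizer ships with the (assumed) HuggingFace repo; the exact reduction depends on the evaluation corpus.

```python
# Rough sanity check of Chinese tokenization efficiency: compare token counts
# for the same text under the original and expanded tokenizers.
# Fewer tokens for the same text means higher efficiency. Repo ID for the
# expanded tokenizer is an assumption.
from transformers import AutoTokenizer

text = "哈尔滨工业大学位于中国黑龙江省哈尔滨市。"  # "Harbin Institute of Technology is located in Harbin, Heilongjiang, China."

base = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
expanded = AutoTokenizer.from_pretrained("HIT-SCIR/Chinese-Mixtral-8x7B")

n_base = len(base.tokenize(text))
n_expanded = len(expanded.tokenize(text))
print(f"original Mixtral tokenizer: {n_base} tokens")
print(f"expanded Chinese tokenizer: {n_expanded} tokens")
print(f"reduction: {1 - n_expanded / n_base:.1%}")
```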

Maintenance & Community

  • The project is maintained by HIT-SCIR.
  • Recent releases include instruction-tuned models ("Huozi 3.0") and open-sourced fine-tuning code.

Licensing & Compatibility

  • The README does not explicitly state a license for the model weights; the upstream Mixtral-8x7B base is released under Apache 2.0. Verify the license on the project's HuggingFace/ModelScope model pages before commercial use or integration into closed-source applications.

Limitations & Caveats

  • The base model is not instruction-tuned and has limited instruction-following capabilities.
  • The model may still generate factually incorrect, misleading, biased, or harmful content, requiring careful user discretion.
Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 4 stars in the last 90 days
