BELLE by LianjiaTech

Chinese LLM engine for democratized access and instruction tuning

created 2 years ago
8,211 stars

Top 6.4% on sourcepulse

View on GitHub
Project Summary

BELLE (Be Everyone's Large Language model Engine) is an open-source project for Chinese conversational large language models, aiming to lower the barrier to LLM research and application, particularly for Chinese. It focuses on providing accessible instruction-following models and training data, enabling users to develop their own high-quality conversational AI.

How It Works

BELLE fine-tunes existing large language models, primarily LLaMA and BLOOMZ, using a substantial corpus of Chinese conversational data. The project emphasizes the impact of training data quality, quantity, and language distribution on model performance, exploring techniques like vocabulary expansion and efficient fine-tuning methods such as LoRA.
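
As a rough illustration of the LoRA approach mentioned above (not BELLE's own training pipeline, which ships in the repo), here is a minimal fine-tuning setup using Hugging Face transformers and peft; the base model id, target modules, and hyperparameters are placeholder assumptions:

```python
# Minimal LoRA setup sketch (illustrative; not BELLE's released training code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model = "bigscience/bloomz-7b1-mt"  # placeholder base model id

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

# Attach low-rank adapters; only these small matrices are updated during training,
# which is what makes LoRA cheap relative to full-parameter fine-tuning.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                # adapter rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # module names depend on the base architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of total parameters
```

From here the wrapped model can be passed to a standard transformers Trainer over an instruction-tuning dataset.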

Quick Start & Requirements

  • Installation: Models and datasets are distributed primarily through Hugging Face repositories; see the inference sketch after this list.
  • Prerequisites: Access to base models (e.g., LLaMA) is required due to licensing. Specific models may require significant GPU resources for inference and fine-tuning.
  • Resources: Fine-tuning requires substantial GPU memory (e.g., 8x NVIDIA A100-40GB for reported experiments). Quantized models (GPTQ) are available for reduced inference requirements.
  • Links: Hugging Face repos, BELLE-2, Discord.
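
A minimal inference sketch, assuming the BelleGroup/BELLE-7B-2M checkpoint on Hugging Face as an example; model ids and the exact prompt template vary by release, so check the corresponding model card:

```python
# Load a BELLE checkpoint from Hugging Face for inference (model id assumed;
# LLaMA-based releases instead ship as diffs/XOR files that must be merged first).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BelleGroup/BELLE-7B-2M"  # example checkpoint, assumed for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# "Human: ... Assistant:" style prompt; confirm the template on the model card.
prompt = "Human: 用一句话介绍一下北京。\n\nAssistant: "  # "Introduce Beijing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```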

Highlighted Details

  • Offers a range of fine-tuned models based on LLaMA and BLOOMZ, with specific Chinese language enhancements.
  • Provides extensive training code, including support for DeepSpeed-Chat, LoRA, and PPO/DPO.
  • Releases curated datasets for instruction tuning and evaluation, with ongoing contributions.
  • Developed multilingual speech recognition models (Belle-whisper) with significant improvements on Chinese speech recognition; a transcription sketch follows this list.
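
For the Belle-whisper speech models, a hedged transcription sketch via the transformers ASR pipeline; the checkpoint name and audio path are assumptions, so substitute the released ids from the project's Hugging Face pages:

```python
# Chinese speech transcription with a Belle-whisper checkpoint (model id assumed).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="BELLE-2/Belle-whisper-large-v3-zh",  # assumed checkpoint name
)

# "sample.wav" is a placeholder path to a local audio file.
result = asr("sample.wav", generate_kwargs={"language": "zh", "task": "transcribe"})
print(result["text"])
```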

Maintenance & Community

The project is actively maintained by the BELLEGroup, with regular updates on new models, research reports, and training code. Community engagement is encouraged via Discord and WeChat.

Licensing & Compatibility

  • Code License: Apache 2.0.
  • Model Weights: Subject to the original base model licenses (e.g., LLaMA's non-commercial research use). Model weights are often distributed as diffs or XOR files to comply with these restrictions.

Limitations & Caveats

Models may produce factually incorrect or harmful responses and require further improvement in reasoning, coding, and multi-turn dialogue. The project explicitly states models are for research purposes only and prohibits commercial or harmful use. The evaluation methodology has limitations, and reported scores may not fully reflect real-world user experience.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

103 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 2 more.

ChatGLM-6B by zai-org

Bilingual dialogue language model for research
Top 0.1% · 41k stars · created 2 years ago · updated 1 year ago