BELLE by LianjiaTech

Chinese LLM engine for democratized access and instruction tuning

created 2 years ago
8,211 stars

Top 6.4% on sourcepulse

View on GitHub
Project Summary

BELLE (Be Everyone's Large Language model Engine) is an open-source project for Chinese conversational large language models, aiming to lower the barrier to LLM research and application, particularly for Chinese. It focuses on providing accessible instruction-following models and training data, enabling users to develop their own high-quality conversational AI.

How It Works

BELLE fine-tunes existing large language models, primarily LLaMA and BLOOMZ, using a substantial corpus of Chinese conversational data. The project emphasizes the impact of training data quality, quantity, and language distribution on model performance, exploring techniques like vocabulary expansion and efficient fine-tuning methods such as LoRA.
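
As a rough illustration of the LoRA approach mentioned above (not BELLE's own training pipeline, which ships in the repo), here is a minimal fine-tuning setup using Hugging Face transformers and peft; the base model id, target modules, and hyperparameters are placeholder assumptions:

```python
# Minimal LoRA setup sketch (illustrative; not BELLE's released training code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model = "bigscience/bloomz-7b1-mt"  # placeholder base model id

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

# Attach low-rank adapters; only these small matrices are updated during training,
# which is what makes LoRA cheap relative to full-parameter fine-tuning.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                # adapter rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # module names depend on the base architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of total parameters
```

From here the wrapped model can be passed to a standard transformers Trainer over an instruction-tuning dataset.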

Quick Start & Requirements

  • Installation: Models and datasets are distributed primarily through Hugging Face repositories; see the inference sketch after this list.
  • Prerequisites: Access to base models (e.g., LLaMA) is required due to licensing. Specific models may require significant GPU resources for inference and fine-tuning.
  • Resources: Fine-tuning requires substantial GPU memory (e.g., 8x NVIDIA A100-40GB for reported experiments). Quantized models (GPTQ) are available for reduced inference requirements.
  • Links: Hugging Face repos, BELLE-2, Discord.
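
A minimal inference sketch, assuming the BelleGroup/BELLE-7B-2M checkpoint on Hugging Face as an example; model ids and the exact prompt template vary by release, so check the corresponding model card:

```python
# Load a BELLE checkpoint from Hugging Face for inference (model id assumed;
# LLaMA-based releases instead ship as diffs/XOR files that must be merged first).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BelleGroup/BELLE-7B-2M"  # example checkpoint, assumed for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# "Human: ... Assistant:" style prompt; confirm the template on the model card.
prompt = "Human: 用一句话介绍一下北京。\n\nAssistant: "  # "Introduce Beijing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```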

Highlighted Details

  • Offers a range of fine-tuned models based on LLaMA and BLOOMZ, with specific Chinese language enhancements.
  • Provides extensive training code, including support for DeepSpeed-Chat, LoRA, and PPO/DPO.
  • Releases curated datasets for instruction tuning and evaluation, with ongoing contributions.
  • Developed multilingual speech recognition models (Belle-whisper) with significant improvements on Chinese speech recognition; a transcription sketch follows this list.
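
For the Belle-whisper speech models, a hedged transcription sketch via the transformers ASR pipeline; the checkpoint name and audio path are assumptions, so substitute the released ids from the project's Hugging Face pages:

```python
# Chinese speech transcription with a Belle-whisper checkpoint (model id assumed).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="BELLE-2/Belle-whisper-large-v3-zh",  # assumed checkpoint name
)

# "sample.wav" is a placeholder path to a local audio file.
result = asr("sample.wav", generate_kwargs={"language": "zh", "task": "transcribe"})
print(result["text"])
```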

Maintenance & Community

The project is actively maintained by the BELLEGroup, with regular updates on new models, research reports, and training code. Community engagement is encouraged via Discord and WeChat.

Licensing & Compatibility

  • Code License: Apache 2.0.
  • Model Weights: Subject to the original base model licenses (e.g., LLaMA's non-commercial research use). Model weights are often distributed as diffs or XOR files to comply with these restrictions.

Limitations & Caveats

Models may produce factually incorrect or harmful responses and require further improvement in reasoning, coding, and multi-turn dialogue. The project explicitly states models are for research purposes only and prohibits commercial or harmful use. The evaluation methodology has limitations, and reported scores may not fully reflect real-world user experience.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

103 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 2 more.

ChatGLM-6B by zai-org

Bilingual dialogue language model for research
Top 0.1% · 41k stars · created 2 years ago · updated 1 year ago