Linly by CVI-SZU

Chinese LLMs and datasets for pretraining/finetuning

Created 2 years ago
3,060 stars

Top 15.7% on SourcePulse

Project Summary

Linly provides Chinese-centric large language models, including Chinese-LLaMA, Chinese-Falcon, and the conversational model Linly-ChatFlow. It aims to extend LLaMA and Falcon's capabilities to Chinese through incremental pre-training on Chinese and mixed-language corpora, followed by large-scale instruction tuning. The project offers models of various sizes (3B, 7B, 13B, 33B, 70B) and provides code for data preparation, training, and evaluation, along with quantization options for deployment.

How It Works

The project builds upon existing foundational models like LLaMA and Falcon. It involves expanding their vocabularies with Chinese tokens and performing incremental pre-training on extensive Chinese and mixed-language datasets. For conversational capabilities, models undergo instruction fine-tuning using curated Chinese instruction datasets. The project emphasizes reproducibility, offering full parameter training details and code.
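
To make the vocabulary-expansion step concrete, here is a minimal sketch using Hugging Face transformers. The base checkpoint name and the token list are illustrative assumptions, not the project's actual training code; in practice the expansion covers a much larger set of Chinese tokens before incremental pre-training begins.

    # Minimal sketch of vocabulary expansion before incremental pre-training.
    # The checkpoint name and token list are illustrative assumptions only.
    from transformers import LlamaForCausalLM, LlamaTokenizer

    base = "huggyllama/llama-7b"  # assumed stand-in for the LLaMA base model
    tokenizer = LlamaTokenizer.from_pretrained(base)
    model = LlamaForCausalLM.from_pretrained(base)

    # Add Chinese tokens missing from LLaMA's mostly-Latin vocabulary;
    # the real expansion covers a much larger token set.
    num_added = tokenizer.add_tokens(["语言模型", "预训练", "指令微调"])

    # Grow the embedding matrix so the new tokens get trainable vectors,
    # which incremental pre-training on Chinese corpora then learns.
    model.resize_token_embeddings(len(tokenizer))
    print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")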

Quick Start & Requirements

  • Installation: Models are distributed primarily in Hugging Face format; inference can be run via the llama_inference repository (a minimal loading sketch follows this list).
  • Prerequisites: Python 3.8+, CUDA 11.2+, PyTorch 1.9+, bitsandbytes 0.37.2.
  • Resources: 7B models require ~14GB VRAM (7GB in INT8), 13B models require ~28GB VRAM (14GB in INT8). Training requires significant GPU resources.
  • Links: Hugging Face models, llama_inference, TencentPretrain.
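
As referenced above, a hypothetical loading example using the generic transformers API; the repository ID below is an assumption, so check the project's model hub page for the exact names.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Linly-AI/Chinese-LLaMA-2-7B-hf"  # assumed repo ID
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # fp16 = 2 bytes/param, hence ~14GB for 7B
        device_map="auto",
    )

    prompt = "请介绍一下大语言模型。"  # "Please introduce large language models."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))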

Highlighted Details

  • Offers both base and chat models, with benchmarks provided for Linly-70B.
  • Includes an Apache 2.0 licensed Linly-OpenLLaMA model trained from scratch.
  • Supports multiple quantization methods (INT8, INT4) for efficient deployment; an INT8 loading sketch follows this list.
  • Provides full training pipeline code for transparency and reproducibility.
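
The INT8 path can be sketched with the generic transformers + bitsandbytes route shown below (the project's own llama_inference repo ships its own quantized-inference code; the repo ID here is again an assumption). INT8 stores roughly one byte per weight, which is why the memory figures above halve relative to fp16.

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # INT8 weights: ~1 byte/param, so a 7B model fits in roughly 7GB of VRAM
    model = AutoModelForCausalLM.from_pretrained(
        "Linly-AI/Chinese-LLaMA-2-7B-hf",  # assumed repo ID
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )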

Maintenance & Community

Model updates were released regularly during active development, but activity has since slowed (see Health Check below). Community interaction is encouraged via GitHub issues.

Licensing & Compatibility

  • Code & Documents: Apache License 2.0.
  • Pre-trained Weights: GNU General Public License v3.0 (following LLaMA); commercial use of the LLaMA-based models is restricted. Chinese-Falcon models are released under Apache 2.0, which permits commercial use.

Limitations & Caveats

Models are trained on community-sourced data without manual curation, leading to potential weaknesses in multi-turn dialogue, logical reasoning, and knowledge Q&A. Models may generate biased or harmful content. The README notes that the models are still under development and their language capabilities are continuously improving.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days
