Linly by CVI-SZU

Chinese LLMs and datasets for pretraining/finetuning

created 2 years ago
3,055 stars

Top 16.0% on sourcepulse

View on GitHub
Project Summary

Linly provides Chinese-centric large language models, including Chinese-LLaMA, Chinese-Falcon, and the conversational model Linly-ChatFlow. It aims to extend LLaMA and Falcon's capabilities to Chinese through incremental pre-training on Chinese and mixed-language corpora, followed by large-scale instruction tuning. The project offers models of various sizes (3B, 7B, 13B, 33B, 70B) and provides code for data preparation, training, and evaluation, along with quantization options for deployment.

How It Works

The project builds on existing foundation models such as LLaMA and Falcon: their vocabularies are expanded with Chinese tokens, and the models are incrementally pre-trained on extensive Chinese and mixed-language corpora. For conversational capabilities, the models then undergo instruction fine-tuning on curated Chinese instruction datasets. The project emphasizes reproducibility, providing complete training details and code.
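
A minimal Python sketch of the vocabulary-expansion step, assuming the Hugging Face transformers API; the base checkpoint and the added tokens are placeholders, and the project's actual training pipeline uses TencentPretrain rather than this code.

    # Illustrative only: extend a base tokenizer with Chinese tokens and
    # resize the model's embeddings so incremental pre-training can follow.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    base_id = "huggyllama/llama-7b"  # placeholder base checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model = AutoModelForCausalLM.from_pretrained(base_id)

    # Add a handful of hypothetical Chinese tokens; the real project merges a
    # full Chinese vocabulary into the tokenizer.
    num_added = tokenizer.add_tokens(["深圳", "大学", "语言模型"])

    # Resize the embedding matrix so the new tokens get trainable vectors;
    # incremental pre-training on Chinese corpora would start from here.
    model.resize_token_embeddings(len(tokenizer))
    print(f"added {num_added} tokens, new vocab size: {len(tokenizer)}")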

Quick Start & Requirements

  • Installation: Models are distributed in Hugging Face format; inference can be run via the llama_inference repository (see the Python sketch after this list).
  • Prerequisites: Python 3.8+, CUDA 11.2+, PyTorch 1.9+, bitsandbytes 0.37.2.
  • Resources: 7B models require ~14 GB VRAM (~7 GB in INT8); 13B models require ~28 GB VRAM (~14 GB in INT8). Training requires significant GPU resources.
  • Links: Hugging Face models, llama_inference, TencentPretrain.
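
A minimal Python sketch of inference with a Hugging Face format checkpoint; the repo ID below is an assumed placeholder, not necessarily the project's actual model card, and the llama_inference scripts remain the documented path.

    # Hedged sketch: load a Hugging Face format chat model and generate a reply.
    # The repo ID is a placeholder; substitute the actual Linly model card name.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "Linly-AI/ChatFlow-7B"  # placeholder repo ID
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    prompt = "请介绍一下自然语言处理。"  # "Please introduce natural language processing."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
    print(tokenizer.decode(output[0], skip_special_tokens=True))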

Highlighted Details

  • Offers both base and chat models, with benchmarks provided for Linly-70B.
  • Includes an Apache 2.0 licensed Linly-OpenLLaMA model trained from scratch.
  • Supports multiple quantization methods (INT8, INT4) for efficient deployment (see the quantized-loading sketch after this list).
  • Provides full training pipeline code for transparency and reproducibility.
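
A hedged Python sketch of the INT8 option, assuming the standard transformers + bitsandbytes route implied by the prerequisites above; the repo ID is again a placeholder.

    # INT8 loading via bitsandbytes: weights are quantized at load time,
    # roughly halving the FP16 VRAM figures listed under Resources
    # (e.g. ~7 GB instead of ~14 GB for a 7B model).
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "Linly-AI/ChatFlow-7B"  # placeholder repo ID
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_8bit=True,
        device_map="auto",
    )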

Maintenance & Community

The project is actively developed with regular model updates. Community interaction is encouraged via GitHub issues.

Licensing & Compatibility

  • Code & Documents: Apache License 2.0.
  • Pre-trained Weights: GNU General Public License v3.0 (following LLaMA), restricting commercial use for LLaMA-based models. Chinese-Falcon models are under Apache 2.0, permitting commercial use.

Limitations & Caveats

Models are trained on community-sourced data without manual curation, leading to potential weaknesses in multi-turn dialogue, logical reasoning, and knowledge Q&A. Models may generate biased or harmful content. The README notes that the models are still under development and their language capabilities are continuously improving.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
13 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Ying Sheng (author of SGLang), and 9 more.

alpaca-lora by tloen

LoRA fine-tuning for LLaMA
0.0% · 19k stars
created 2 years ago · updated 1 year ago