Chinese LLMs and datasets for pretraining/finetuning
Linly provides Chinese-centric large language models, including Chinese-LLaMA, Chinese-Falcon, and the conversational model Linly-ChatFlow. It aims to extend the capabilities of LLaMA and Falcon to Chinese through incremental pre-training on Chinese and mixed-language corpora, followed by large-scale instruction tuning. The project offers models at several sizes (3B, 7B, 13B, 33B, 70B) and provides code for data preparation, training, and evaluation, along with quantization options for deployment.
How It Works
The project builds upon existing foundational models like LLaMA and Falcon. It involves expanding their vocabularies with Chinese tokens and performing incremental pre-training on extensive Chinese and mixed-language datasets. For conversational capabilities, models undergo instruction fine-tuning using curated Chinese instruction datasets. The project emphasizes reproducibility, offering full parameter training details and code.
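The vocabulary-expansion step can be illustrated with the Hugging Face transformers API. This is a minimal sketch, not the project's actual training pipeline: the base checkpoint name and the token list are illustrative assumptions, and in practice the added tokens would number in the thousands, mined from a Chinese corpus.

```python
# Hedged sketch: expanding a LLaMA vocabulary with Chinese tokens.
# "meta-llama/Llama-2-7b-hf" and the token list are illustrative only.
from transformers import LlamaForCausalLM, LlamaTokenizer

base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = LlamaTokenizer.from_pretrained(base)
model = LlamaForCausalLM.from_pretrained(base)

# Add Chinese tokens that the original vocabulary would split into many bytes.
new_tokens = ["你好", "模型", "预训练"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding (and tied output) matrices so the new token ids have rows;
# the new rows start randomly initialized and are learned during incremental pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```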
Quick Start & Requirements
Quick-start inference is provided through the llama_inference code in the repository.
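For readers who prefer the standard Hugging Face interface, the following is a hedged sketch of loading a released checkpoint and generating text. The model id "Linly-AI/Chinese-LLaMA-2-7B-hf" is an assumption; substitute whichever Linly checkpoint you are actually using, and see the llama_inference code for the project's own path.

```python
# Hedged sketch: generation via transformers; the model id is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Linly-AI/Chinese-LLaMA-2-7B-hf"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a single consumer GPU
    device_map="auto",
)

prompt = "请介绍一下北京。"  # "Please introduce Beijing."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```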
Highlighted Details
Maintenance & Community
The project is actively developed with regular model updates. Community interaction is encouraged via GitHub issues.
Licensing & Compatibility
Limitations & Caveats
Models are trained on community-sourced data without manual curation, leading to potential weaknesses in multi-turn dialogue, logical reasoning, and knowledge Q&A. Models may generate biased or harmful content. The README notes that the models are still under development and their language capabilities are continuously improving.