Firefly-LLaMA2-Chinese offers open-source Chinese-English bilingual large language models based on LLaMA2. It addresses the need for efficient, low-resource incremental pre-training and instruction fine-tuning for various LLMs, including LLaMA2, Baichuan2, Qwen, and others. The project provides pre-trained and fine-tuned model weights, along with the full training code and datasets, enabling researchers and developers to replicate or build upon their work.
How It Works
The project employs a low-resource incremental pre-training approach, primarily using QLoRA. This involves expanding the vocabulary of base models like LLaMA2 with Chinese tokens to improve encoding efficiency, followed by incremental pre-training on a 22GB Chinese-English corpus. Subsequently, models undergo multi-turn instruction fine-tuning using a large dataset of Chinese and English conversational instructions. This methodology significantly reduces the GPU resources required compared to full fine-tuning, making advanced LLM customization more accessible.
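To make the QLoRA approach concrete, below is a minimal sketch of 4-bit quantized loading plus LoRA adapter injection using Hugging Face transformers, peft, and bitsandbytes. The base model id, LoRA rank, and target modules here are illustrative assumptions, not the project's exact configuration; the repository's own training scripts define the real settings.

```python
# Minimal QLoRA sketch: 4-bit NF4 quantization + LoRA adapters (illustrative, not the project's exact config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base; the project also supports Baichuan2, Qwen, etc.

# Load the frozen base model in 4-bit NF4 (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Prepare the quantized model for k-bit training and attach low-rank adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=64,                      # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because only the small adapter matrices receive gradients while the base weights stay quantized and frozen, the memory footprint drops far enough to fit 7B/13B training on a handful of V100s.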
Quick Start & Requirements
- Install/Run: Training code is available for incremental pre-training and instruction fine-tuning; inference scripts are provided in `script/chat` (see the inference sketch after this list).
- Prerequisites: Python, PyTorch. Specific hardware requirements depend on the model size (7B/13B) and training stage. QLoRA training was performed on 4x V100 GPUs.
- Resources: Training code and model weights are available on Hugging Face.
- Links: Hugging Face repo
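As a starting point, the released weights can be loaded for chat-style inference with Hugging Face transformers. This is a hedged sketch: the repo id, prompt, and generation parameters below are placeholders, and the project's `script/chat` scripts apply the proper conversation template, so prefer those for faithful results.

```python
# Sketch of loading a released checkpoint for generation (repo id is hypothetical).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/firefly-llama2-13b-chat"  # hypothetical; use the weights linked from the project's Hugging Face page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Please introduce yourself."  # real chat scripts wrap this in the model's conversation template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.7
)
# Strip the prompt tokens and decode only the newly generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```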
Highlighted Details
- Achieves competitive results on the Open LLM Leaderboard and CMMLU, surpassing models such as Linly and Yayi and roughly matching Ziya.
- Demonstrates significant resource efficiency, training models using only 4x V100 GPUs, a fraction of the resources used by comparable models.
- Open-sources a 22GB pre-training dataset and multi-turn instruction datasets, along with the complete training pipeline.
- Includes QLoRA-tuned versions of 7B and 13B models for even lower resource fine-tuning.
Maintenance & Community
- The project is actively maintained by yangjianxin1.
- Community discussion is encouraged via WeChat groups and Zhihu.
Licensing & Compatibility
- The project's specific license is not explicitly stated in the README, but it emphasizes adherence to the original models' open-source licenses. Users should verify compatibility for commercial or closed-source use.
Limitations & Caveats
- Models may generate inappropriate content due to their size and lack of explicit value alignment.
- The pre-training dataset is relatively small and heavily news-oriented, potentially impacting performance on certain Chinese tasks.
- Users must comply with the licensing terms of the base models used.