Firefly-LLaMA2-Chinese offers open-source Chinese-English bilingual large language models based on LLaMA2. It addresses the need for efficient, low-resource incremental pre-training and instruction fine-tuning for various LLMs, including LLaMA2, Baichuan2, Qwen, and others. The project provides pre-trained and fine-tuned model weights, along with the full training code and datasets, enabling researchers and developers to replicate or build upon their work.
How It Works
The project employs a low-resource incremental pre-training approach, primarily using QLoRA. This involves expanding the vocabulary of base models like LLaMA2 with Chinese tokens to improve encoding efficiency, followed by incremental pre-training on a 22GB Chinese-English corpus. Subsequently, models undergo multi-turn instruction fine-tuning using a large dataset of Chinese and English conversational instructions. This methodology significantly reduces the GPU resources required compared to full fine-tuning, making advanced LLM customization more accessible.
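To make the QLoRA approach concrete, below is a minimal sketch of 4-bit quantized loading plus LoRA adapter injection using Hugging Face transformers, peft, and bitsandbytes. The base model id, LoRA rank, and target modules here are illustrative assumptions, not the project's exact configuration; the repository's own training scripts define the real settings.

```python
# Minimal QLoRA sketch: 4-bit NF4 quantization + LoRA adapters (illustrative, not the project's exact config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base; the project also supports Baichuan2, Qwen, etc.

# Load the frozen base model in 4-bit NF4 (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Prepare the quantized model for k-bit training and attach low-rank adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=64,                      # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because only the small adapter matrices receive gradients while the base weights stay quantized and frozen, the memory footprint drops far enough to fit 7B/13B training on a handful of V100s.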
Quick Start & Requirements
- Install/Run: Training code is available for incremental pre-training and instruction fine-tuning; inference scripts are provided in `script/chat` (see the inference sketch after this list).
- Prerequisites: Python, PyTorch. Specific hardware requirements depend on the model size (7B/13B) and training stage. QLoRA training was performed on 4x V100 GPUs.
- Resources: Training code and model weights are available on Hugging Face.
- Links: Hugging Face repo
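As a starting point, the released weights can be loaded for chat-style inference with Hugging Face transformers. This is a hedged sketch: the repo id, prompt, and generation parameters below are placeholders, and the project's `script/chat` scripts apply the proper conversation template, so prefer those for faithful results.

```python
# Sketch of loading a released checkpoint for generation (repo id is hypothetical).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/firefly-llama2-13b-chat"  # hypothetical; use the weights linked from the project's Hugging Face page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Please introduce yourself."  # real chat scripts wrap this in the model's conversation template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.7
)
# Strip the prompt tokens and decode only the newly generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```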
Highlighted Details
- Achieves competitive results on the Open LLM Leaderboard and CMMLU, surpassing models such as Linly and Yayi and roughly matching Ziya.
- Demonstrates significant resource efficiency, training models using only 4x V100 GPUs, a fraction of the resources used by comparable models.
- Open-sources a 22GB pre-training dataset and multi-turn instruction datasets, along with the complete training pipeline.
- Includes QLoRA-tuned versions of 7B and 13B models for even lower resource fine-tuning.
Maintenance & Community
- The project is actively maintained by yangjianxin1.
- Community discussion is encouraged via WeChat groups and Zhihu.
Licensing & Compatibility
- The project's specific license is not explicitly stated in the README, but it emphasizes adherence to the original models' open-source licenses. Users should verify compatibility for commercial or closed-source use.
Limitations & Caveats
- Models may generate inappropriate content due to their size and lack of explicit value alignment.
- The pre-training dataset is relatively small and heavily news-oriented, potentially impacting performance on certain Chinese tasks.
- Users must comply with the licensing terms of the base models used.