Firefly-LLaMA2-Chinese by yangjianxin1

Chinese LLaMA-2 large language models with support for low-resource incremental pre-training

created 2 years ago
410 stars

Top 72.3% on sourcepulse

Project Summary

Firefly-LLaMA2-Chinese offers open-source Chinese-English bilingual large language models based on LLaMA2. It addresses the need for efficient, low-resource incremental pre-training and instruction fine-tuning for various LLMs, including LLaMA2, Baichuan2, Qwen, and others. The project provides pre-trained and fine-tuned model weights, along with the full training code and datasets, enabling researchers and developers to replicate or build upon their work.

How It Works

The project employs a low-resource incremental pre-training approach, primarily using QLoRA. This involves expanding the vocabulary of base models like LLaMA2 with Chinese tokens to improve encoding efficiency, followed by incremental pre-training on a 22GB Chinese-English corpus. Subsequently, models undergo multi-turn instruction fine-tuning using a large dataset of Chinese and English conversational instructions. This methodology significantly reduces the GPU resources required compared to full fine-tuning, making advanced LLM customization more accessible.
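The repository ships its own training entry points; the block below is only a minimal sketch of the QLoRA setup described above, using the Hugging Face transformers, peft, and bitsandbytes stack. The base model id, LoRA hyperparameters, and the token-expansion step are illustrative assumptions rather than the project's exact configuration.

```python
# Minimal QLoRA incremental pre-training sketch (assumptions: transformers,
# peft, and bitsandbytes installed; model id and hyperparameters are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# Load the base model in 4-bit NF4 so only the adapters need full-precision gradients.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Vocabulary expansion: after adding Chinese tokens to the tokenizer
# (the project ships its own extended tokenizer), resize the embedding
# matrix so the new token ids have trainable rows.
# tokenizer.add_tokens([...new Chinese tokens...])
model.resize_token_embeddings(len(tokenizer))

# Attach LoRA adapters to the attention projections; keep the (resized)
# embeddings and LM head trainable so the new tokens can be learned.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 7B parameters

# From here, train with transformers.Trainer (or the project's own scripts)
# on the Chinese-English pre-training corpus.
```

Instruction fine-tuning follows the same adapter-based pattern, with the multi-turn conversation data formatted into the model's chat template.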

Quick Start & Requirements

  • Install/Run: Training code is provided for both incremental pre-training and instruction fine-tuning. Inference scripts live in script/chat; a minimal transformers-based loading sketch follows this list.
  • Prerequisites: Python and PyTorch. Hardware requirements depend on the model size (7B/13B) and training stage; QLoRA training was performed on 4x V100 GPUs.
  • Resources: Model weights are hosted on Hugging Face; the training code and datasets are in the GitHub repository.
  • Links: Hugging Face repo
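A minimal chat-inference sketch with plain transformers is shown below; the model id is an assumption (one of the checkpoints the project publishes on Hugging Face), and the prompt format may differ from the template used by script/chat.

```python
# Hedged inference sketch: the model id is assumed, not verified against the
# project's model card; adjust the prompt format to match script/chat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YeungNLP/firefly-llama2-7b-chat"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Introduce the specialty food of Chengdu."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```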

Highlighted Details

  • Achieves competitive performance on the Open LLM Leaderboard and CMMLU, surpassing models such as Linly and Yayi and remaining competitive with Ziya.
  • Demonstrates significant resource efficiency, training models using only 4x V100 GPUs, a fraction of the resources used by comparable models.
  • Open-sources a 22GB pre-training dataset and multi-turn instruction datasets, along with the complete training pipeline.
  • Includes QLoRA-tuned versions of the 7B and 13B models for even lower-resource fine-tuning; see the adapter sketch after this list.
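If a particular checkpoint is distributed as QLoRA adapter weights rather than merged full weights (an assumption here; check the model card), it can be applied to the base model with peft. The repo ids below are placeholders:

```python
# Hypothetical adapter-loading sketch; base_id and adapter_id are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_id = "meta-llama/Llama-2-13b-hf"             # placeholder base model
adapter_id = "YeungNLP/firefly-llama2-13b-qlora"  # placeholder adapter repo

base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
# If the checkpoint uses the project's extended Chinese tokenizer, resize the
# base embeddings to the extended vocabulary before applying the adapter.
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally fold the LoRA weights into the base model for plain inference.
merged = model.merge_and_unload()
merged.save_pretrained("firefly-llama2-13b-merged")
```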

Maintenance & Community

  • The project is maintained by yangjianxin1; see the health check below for recent activity.
  • Community discussion is encouraged via WeChat groups and Zhihu.

Licensing & Compatibility

  • The project's specific license is not explicitly stated in the README, but it emphasizes adherence to the original models' open-source licenses. Users should verify compatibility for commercial or closed-source use.

Limitations & Caveats

  • Models may generate inappropriate content due to their size and lack of explicit value alignment.
  • The pre-training dataset is relatively small and heavily news-oriented, potentially impacting performance on certain Chinese tasks.
  • Users must comply with the licensing terms of the base models used.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days
