deita by hkust-nlp

Data-efficient instruction tuning for LLM alignment (ICLR 2024)

created 1 year ago
560 stars

Top 58.2% on sourcepulse

Project Summary

Deita provides toolkits for automatic data selection in instruction tuning of Large Language Models (LLMs), enabling efficient alignment with significantly less data. It offers pre-curated datasets (6K and 10K samples) and trained models that achieve state-of-the-art performance, making it suitable for researchers and developers aiming to improve LLM alignment cost-effectively.

How It Works

Deita employs automatic data selection strategies, including complexity and quality scoring, to curate high-quality instruction tuning datasets. It leverages scorer models (e.g., based on LLaMA) to evaluate data samples, allowing for the creation of smaller, more effective datasets. This approach contrasts with traditional methods that rely on much larger, less curated datasets, leading to faster and more efficient model training.
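
As an illustration of this scoring interface, here is a minimal sketch of scoring a single instruction/response pair. The Llama_Scorer class and the scorer checkpoint names are taken from the repo's README and may differ in the installed version, so verify them before relying on this.

    # Minimal sketch: scoring one instruction/response pair with the Deita scorers.
    # The Llama_Scorer class and the checkpoints below follow the repo's README
    # (hkust-nlp/deita-quality-scorer, hkust-nlp/deita-complexity-scorer);
    # verify the exact names against the installed version.
    from deita.selection.scorer import Llama_Scorer

    quality_scorer = Llama_Scorer("hkust-nlp/deita-quality-scorer")
    complexity_scorer = Llama_Scorer("hkust-nlp/deita-complexity-scorer")

    instruction = "Suggest a word to describe a UI with helpful tooltips."
    response = "User-friendly or intuitive UI."

    # Quality is judged on the (instruction, response) pair;
    # complexity is judged on the instruction alone.
    print(quality_scorer.infer_quality(instruction, response))
    print(complexity_scorer.infer_complexity(instruction))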

Quick Start & Requirements

  • Installation: git clone https://github.com/hkust-nlp/deita.git && cd deita && pip install -e .
  • Prerequisites: Python, Hugging Face libraries, DeepSpeed (for training). Optional: vllm for faster inference.
  • Setup: Cloning the repo and installing dependencies is quick. Training requires significant GPU resources.
  • Resources: Deita HF Repo, Paper, Deita Datasets (see the dataset-loading sketch after this list)
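
As a quick post-install check, the curated datasets can be pulled straight from the Hugging Face Hub. This is a sketch, not official usage: the dataset id hkust-nlp/deita-6k-v0 is an assumption based on the project's Hugging Face repo, with a 10K variant published alongside it.

    # Sketch: loading the curated 6K SFT dataset from the Hugging Face Hub.
    # The dataset id (hkust-nlp/deita-6k-v0) is assumed from the project's HF repo;
    # swap in the 10K variant if preferred.
    from datasets import load_dataset

    deita_6k = load_dataset("hkust-nlp/deita-6k-v0", split="train")
    print(len(deita_6k))       # number of curated samples
    print(deita_6k[0].keys())  # inspect the fields of one multi-turn sample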

Highlighted Details

  • Achieves competitive performance with significantly less training data (6K–10K curated samples vs. 200K+ in typical instruction-tuning sets).
  • Models such as DEITA-7B-v1.0 (6K SFT + 10K DPO) score 7.55 on MT-Bench and 90.06% on AlpacaEval (see the usage sketch after this list).
  • Datasets have been used by Hugging Face for models like Zephyr Gemma.
  • Offers pipelines for data scoring, embedding generation, and data filtering.
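
For completeness, a released checkpoint can be queried like any Hugging Face causal LM. This sketch assumes the checkpoint id hkust-nlp/deita-7b-v1.0 and that its tokenizer ships a chat template; if it does not, format the prompt in the FastChat/Vicuna style the training code builds on.

    # Sketch: generating a response with a released Deita checkpoint.
    # Assumes the checkpoint id hkust-nlp/deita-7b-v1.0 and that its tokenizer
    # provides a chat template; adjust for the checkpoint you actually use.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "hkust-nlp/deita-7b-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = [{"role": "user", "content": "Explain instruction tuning in one sentence."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))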

Maintenance & Community

The project is actively maintained, with releases announced as recently as March 2024. The training code builds on FastChat. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • Datasets: MIT License.
  • Scorers: LLaMA License.
  • Models: Apache-2.0 (for Mistral-based), LLaMA 2 License (for LLaMA-based).
  • Compatibility for commercial use depends on the specific model/dataset license; LLaMA-based components may have restrictions.

Limitations & Caveats

The project is described as a "preview version," with plans for future updates including a CLI interface and more data selection strategies. The LLaMA and LLaMA 2 licenses for some components may impose restrictions on commercial use or redistribution.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 9 more.

alpaca-lora by tloen
  • LoRA fine-tuning for LLaMA
  • 19k stars; created 2 years ago, updated 1 year ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), John Yang (Author of SWE-bench, SWE-agent), and 13 more.

stanford_alpaca by tatsu-lab
  • Instruction-following LLaMA model training and data generation
  • 30k stars; created 2 years ago, updated 1 year ago