deita by hkust-nlp

Data-efficient instruction tuning for LLM alignment (ICLR 2024)

Created 1 year ago · 567 stars · Top 56.7% on SourcePulse

View on GitHub

Project Summary

Deita provides toolkits for automatic data selection in instruction tuning for Large Language Models (LLMs), enabling efficient alignment with significantly less data. It offers pre-curated datasets (6K and 10K samples) and trained models that achieve state-of-the-art performance, making it well suited to researchers and developers who want to align LLMs cost-effectively.

How It Works

Deita employs automatic data selection strategies, including complexity and quality scoring, to curate high-quality instruction tuning datasets. It leverages scorer models (e.g., based on LLaMA) to evaluate data samples, allowing for the creation of smaller, more effective datasets. This approach contrasts with traditional methods that rely on much larger, less curated datasets, leading to faster and more efficient model training.
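
A minimal sketch of scoring a single sample, following the scorer interface shown in the repo's README; the import path, `Llama_Scorer` class, and checkpoint name are taken from that documentation and may differ in other versions:

```python
from deita.selection.scorer import Llama_Scorer

# LLaMA-based quality scorer published in the Deita HF repo
scorer = Llama_Scorer("hkust-nlp/deita-quality-scorer")

# Score one instruction/response pair; higher means higher estimated quality
instruction = "Suggest a word to describe a UI with helpful tooltips."
response = "User-friendly, or intuitive."
print(scorer.infer_quality(instruction, response))
```

The complexity scorer (`hkust-nlp/deita-complexity-scorer`) is used the same way via `infer_complexity`; per the paper, the two scores are combined with a diversity-aware selection step to pick the final subset.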

Quick Start & Requirements

  • Installation: git clone https://github.com/hkust-nlp/deita.git && cd deita && pip install -e .
  • Prerequisites: Python, Hugging Face libraries, DeepSpeed (for training). Optional: vllm for faster inference.
  • Setup: Cloning the repo and installing dependencies is quick (a dataset-loading check is sketched after this list). Training requires significant GPU resources.
  • Resources: Deita HF Repo, Paper, Deita Datasets
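
As a quick post-install check, the pre-curated datasets can be pulled straight from the Hugging Face Hub; the dataset IDs below are assumed from the Deita HF repo:

```python
from datasets import load_dataset

# 6K SFT dataset; swap in "hkust-nlp/deita-10k-v0" for the 10K variant
ds = load_dataset("hkust-nlp/deita-6k-v0", split="train")
print(len(ds), ds.column_names)
```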

Highlighted Details

  • Achieves competitive performance with significantly less data (e.g., 6K-10K samples vs. 200K+).
  • Models like DEITA-7B-v1.0 (6K SFT + 10K DPO) reach 7.55 on MT-Bench and 90.06% on AlpacaEval.
  • Datasets have been used by Hugging Face for models like Zephyr Gemma.
  • Offers pipelines for data scoring, embedding generation, and data filtering (see the sketch below).
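
A sketch of the scoring pipeline, following the `Pipeline` interface shown in the README; the keyword arguments mirror that example, and the file paths are hypothetical placeholders:

```python
from deita.pipeline import Pipeline

# Score each sample's complexity with the LLaMA-based scorer
pipeline = Pipeline(
    "score_pipeline",
    data_path="data/sharegpt_format.json",  # hypothetical input (ShareGPT-format JSON)
    scorer="llama",                         # scorer family, e.g. llama or mistral
    scorer_name_or_path="hkust-nlp/deita-complexity-scorer",
    is_vllm=False,                          # set True to run the scorer with vllm
    score_type="complexity",                # complexity or quality
    output_path="data/sharegpt_scored.json" # hypothetical output path
)
pipeline.run()
```

Per the README, embedding generation and filtering run through the same `Pipeline` entry point with their own arguments to produce the final filtered subset.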

Maintenance & Community

The project saw regular updates through March 2024, the date of its most recent releases. The README credits FastChat for the training code. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • Datasets: MIT License.
  • Scorers: LLaMA License.
  • Models: Apache-2.0 (for Mistral-based), LLaMA 2 License (for LLaMA-based).
  • Compatibility for commercial use depends on the specific model/dataset license; LLaMA-based components may have restrictions.

Limitations & Caveats

The project is described as a "preview version," with plans for future updates including a CLI interface and more data selection strategies. The LLaMA and LLaMA 2 licenses for some components may impose restrictions on commercial use or redistribution.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

xTuring by stochasticai (3k stars, 0.0%)
SDK for fine-tuning and customizing open-source LLMs
Created 2 years ago · Updated 1 day ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (Core Maintainer at Alibaba Qwen), and 3 more.

Alpaca-CoT by PhoebusSi (3k stars, 0.1%)
IFT platform for instruction collection, parameter-efficient methods, and LLMs
Created 2 years ago · Updated 1 year ago

Starred by Vincent Weisser (Cofounder of Prime Intellect), Ross Taylor (Cofounder of General Reasoning; Cocreator of Papers with Code), and 11 more.

open-instruct by allenai (3k stars, 0.7%)
Training codebase for instruction-following language models
Created 2 years ago · Updated 17 hours ago