deita by hkust-nlp

Data-efficient instruction tuning for LLM alignment (ICLR 2024)

Created 2 years ago
580 stars

Top 55.8% on SourcePulse

View on GitHub
Project Summary

Deita provides toolkits for automatic data selection in instruction tuning for large language models (LLMs), enabling efficient alignment with significantly less data. It offers pre-curated datasets (6K and 10K samples) and models that achieve state-of-the-art performance, making it suitable for researchers and developers aiming to improve LLM alignment cost-effectively.

How It Works

Deita employs automatic data selection strategies, including complexity and quality scoring, to curate high-quality instruction tuning datasets. It leverages scorer models (e.g., based on LLaMA) to evaluate data samples, allowing for the creation of smaller, more effective datasets. This approach contrasts with traditional methods that rely on much larger, less curated datasets, leading to faster and more efficient model training.
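
The scorers are published as standalone checkpoints. A minimal sketch of scoring one sample, assuming the Llama_Scorer interface exposed under deita.selection.scorer and the released hkust-nlp/deita-quality-scorer checkpoint, both taken from the project's documentation:

    from deita.selection.scorer import Llama_Scorer

    # Quality scorer checkpoint published by the project; a companion
    # complexity scorer ("hkust-nlp/deita-complexity-scorer") is
    # released alongside it with a matching interface.
    scorer = Llama_Scorer("hkust-nlp/deita-quality-scorer")

    instruction = "Suggest a word to describe a UI with helpful tooltips."
    response = "User-friendly or intuitive UI."

    # Returns a scalar quality rating for the instruction-response pair;
    # higher scores mark more valuable training samples.
    quality_score = scorer.infer_quality(instruction, response)
    print(quality_score)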

Quick Start & Requirements

  • Installation: git clone https://github.com/hkust-nlp/deita.git && cd deita && pip install -e .
  • Prerequisites: Python, Hugging Face libraries, DeepSpeed (for training). Optional: vllm for faster inference.
  • Setup: Cloning the repo and installing dependencies is quick. Training requires significant GPU resources.
  • Resources: Deita HF Repo, Paper, Deita Datasets
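
The pre-curated datasets can be pulled directly from the Hugging Face Hub. A minimal sketch, assuming the dataset id hkust-nlp/deita-6k-v0 listed in the Deita HF repo:

    from datasets import load_dataset

    # 6K-sample curated SFT set; a 10K variant ("hkust-nlp/deita-10k-v0")
    # is published alongside it (ids assumed from the project's HF page).
    dataset = load_dataset("hkust-nlp/deita-6k-v0", split="train")
    print(len(dataset), dataset[0])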

Highlighted Details

  • Achieves competitive performance with significantly less data (e.g., 6K-10K vs. 200K+).
  • Models like DEITA-7B-v1.0 (6K SFT + 10K DPO) score 7.55 on MT-Bench and 90.06% on AlpacaEval.
  • Datasets have been used by Hugging Face for models like Zephyr Gemma.
  • Offers pipelines for data scoring, embedding generation, and data filtering.
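
These pipelines compose into a score-first, diversity-aware selection: candidates are visited in descending combined-score order and kept only if they sit far enough, in embedding space, from everything already selected. A minimal sketch of that idea; the function name and distance threshold below are illustrative, not Deita's actual API:

    import numpy as np

    def select_subset(samples, scores, embeddings, budget, min_dist=0.1):
        """Greedy score-first selection with a diversity filter.

        samples:    instruction-response pairs
        scores:     combined complexity x quality score per sample
        embeddings: one vector per sample (e.g., from a sentence encoder)
        budget:     target subset size, e.g., 6000
        min_dist:   minimum cosine distance to the nearest already-selected
                    neighbor (illustrative threshold, not Deita's default)
        """
        order = np.argsort(scores)[::-1]  # highest combined score first
        selected, kept = [], []
        for i in order:
            v = embeddings[i] / np.linalg.norm(embeddings[i])
            # Skip candidates too similar to an already-selected example.
            if kept and 1.0 - max(float(v @ u) for u in kept) < min_dist:
                continue
            selected.append(samples[i])
            kept.append(v)
            if len(selected) == budget:
                break
        return selected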

Maintenance & Community

Releases are documented through March 2024, but commit activity has since slowed (see Health Check below). The training code is adapted from FastChat. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • Datasets: MIT License.
  • Scorers: LLaMA License.
  • Models: Apache-2.0 (for Mistral-based), LLaMA 2 License (for LLaMA-based).
  • Compatibility for commercial use depends on the specific model/dataset license; LLaMA-based components may have restrictions.

Limitations & Caveats

The project is described as a "preview version," with a CLI interface and additional data selection strategies planned. The LLaMA and LLaMA 2 licenses covering some components may restrict commercial use or redistribution.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

xTuring by stochasticai
SDK for fine-tuning and customizing open-source LLMs
3k stars (0% in 30d) · Created 2 years ago · Updated 1 week ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (Core Maintainer at Alibaba Qwen), and 3 more.

Alpaca-CoT by PhoebusSi
IFT platform for instruction collection, parameter-efficient methods, and LLMs
3k stars (0.0% in 30d) · Created 2 years ago · Updated 2 years ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 13 more.

open-instruct by allenai
Training codebase for instruction-following language models
4k stars (0.5% in 30d) · Created 2 years ago · Updated 1 day ago