deita by hkust-nlp

Data-efficient instruction tuning for LLM alignment (ICLR 2024)

created 1 year ago
560 stars

Top 58.2% on sourcepulse

Project Summary

Deita provides toolkits for automatic data selection in instruction tuning of Large Language Models (LLMs), enabling efficient alignment with significantly less data. It offers pre-curated datasets (6K and 10K samples) and trained models that achieve state-of-the-art performance, making it suitable for researchers and developers aiming to improve LLM alignment cost-effectively.

How It Works

Deita employs automatic data selection strategies, including complexity and quality scoring, to curate high-quality instruction tuning datasets. It leverages scorer models (e.g., based on LLaMA) to evaluate data samples, allowing for the creation of smaller, more effective datasets. This approach contrasts with traditional methods that rely on much larger, less curated datasets, leading to faster and more efficient model training.
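
As an illustration of this scoring interface, here is a minimal sketch of scoring a single instruction/response pair. The Llama_Scorer class and the scorer checkpoint names are taken from the repo's README and may differ in the installed version, so verify them before relying on this.

    # Minimal sketch: scoring one instruction/response pair with the Deita scorers.
    # The Llama_Scorer class and the checkpoints below follow the repo's README
    # (hkust-nlp/deita-quality-scorer, hkust-nlp/deita-complexity-scorer);
    # verify the exact names against the installed version.
    from deita.selection.scorer import Llama_Scorer

    quality_scorer = Llama_Scorer("hkust-nlp/deita-quality-scorer")
    complexity_scorer = Llama_Scorer("hkust-nlp/deita-complexity-scorer")

    instruction = "Suggest a word to describe a UI with helpful tooltips."
    response = "User-friendly or intuitive UI."

    # Quality is judged on the (instruction, response) pair;
    # complexity is judged on the instruction alone.
    print(quality_scorer.infer_quality(instruction, response))
    print(complexity_scorer.infer_complexity(instruction))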

Quick Start & Requirements

  • Installation: git clone https://github.com/hkust-nlp/deita.git && cd deita && pip install -e .
  • Prerequisites: Python, Hugging Face libraries, DeepSpeed (for training). Optional: vllm for faster inference.
  • Setup: Cloning the repo and installing dependencies is quick. Training requires significant GPU resources.
  • Resources: Deita HF Repo, Paper, Deita Datasets (see the dataset-loading sketch after this list)
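
As a quick post-install check, the curated datasets can be pulled straight from the Hugging Face Hub. This is a sketch, not official usage: the dataset id hkust-nlp/deita-6k-v0 is an assumption based on the project's Hugging Face repo, with a 10K variant published alongside it.

    # Sketch: loading the curated 6K SFT dataset from the Hugging Face Hub.
    # The dataset id (hkust-nlp/deita-6k-v0) is assumed from the project's HF repo;
    # swap in the 10K variant if preferred.
    from datasets import load_dataset

    deita_6k = load_dataset("hkust-nlp/deita-6k-v0", split="train")
    print(len(deita_6k))       # number of curated samples
    print(deita_6k[0].keys())  # inspect the fields of one multi-turn sample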

Highlighted Details

  • Achieves competitive performance with significantly less training data (6K–10K curated samples vs. 200K+ in typical instruction-tuning sets).
  • Models such as DEITA-7B-v1.0 (6K SFT + 10K DPO) score 7.55 on MT-Bench and 90.06% on AlpacaEval (see the usage sketch after this list).
  • Datasets have been used by Hugging Face for models like Zephyr Gemma.
  • Offers pipelines for data scoring, embedding generation, and data filtering.
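
For completeness, a released checkpoint can be queried like any Hugging Face causal LM. This sketch assumes the checkpoint id hkust-nlp/deita-7b-v1.0 and that its tokenizer ships a chat template; if it does not, format the prompt in the FastChat/Vicuna style the training code builds on.

    # Sketch: generating a response with a released Deita checkpoint.
    # Assumes the checkpoint id hkust-nlp/deita-7b-v1.0 and that its tokenizer
    # provides a chat template; adjust for the checkpoint you actually use.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "hkust-nlp/deita-7b-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = [{"role": "user", "content": "Explain instruction tuning in one sentence."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))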

Maintenance & Community

The project is actively maintained, with releases announced as recently as March 2024. The training code builds on FastChat. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • Datasets: MIT License.
  • Scorers: LLaMA License.
  • Models: Apache-2.0 (for Mistral-based), LLaMA 2 License (for LLaMA-based).
  • Compatibility for commercial use depends on the specific model/dataset license; LLaMA-based components may have restrictions.

Limitations & Caveats

The project is described as a "preview version," with plans for future updates including a CLI interface and more data selection strategies. The LLaMA and LLaMA 2 licenses for some components may impose restrictions on commercial use or redistribution.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 9 more.

alpaca-lora by tloen
  • LoRA fine-tuning for LLaMA
  • 19k stars; created 2 years ago, updated 1 year ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), John Yang (Author of SWE-bench, SWE-agent), and 13 more.

stanford_alpaca by tatsu-lab
  • Instruction-following LLaMA model training and data generation
  • 30k stars; created 2 years ago, updated 1 year ago