LESS by princeton-nlp

Data selection research paper for targeted instruction tuning

Created 2 years ago

512 stars

Top 61.2% on SourcePulse

4 Experts Love This Project

Yaowei Zheng

Author of LLaMA-Factory

Wing Lian

Founder of Axolotl AI

Travis Fischer

Founder of Agentic

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

This repository provides code for LESS, a method for selecting influential data for targeted instruction tuning of large language models. It is aimed at researchers and practitioners who want to make fine-tuning more efficient. Users identify the training examples most relevant to a specific downstream task and fine-tune on that subset, with the goal of improving task performance while reducing training cost.

How It Works

LESS assigns each training example an "influence score" that estimates how much training on it would help a target task. The pipeline has three stages. First, a "warmup" LoRA training run is performed on a small subset of the data. Second, gradients are collected for every example in the training dataset and for validation examples specific to the target task. Third, the training gradients are compared against the validation gradients to estimate each example's influence on the target task, and the highest-scoring examples are selected for fine-tuning.
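The comparison step can be illustrated with a small, self-contained sketch. It is not the repository's code: the function names are hypothetical, the gradients are toy vectors rather than LoRA gradients from a model, and cosine similarity against the mean validation gradient is assumed here as one plausible way to "compare" gradients.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def influence_scores(train_grads, val_grads):
    """Score each training gradient against the mean validation gradient.

    Assumption for illustration: the target task is summarized by the
    average of its validation gradients.
    """
    dim = len(val_grads[0])
    target = [sum(g[i] for g in val_grads) / len(val_grads) for i in range(dim)]
    return [cosine(g, target) for g in train_grads]


def select_top_fraction(scores, fraction):
    """Return indices of the highest-scoring examples, keeping at least one."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data: four training examples, two validation examples, 2-D gradients.
train_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
val_grads = [[1.0, 0.0], [1.0, 0.2]]

scores = influence_scores(train_grads, val_grads)
selected = select_top_fraction(scores, fraction=0.5)
print(selected)  # [0, 3]
```

Examples whose gradients point in roughly the same direction as the validation gradients (indices 0 and 3) rank highest, while the example pointing the opposite way (index 2) ranks last.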

Quick Start & Requirements

Installation:
1. pip3 install torch==2.1.2 torchvision torchaudio
2. cd LESS
3. pip install -r requirement.txt
4. pip install -e .
Prerequisites: PyTorch (pinned to 2.1.2 in the installation steps above). The scripts use meta-llama/Llama-2-7b-hf as the base model, and data preparation follows the open-instruct repository.
Resources: Stored gradients require significant disk space, and warmup training and gradient calculation require substantial compute.
Links: Quick Links, Data Preparation
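To get a feel for the disk-space requirement, a back-of-the-envelope estimate can help with planning. The sketch below is only arithmetic: the helper names are hypothetical, and the example count, per-example gradient dimension, bytes per value, and checkpoint count are illustrative assumptions, not figures taken from the README.

```python
def gradient_store_bytes(num_examples, grad_dim, bytes_per_value=2, num_checkpoints=1):
    """Bytes needed to store one gradient vector per example per checkpoint."""
    return num_examples * grad_dim * bytes_per_value * num_checkpoints


def to_gib(num_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# Illustrative numbers only: 100,000 examples, 8,192 stored values per
# example, 2 bytes per value (16-bit floats), 4 checkpoints.
total = gradient_store_bytes(100_000, 8192, bytes_per_value=2, num_checkpoints=4)
print(total)                    # 6553600000
print(round(to_gib(total), 2))  # 6.1
```

Storage grows linearly in each factor, so doubling the dataset size or the number of checkpoints doubles the space needed.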

Highlighted Details

Accompanies an ICML 2024 paper.
Supports targeted instruction tuning for specific downstream tasks.
Utilizes LoRA for warmup and gradient calculation.
Includes scripts for data preparation, gradient extraction, data selection, and training.

Maintenance & Community

Bugs or questions can be reported via GitHub Issues or by emailing Mengzhou (mengzhou@princeton.edu).
The project is associated with Princeton NLP.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

Without a stated license, the terms for commercial use or integration into closed-source projects are unclear.
The pipeline has multiple stages (warmup training, gradient extraction, data selection, and final training), and gradient calculation and storage add significant computational overhead.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

1 star in the last 30 days