LESS by princeton-nlp

Data selection research paper for targeted instruction tuning

created 1 year ago
474 stars

Top 65.3% on sourcepulse

Project Summary

This repository provides code for LESS, a method for selecting influential training data for targeted instruction tuning of large language models. It is aimed at researchers and practitioners who want to fine-tune more efficiently. By identifying the training examples most relevant to a specific downstream task, it lets users fine-tune on a smaller subset, which can improve target-task performance and reduce training cost.

How It Works

LESS assigns each training example an "influence score" that estimates how much training on it would help a target task. First, a short "warmup" LoRA training run is performed on a small subset of the data. Gradients are then collected for every example in the training set and for validation data specific to the target task. LESS compares the two sets of gradients: training examples whose gradients align most closely with the validation gradients receive the highest scores, and the top-scoring examples are selected for fine-tuning.
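The gradient comparison can be illustrated with a minimal sketch. It is not the repository's implementation: it assumes the influence score is the cosine similarity between a training example's gradient and the averaged validation gradient, uses a single checkpoint, and treats gradients as plain lists of floats. The function names `influence_scores` and `select_top_fraction` are hypothetical.

```python
import math
from typing import List, Sequence


def _unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def influence_scores(
    train_grads: Sequence[Sequence[float]],
    val_grads: Sequence[Sequence[float]],
) -> List[float]:
    """Score each training gradient by cosine similarity to the mean validation gradient.

    Assumption: one checkpoint and one target task; this is a simplification
    chosen for illustration, not the scoring rule confirmed by the source.
    """
    dim = len(val_grads[0])
    mean_val = [sum(g[i] for g in val_grads) / len(val_grads) for i in range(dim)]
    target = _unit(mean_val)
    return [sum(a * b for a, b in zip(_unit(g), target)) for g in train_grads]


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the highest-scoring examples, keeping at least one."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

With toy gradients such as `[[1, 0], [0, 1], [-1, 0], [1, 1]]` for training and `[[2, 0], [4, 0]]` for validation, the first and fourth training examples point most nearly in the validation direction, so `select_top_fraction(scores, 0.5)` keeps those two.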

Quick Start & Requirements

  • Installation:
    1. pip3 install torch==2.1.2 torchvision torchaudio
    2. cd LESS
    3. pip install -r requirement.txt
    4. pip install -e .
  • Prerequisites: PyTorch (the installation step above pins version 2.1.2). The scripts use meta-llama/Llama-2-7b-hf as the base model, and data preparation follows the open-instruct repository.
  • Resources: Requires significant disk space for gradient storage and computational resources for training and gradient calculation.
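To get a rough sense of the disk space gradient storage might need, a back-of-the-envelope estimate can help. The sketch below assumes one fixed-size gradient feature vector is stored per training example per checkpoint. The function name and every number in the usage note are hypothetical placeholders, not values taken from the repository.

```python
def gradient_store_bytes(
    num_examples: int,
    feature_dim: int,
    num_checkpoints: int,
    bytes_per_value: int = 2,
) -> int:
    """Estimate bytes needed to store one gradient vector per example per checkpoint.

    Assumption: vectors are stored densely with a fixed width per value
    (2 bytes corresponds to a 16-bit float); metadata overhead is ignored.
    """
    return num_examples * feature_dim * num_checkpoints * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return num_bytes / 1024**3
```

For example, `to_gib(gradient_store_bytes(100_000, 8192, 4))` gives the estimate for a hypothetical 100,000 examples, 8,192-dimensional vectors, and 4 checkpoints. Substitute the actual dataset size and settings before planning storage.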

Highlighted Details

  • ICML 2024 paper.
  • Supports targeted instruction tuning for specific downstream tasks.
  • Utilizes LoRA for warmup and gradient calculation.
  • Includes scripts for data preparation, gradient extraction, data selection, and training.

Maintenance & Community

  • Bugs or questions can be reported via GitHub Issues or by emailing Mengzhou (mengzhou@princeton.edu).
  • The project is associated with Princeton NLP.

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial use or integration into closed-source projects.
  • The process involves multiple steps and significant computational overhead for gradient calculation and storage.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 32 stars in the last 90 days
