Cherry_LLM by tianyi-lab

Research paper for LLM instruction tuning via self-guided data selection

created 1 year ago
381 stars

Top 76.0% on sourcepulse

View on GitHub
Project Summary

This repository provides a self-guided methodology for Large Language Models (LLMs) to autonomously select high-quality instruction-tuning data, reducing the need for manual curation. It introduces the Instruction-Following Difficulty (IFD) score, which lets an LLM filter "cherry data" based on how difficult it finds each instruction to follow, benefiting researchers and practitioners who want to improve LLM performance with far less training data.

How It Works

The core approach involves a three-phase process: "Learning from Brief Experience," where the LLM is briefly trained on a small subset of the data; "Evaluating Based on Experience," where the novel IFD score quantifies each sample's difficulty by comparing the model's loss on the response with and without the instruction as context; and "Retraining from Self-Guided Experience," where the LLM is fine-tuned on the cherry-picked samples with high IFD scores. The method is self-guided and avoids reliance on external models for data quality assessment.
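Conceptually, the IFD score is the ratio of the model's average loss on the response when conditioned on the instruction to its loss on the response alone; a higher ratio means the instruction provides little help, i.e., the sample is harder to follow. Below is a minimal sketch of that idea using a Hugging Face causal LM; the checkpoint, function names, and prompt handling are illustrative assumptions, not the repository's actual API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the IFD idea: average loss on the response given the instruction,
# divided by the average loss on the response alone. All names and the
# checkpoint below are illustrative assumptions, not the repository's API.
MODEL = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


@torch.no_grad()
def avg_response_loss(prefix: str, response: str) -> float:
    """Average cross-entropy over response tokens, with the prefix masked out.
    (Prefix/response token-boundary handling is simplified for brevity.)"""
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + response, return_tensors="pt").input_ids.to(model.device)
    labels = full_ids.clone()
    labels[:, :prefix_len] = -100  # do not score prefix (or BOS) tokens
    return model(full_ids, labels=labels).loss.item()


def ifd_score(instruction: str, response: str) -> float:
    """IFD = loss(response | instruction) / loss(response); higher = harder to follow."""
    return avg_response_loss(instruction, response) / avg_response_loss("", response)


print(ifd_score("Translate to French: Good morning.", "Bonjour."))
```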

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt (or manually install tqdm, scikit-learn).
  • Requires a Hugging Face converted LLaMA checkpoint and tokenizer for initial data processing.
  • The cherry_seletion/data_analysis.py script handles initial data processing and score calculation; cherry_seletion/data_by_cluster.py or cherry_seletion/data_by_IFD.py then performs the data selection (a rough selection sketch follows this list).
  • Official quick-start and detailed usage examples are provided in the README.
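For orientation only, the selection step amounts to ranking samples by IFD and keeping a small top fraction; the paper also notes that samples whose IFD reaches or exceeds 1 are treated as anomalous and dropped before ranking. The sketch below assumes a JSON file with a per-sample "ifd_score" field, which is a hypothetical format; the repository's cherry_seletion scripts implement the actual pipeline.

```python
import json

# Hypothetical selection step: drop IFD >= 1, then keep the top 10% by IFD.
# File names and the "ifd_score" field are assumptions for illustration.
with open("alpaca_data_with_ifd.json") as f:
    samples = json.load(f)

keep_ratio = 0.10
valid = [s for s in samples if s["ifd_score"] < 1.0]  # IFD >= 1: instruction adds no signal
ranked = sorted(valid, key=lambda s: s["ifd_score"], reverse=True)
cherry_data = ranked[: int(len(ranked) * keep_ratio)]

with open("cherry_data.json", "w") as f:
    json.dump(cherry_data, f, indent=2)

print(f"Kept {len(cherry_data)} of {len(samples)} samples")
```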

Highlighted Details

  • Achieves performance comparable to models trained on the full Alpaca and WizardLM datasets while using only 5-10% of the data.
  • Demonstrates strong consistency in IFD scores across LLMs of different sizes, enabling efficient filtering with smaller models like GPT-2 (Superfiltering).
  • The IFD score itself offers insights into data types beneficial for instruction tuning.
  • Includes pre-trained models and curated datasets (cherry_data_v1, cherry_data_v2) for LLaMA and LLaMA 2.

Maintenance & Community

The project is associated with NAACL'24 and ACL'24 publications, indicating active research. Contact information for Ming Li is provided for questions.

Licensing & Compatibility

The repository does not explicitly state a license. The code is presented for research purposes, and commercial use would require clarification.

Limitations & Caveats

The README notes that some experiments were limited to 7B models due to hardware constraints, and that cherry_data_v2, based on LLaMA 2, was planned for release. The exact licensing status for commercial use is not specified.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

20 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), John Yang (author of SWE-bench, SWE-agent), and 13 more.

stanford_alpaca by tatsu-lab

Top 0.1% on sourcepulse
30k stars
Instruction-following LLaMA model training and data generation
created 2 years ago, updated 1 year ago