Cherry_LLM by tianyi-lab

Research paper for LLM instruction tuning via self-guided data selection

created 1 year ago
381 stars

Top 76.0% on sourcepulse

View on GitHub
Project Summary

This repository provides a self-guided methodology for Large Language Models (LLMs) to autonomously select high-quality instruction-tuning data, reducing the need for manual curation. It introduces the Instruction-Following Difficulty (IFD) score, which lets an LLM filter "cherry data" based on how difficult it finds each instruction to follow, benefiting researchers and practitioners who want to improve LLM performance with far less training data.

How It Works

The core approach involves a three-phase process: "Learning from Brief Experience," where the LLM is briefly trained on a small subset of the data; "Evaluating Based on Experience," where the novel IFD score quantifies each sample's difficulty by comparing the model's loss on the response with and without the instruction as context; and "Retraining from Self-Guided Experience," where the LLM is fine-tuned on the cherry-picked samples with high IFD scores. The method is self-guided and avoids reliance on external models for data quality assessment.
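Conceptually, the IFD score is the ratio of the model's average loss on the response when conditioned on the instruction to its loss on the response alone; a higher ratio means the instruction provides little help, i.e., the sample is harder to follow. Below is a minimal sketch of that idea using a Hugging Face causal LM; the checkpoint, function names, and prompt handling are illustrative assumptions, not the repository's actual API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the IFD idea: average loss on the response given the instruction,
# divided by the average loss on the response alone. All names and the
# checkpoint below are illustrative assumptions, not the repository's API.
MODEL = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


@torch.no_grad()
def avg_response_loss(prefix: str, response: str) -> float:
    """Average cross-entropy over response tokens, with the prefix masked out.
    (Prefix/response token-boundary handling is simplified for brevity.)"""
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + response, return_tensors="pt").input_ids.to(model.device)
    labels = full_ids.clone()
    labels[:, :prefix_len] = -100  # do not score prefix (or BOS) tokens
    return model(full_ids, labels=labels).loss.item()


def ifd_score(instruction: str, response: str) -> float:
    """IFD = loss(response | instruction) / loss(response); higher = harder to follow."""
    return avg_response_loss(instruction, response) / avg_response_loss("", response)


print(ifd_score("Translate to French: Good morning.", "Bonjour."))
```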

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt (or manually install tqdm, scikit-learn).
  • Requires a Hugging Face converted LLaMA checkpoint and tokenizer for initial data processing.
  • The cherry_seletion/data_analysis.py script handles initial data processing and score calculation; cherry_seletion/data_by_cluster.py or cherry_seletion/data_by_IFD.py then performs the data selection (a rough selection sketch follows this list).
  • Official quick-start and detailed usage examples are provided in the README.
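For orientation only, the selection step amounts to ranking samples by IFD and keeping a small top fraction; the paper also notes that samples whose IFD reaches or exceeds 1 are treated as anomalous and dropped before ranking. The sketch below assumes a JSON file with a per-sample "ifd_score" field, which is a hypothetical format; the repository's cherry_seletion scripts implement the actual pipeline.

```python
import json

# Hypothetical selection step: drop IFD >= 1, then keep the top 10% by IFD.
# File names and the "ifd_score" field are assumptions for illustration.
with open("alpaca_data_with_ifd.json") as f:
    samples = json.load(f)

keep_ratio = 0.10
valid = [s for s in samples if s["ifd_score"] < 1.0]  # IFD >= 1: instruction adds no signal
ranked = sorted(valid, key=lambda s: s["ifd_score"], reverse=True)
cherry_data = ranked[: int(len(ranked) * keep_ratio)]

with open("cherry_data.json", "w") as f:
    json.dump(cherry_data, f, indent=2)

print(f"Kept {len(cherry_data)} of {len(samples)} samples")
```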

Highlighted Details

  • Achieves performance comparable to models trained on the full Alpaca and WizardLM datasets while using only 5-10% of the data.
  • Demonstrates strong consistency in IFD scores across LLMs of different sizes, enabling efficient filtering with smaller models like GPT-2 (Superfiltering).
  • The IFD score itself offers insights into data types beneficial for instruction tuning.
  • Includes pre-trained models and curated datasets (cherry_data_v1, cherry_data_v2) for LLaMA and LLaMA 2.

Maintenance & Community

The project is associated with NAACL'24 and ACL'24 publications, indicating active research. Contact information for Ming Li is provided for questions.

Licensing & Compatibility

The repository does not explicitly state a license. The code is presented for research purposes, and commercial use would require clarification.

Limitations & Caveats

The README notes that some experiments were limited to 7B models due to hardware constraints, and that cherry_data_v2, based on LLaMA 2, was planned for release. The exact licensing status for commercial use is not specified.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

20 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), John Yang (author of SWE-bench, SWE-agent), and 13 more.

stanford_alpaca by tatsu-lab

Top 0.1% on sourcepulse
30k stars
Instruction-following LLaMA model training and data generation
created 2 years ago, updated 1 year ago