Awesome-instruction-tuning by zhilizju

Curated list of instruction tuning resources

Created 2 years ago

345 stars

Top 80.3% on SourcePulse

1 Expert Loves This Project

Shawn Wang

Editor of Latent Space

Project Summary

This repository serves as a curated collection of resources for instruction tuning in large language models, targeting researchers and developers in NLP. It provides a comprehensive overview of datasets, models, papers, and repositories, aiming to facilitate the development and understanding of instruction-following capabilities in AI.

How It Works

The project categorizes instruction tuning resources, distinguishing between datasets derived from traditional NLP tasks and those generated by large language models (LLMs) themselves. It highlights key datasets like UnifiedQA, CrossFit, and Flan, along with models such as UnifiedQA, BART-CrossFit, and Flan-T5, detailing their origins, sizes, and task counts. A significant contribution is the inclusion of multilingual translation tools, enabling the adaptation of English datasets to over 100 languages using Helsinki-NLP models.

Quick Start & Requirements

Multilingual Translation: python translator.py <model_name> <source_data_path> (e.g., python translator.py Helsinki-NLP/opus-mt-en-zh alpaca_data.json)
Data Cleaning: python process.py <unprocessed_data_path> (e.g., python process.py translated_data.json)
Prerequisites: Python, a Helsinki-NLP translation model (e.g., Helsinki-NLP/opus-mt-en-zh), and source data in the Alpaca format.
Caveats: Translation quality varies with the model. Some output needs post-processing because of model limitations (e.g., repeated words, sentence-length problems).
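
The translation step above can be sketched roughly as follows. This is a minimal sketch, not the repository's translator.py: it assumes Alpaca-format records with "instruction", "input", and "output" fields, and the names translate_records and make_marian_translator are hypothetical. The second helper assumes the Hugging Face transformers package is installed and uses its translation pipeline to load a Helsinki-NLP model.

```python
from typing import Callable, Dict, List

# Field names of the Alpaca data format (assumed: instruction / input / output).
ALPACA_FIELDS = ("instruction", "input", "output")


def translate_records(
    records: List[Dict[str, str]],
    translate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return copies of the records with every non-empty Alpaca field translated.

    `translate` is any callable mapping source text to target text, so the
    function can be exercised with a stub instead of a real model.
    """
    translated = []
    for record in records:
        new_record = dict(record)  # copy, so the input list is left untouched
        for field in ALPACA_FIELDS:
            text = record.get(field, "")
            if text:  # many Alpaca records have an empty "input" field
                new_record[field] = translate(text)
        translated.append(new_record)
    return translated


def make_marian_translator(model_name: str) -> Callable[[str], str]:
    """Hypothetical helper: wrap a Helsinki-NLP model as a translate callable.

    Requires the `transformers` package; the model is downloaded on first use.
    """
    from transformers import pipeline  # imported lazily so the stub path needs no extra packages

    translator = pipeline("translation", model=model_name)
    return lambda text: translator(text)[0]["translation_text"]
```

With a model available, translate_records(records, make_marian_translator("Helsinki-NLP/opus-mt-en-zh")) would translate each record. Passing the translator as a callable keeps the record handling testable without downloading a model.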

Highlighted Details

Comprehensive tables detailing instruction tuning datasets and models from 2020 to 2023.
Inclusion of multilingual capabilities with a Python script for translating datasets into over 100 languages.
A curated list of influential research papers on instruction tuning and related topics.
A collection of relevant repositories for further exploration, including frameworks like OpenICL.

Maintenance & Community

The repository is community-driven, with contributions from various individuals. It links to related projects and concepts such as ICL, prompt-in-context-learning, and Chain-of-Thoughts. The README identifies neither community channels nor active maintainers.

Licensing & Compatibility

The README does not explicitly state a license for the curated list or the provided scripts. Users should verify the licensing of individual datasets, models, and repositories referenced within the collection. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The translation tool's output quality depends on the Helsinki-NLP model used; translations may be noisy and require post-processing. Because the README states no license for the curated content or the scripts, commercial use carries unresolved licensing risk.
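
The kind of post-processing mentioned above could look something like the following. This is a minimal sketch under stated assumptions, not the repository's process.py: the function names, the whitespace tokenization, and the ratio thresholds are all hypothetical choices for illustration. Whitespace tokenization in particular does not suit languages written without spaces, such as Chinese.

```python
def collapse_repeated_words(text: str, max_repeats: int = 1) -> str:
    """Drop tokens that repeat consecutively more than `max_repeats` times.

    Tokens are split on whitespace (an assumption; unsuitable for unsegmented scripts).
    """
    kept = []
    run = 0  # length of the current run of identical tokens
    for token in text.split():
        if kept and token == kept[-1]:
            run += 1
        else:
            run = 1
        if run <= max_repeats:
            kept.append(token)
    return " ".join(kept)


def length_ratio_ok(source: str, target: str,
                    low: float = 0.2, high: float = 5.0) -> bool:
    """Flag translations whose character length is implausible relative to the source.

    The thresholds are illustrative defaults, not values taken from the repository.
    """
    if not source or not target:
        # An empty field should stay empty; anything else is suspicious.
        return source == target
    ratio = len(target) / len(source)
    return low <= ratio <= high
```

A record failing length_ratio_ok could be dropped or queued for manual review. Which choice is appropriate depends on how much data loss the downstream fine-tuning can tolerate.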

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

1 star in the last 30 days