Awesome-instruction-tuning by zhilizju

Curated list of instruction tuning resources

created 2 years ago
335 stars

Top 83.1% on sourcepulse

Project Summary

This repository serves as a curated collection of resources for instruction tuning in large language models, targeting researchers and developers in NLP. It provides a comprehensive overview of datasets, models, papers, and repositories, aiming to facilitate the development and understanding of instruction-following capabilities in AI.

How It Works

The project categorizes instruction tuning resources, distinguishing between datasets derived from traditional NLP tasks and those generated by large language models (LLMs) themselves. It highlights key datasets like UnifiedQA, CrossFit, and Flan, along with models such as UnifiedQA, BART-CrossFit, and Flan-T5, detailing their origins, sizes, and task counts. A significant contribution is the inclusion of multilingual translation tools, enabling the adaptation of English datasets to over 100 languages using Helsinki-NLP models.

Quick Start & Requirements

  • Multilingual Translation: python translator.py <model_name> <source_data_path> (e.g., python translator.py Helsinki-NLP/opus-mt-en-zh alpaca_data.json)
  • Data Cleaning: python process.py <unprocessed_data_path> (e.g., python process.py translated_data.json)
  • Prerequisites: Python, Helsinki-NLP models (e.g., Helsinki-NLP/opus-mt-en-zh), Alpaca data format.
  • Caveats: Translation quality varies with the model; some translations need post-processing because of model limitations (e.g., repeated words, sentence-length problems).
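
The translation step above can be sketched as follows. This is a minimal illustration, not the repository's translator.py: the function names (translate_records, make_marian_translator) are hypothetical, and it assumes the Alpaca format is a JSON list of records with "instruction", "input", and "output" fields. The Marian helper assumes the transformers, torch, and sentencepiece packages are installed and downloads the model on first use.

```python
import json
from typing import Callable, Dict, List

# Text fields of an Alpaca-style record (assumed layout, see lead-in).
ALPACA_FIELDS = ("instruction", "input", "output")


def translate_records(
    records: List[Dict[str, str]],
    translate_batch: Callable[[List[str]], List[str]],
) -> List[Dict[str, str]]:
    """Return copies of the records with their non-empty text fields translated."""
    translated = []
    for record in records:
        new_record = dict(record)  # leave the caller's data untouched
        fields = [f for f in ALPACA_FIELDS if record.get(f)]
        if fields:
            results = translate_batch([record[f] for f in fields])
            for field, text in zip(fields, results):
                new_record[field] = text
        translated.append(new_record)
    return translated


def make_marian_translator(model_name: str) -> Callable[[List[str]], List[str]]:
    """Build a batch translator from a Helsinki-NLP checkpoint (hypothetical helper)."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(texts: List[str]) -> List[str]:
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch


if __name__ == "__main__":
    # File names mirror the examples above; the output name is an assumption.
    with open("alpaca_data.json", encoding="utf-8") as f:
        data = json.load(f)
    translator = make_marian_translator("Helsinki-NLP/opus-mt-en-zh")
    with open("translated_data.json", "w", encoding="utf-8") as f:
        json.dump(translate_records(data, translator), f, ensure_ascii=False, indent=2)
```

Passing the translator in as a callable keeps the record handling independent of the model, so a different checkpoint (or a stub for testing) can be swapped in without touching the loop.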

Highlighted Details

  • Comprehensive tables detailing instruction tuning datasets and models from 2020 to 2023.
  • A Python script for translating English datasets into over 100 languages.
  • A curated list of influential research papers on instruction tuning and related topics.
  • A collection of relevant repositories for further exploration, including frameworks like OpenICL.

Maintenance & Community

The repository is community-driven, with contributions from various individuals. It links to related projects and concepts like ICL, prompt-in-context-learning, and Chain-of-Thoughts. Specific community channels or active maintainer information are not detailed in the README.

Licensing & Compatibility

The README does not explicitly state a license for the curated list or the provided scripts. Users should verify the licensing of individual datasets, models, and repositories referenced within the collection. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The translation tool's output quality depends on the Helsinki-NLP model used; the output may contain noise and may require post-processing. Because the README specifies no license for the curated content or the scripts, suitability for commercial applications is uncertain.
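
As an illustration of the kind of post-processing this implies, the sketch below collapses runs of a repeated word and drops records that are empty or implausibly long. It is not the repository's process.py: the function names, the three-in-a-row threshold, and the max_chars limit are all assumptions chosen for the example, and the repeat detection only works for whitespace-delimited text.

```python
import re
from typing import Dict, List

# A word followed by two or more repeats of itself (three or more in a row).
# Assumption: words are separated by whitespace, so this will not catch
# repetition in languages written without spaces.
_REPEATED_WORD = re.compile(r"\b(\w+)(\s+\1\b){2,}", flags=re.IGNORECASE)


def collapse_repeats(text: str) -> str:
    """Replace a word repeated three or more times in a row with one occurrence."""
    return _REPEATED_WORD.sub(r"\1", text)


def clean_records(records: List[Dict[str, str]], max_chars: int = 2000) -> List[Dict[str, str]]:
    """Collapse repeats, then drop records with no instruction/output or oversized fields."""
    cleaned = []
    for record in records:
        fixed = {
            key: collapse_repeats(value).strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # A record without an instruction or an output is unusable for tuning.
        if not fixed.get("instruction") or not fixed.get("output"):
            continue
        # max_chars is an arbitrary illustrative cutoff for runaway generations.
        if any(isinstance(v, str) and len(v) > max_chars for v in fixed.values()):
            continue
        cleaned.append(fixed)
    return cleaned
```

Thresholds like these are worth tuning per target language, since legitimate repetition and typical sentence length differ between languages.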

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
