Awesome-instruction-tuning by zhilizju

Curated list of instruction tuning resources

Created 2 years ago
338 stars

Top 81.5% on SourcePulse

Project Summary

This repository serves as a curated collection of resources for instruction tuning in large language models, targeting researchers and developers in NLP. It provides a comprehensive overview of datasets, models, papers, and repositories, aiming to facilitate the development and understanding of instruction-following capabilities in AI.

How It Works

The project categorizes instruction tuning resources, distinguishing between datasets derived from traditional NLP tasks and those generated by large language models (LLMs) themselves. It highlights key datasets like UnifiedQA, CrossFit, and Flan, along with models such as UnifiedQA, BART-CrossFit, and Flan-T5, detailing their origins, sizes, and task counts. A significant contribution is the inclusion of multilingual translation tools, enabling the adaptation of English datasets to over 100 languages using Helsinki-NLP models.

Quick Start & Requirements

  • Multilingual Translation: python translator.py <model_name> <source_data_path> (e.g., python translator.py Helsinki-NLP/opus-mt-en-zh alpaca_data.json)
  • Data Cleaning: python process.py <unprocessed_data_path> (e.g., python process.py translated_data.json)
  • Prerequisites: Python, a Helsinki-NLP translation model (e.g., Helsinki-NLP/opus-mt-en-zh), and input data in the Alpaca format.
  • Translation quality: Output varies by model, and some translations may need post-processing because of model limitations (e.g., repeated words, sentence-length issues).
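
The translation step can be pictured roughly as follows. This is a minimal sketch, not the repository's translator.py, whose internals are not described here. It assumes Alpaca-style records with "instruction", "input", and "output" fields; the function names translate_records and make_marian_translator are hypothetical. The model calls use the MarianMT classes from the Hugging Face transformers library, which is an assumed dependency.

```python
from typing import Callable, Dict, List

# Text fields of the Alpaca data format (assumed record layout).
ALPACA_FIELDS = ("instruction", "input", "output")


def translate_records(
    records: List[Dict[str, str]],
    translate: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[Dict[str, str]]:
    """Translate the text fields of Alpaca-style records; empty fields are left untouched."""
    # Collect (record index, field name) slots that hold non-empty text.
    slots = [
        (i, field)
        for i, rec in enumerate(records)
        for field in ALPACA_FIELDS
        if (rec.get(field) or "").strip()
    ]
    translated = [dict(rec) for rec in records]  # copy so the input is not mutated
    for start in range(0, len(slots), batch_size):
        batch = slots[start:start + batch_size]
        texts = [records[i][field] for i, field in batch]
        for (i, field), text in zip(batch, translate(texts)):
            translated[i][field] = text
    return translated


def make_marian_translator(model_name: str) -> Callable[[List[str]], List[str]]:
    """Build a batch translator from a MarianMT checkpoint (requires `transformers` and PyTorch)."""
    from transformers import MarianMTModel, MarianTokenizer  # imported lazily

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(texts: List[str]) -> List[str]:
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate
```

Under these assumptions, loading alpaca_data.json with json.load and calling translate_records(records, make_marian_translator("Helsinki-NLP/opus-mt-en-zh")) would yield translated records. Passing the translator as a function keeps the record handling testable without downloading a model.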

Highlighted Details

  • Comprehensive tables detailing instruction tuning datasets and models from 2020 to 2023.
  • Inclusion of multilingual capabilities with a Python script for translating datasets into over 100 languages.
  • A curated list of influential research papers on instruction tuning and related topics.
  • A collection of relevant repositories for further exploration, including frameworks like OpenICL.

Maintenance & Community

The repository is community-driven, with contributions from various individuals. It links to related projects and concepts like ICL, prompt-in-context-learning, and Chain-of-Thoughts. Specific community channels or active maintainer information are not detailed in the README.

Licensing & Compatibility

The README does not explicitly state a license for the curated list or the provided scripts. Users should verify the licensing of individual datasets, models, and repositories referenced within the collection. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The translation tool's output quality depends on the Helsinki-NLP model used; the output may be noisy and require post-processing. Because the README states no license for the curated content or the scripts, commercial use may raise compatibility issues.
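
The kind of post-processing mentioned above could look like the sketch below. It is an illustration under stated assumptions, not the repository's process.py. The helper names collapse_repeats and clean_records, the max_chars threshold, and the rule of dropping records with an empty instruction or output are all hypothetical choices. The repeat heuristic only catches whitespace-separated words, so it would not help for languages written without spaces, and it can collapse legitimate repetitions.

```python
import re
from typing import Dict, List

# Matches a word followed by one or more whitespace-separated copies of itself.
_REPEATED_WORD = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)


def collapse_repeats(text: str) -> str:
    """Collapse runs of the same word, keeping the first occurrence."""
    return _REPEATED_WORD.sub(r"\1", text)


def clean_records(records: List[Dict[str, str]], max_chars: int = 2000) -> List[Dict[str, str]]:
    """Collapse repeated words, then drop records that are empty or too long."""
    cleaned = []
    for rec in records:
        rec = {k: collapse_repeats(v) if isinstance(v, str) else v for k, v in rec.items()}
        # Assumption: a usable record needs a non-empty instruction and output.
        if not rec.get("instruction") or not rec.get("output"):
            continue
        # Assumption: very long fields are treated as translation failures.
        if any(len(v) > max_chars for v in rec.values() if isinstance(v, str)):
            continue
        cleaned.append(rec)
    return cleaned
```

The threshold and drop rules would need tuning per language pair; inspecting the dropped records before discarding them is a reasonable precaution.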

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
