InstructionWild by XueFuzhao

Instruction dataset for LLM fine-tuning

Created 2 years ago · 459 stars · Top 66.9% on sourcepulse

Project Summary

InstructionWild provides a large-scale, user-generated instruction dataset for training large language models (LLMs), aiming to replicate the diversity and quality of OpenAI's proprietary instruction dataset. It targets LLM researchers and developers seeking to improve model instruction-following capabilities beyond existing open-source datasets like Alpaca.

How It Works

The dataset is collected from user-shared ChatGPT prompts, focusing on real-world usage rather than synthetic generation. This approach yields a more diverse set of instructions, particularly in the generation, open-ended question answering, and "mind storm" categories. The project also generates responses for these instructions with OpenAI's API, mirroring the Alpaca format for easy integration.
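
As a rough illustration of that step, the sketch below generates responses for collected instructions with the openai Python SDK (v1+) and stores them as Alpaca-style records. The file names, model choice, and field names are assumptions for illustration, not the project's actual scripts.

    # Sketch only: generate responses for collected instructions via the OpenAI API
    # and store them as Alpaca-style {"instruction", "input", "output"} records.
    # File names, model choice, and field names are assumptions.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("instructions.json", encoding="utf-8") as f:   # hypothetical input file
        instructions = json.load(f)

    records = []
    for item in instructions:
        prompt = item["instruction"]
        if item.get("input"):
            prompt += "\n\n" + item["input"]
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",                           # model choice is an assumption
            messages=[{"role": "user", "content": prompt}],
        )
        records.append({
            "instruction": item["instruction"],
            "input": item.get("input", ""),
            "output": resp.choices[0].message.content,
        })

    with open("responses_alpaca.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)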

Quick Start & Requirements

  • Data is available in the data and data v2 directories.
  • No specific installation commands are provided, as this is a dataset release.
  • Requires a Python environment for data processing and LLM fine-tuning (see the loading sketch after this list).
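
A minimal sketch of loading one of the released JSON splits with the standard library; the file path below is an assumption, so check the data and data v2 directories for the actual file names.

    # Sketch only: load an InstructionWild JSON split and inspect a record.
    # The file path is an assumption; see the repository's data directories.
    import json

    with open("data/instinwild_en.json", encoding="utf-8") as f:
        dataset = json.load(f)

    print(len(dataset), "instructions loaded")
    print(dataset[0])  # expect Alpaca-style keys such as "instruction" and "output"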

Highlighted Details

  • InstructionWild v2 contains over 110K user-based instructions, with no self-instruct generation.
  • Includes fine-grained labels for instruction types and special tags on a subset of v2 (see the label-tally sketch after this list).
  • English and Chinese versions are available; v1 provides 52K instructions in each language.
  • Fine-tuning models such as ColossalChat-7B on the dataset demonstrates improvements in generation, open QA, and mind storm capabilities.
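
If the labeled v2 subset is a JSON list in which each record carries a label field, a quick distribution check might look like the sketch below; the file name and the "label" key are hypothetical placeholders.

    # Sketch only: tally instruction-type labels in the annotated v2 subset.
    # The file name and the "label" key are hypothetical placeholders.
    import json
    from collections import Counter

    with open("data_v2/labeled_subset.json", encoding="utf-8") as f:
        labeled = json.load(f)

    counts = Counter(item.get("label", "unlabeled") for item in labeled)
    for label, n in counts.most_common():
        print(f"{label}: {n}")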

Maintenance & Community

The project is maintained by Jinjie Ni, Fuzhao Xue, Kabir Jain, Mahir Hitesh Shah, Zangwei Zheng, and Prof. Yang You, with additional suggestions acknowledged from Prof. Aixin Sun and Dr. Tom Young.

Licensing & Compatibility

The repository does not explicitly state a license. Users are requested to cite the repository if data or code is used.

Limitations & Caveats

Models fine-tuned on the dataset may exhibit limitations in counting, logical reasoning, multilingual performance, summarization, multi-turn chat, role-playing, and self-recognition. They may also fabricate explanations when presented with false premises.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days
