InstructionWild by XueFuzhao

Instruction dataset for LLM fine-tuning

Created 2 years ago

459 stars

Top 65.9% on SourcePulse

View on GitHub

1 Expert Loves This Project

Junyang Lin

Core Maintainer at Alibaba Qwen

Project Summary

InstructionWild provides a large-scale, user-generated instruction dataset for training large language models (LLMs), aiming to replicate the diversity and quality of OpenAI's proprietary instruction dataset. It targets LLM researchers and developers seeking to improve model instruction-following capabilities beyond existing open-source datasets like Alpaca.

How It Works

The dataset is collected from user-shared ChatGPT prompts, focusing on real-world usage rather than synthetic generation. This approach yields a more diverse set of instructions, particularly in generation, open-ended question answering, and "mind storm" categories. The project also includes generating responses for these instructions using OpenAI's API, mirroring the Alpaca format for easy integration.

Quick Start & Requirements

Data is available in the data and data v2 directories.
No specific installation commands are provided, as this is a dataset release.
Requires Python environment for data processing and LLM fine-tuning.

Highlighted Details

InstructionWild v2 contains over 110K user-based instructions, with no self-instruct generation.
Includes fine-grained labeling for instruction types and special tags in a subset of v2.
English and Chinese versions are available, with 52K instructions each in v2.
Demonstrated improvement in generation, open QA, and mind storm capabilities when used to fine-tune models like ColossalChat-7B.

Maintenance & Community

The project is maintained by Jinjie Ni, Fuzhao Xue, Kabir Jain, Mahir Hitesh Shah, Zangwei Zheng, and Prof. Yang You. Further suggestions are acknowledged from Prof. Aixin Sun and Dr. Tom Young.

Licensing & Compatibility

The repository does not explicitly state a license. Users are requested to cite the repository if data or code is used.

Limitations & Caveats

The dataset and models fine-tuned on it may exhibit limitations in counting ability, logical reasoning, multilingual performance, summarization, multi-turn chat, role-playing, and self-recognition. There's also a potential for generating fabricated explanations when presented with false facts.

Health Check

Last Commit

1 year ago

Responsiveness

1 day

Pull Requests (30d)

Issues (30d)

Star History

0 stars in the last 30 days