Instruction dataset for LLM fine-tuning
InstructionWild provides a large-scale, user-generated instruction dataset for training large language models (LLMs), aiming to replicate the diversity and quality of OpenAI's proprietary instruction dataset. It targets LLM researchers and developers seeking to improve model instruction-following capabilities beyond existing open-source datasets like Alpaca.
How It Works
The dataset is collected from user-shared ChatGPT prompts, focusing on real-world usage rather than synthetic generation. This approach yields a more diverse set of instructions, particularly in generation, open-ended question answering, and "mind storm" categories. The project also includes generating responses for these instructions using OpenAI's API, mirroring the Alpaca format for easy integration.
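The Alpaca-style record layout mentioned above can be sketched as follows. This is a minimal illustration, not code from the repository; the helper name `to_alpaca_record` is invented here, and only the field names (`instruction`, `input`, `output`) follow the Alpaca convention.

```python
import json

def to_alpaca_record(instruction: str, response: str, model_input: str = "") -> dict:
    """Shape a collected user instruction and its generated response
    into an Alpaca-style record (instruction / input / output)."""
    return {
        "instruction": instruction,
        "input": model_input,  # typically empty for user-shared prompts
        "output": response,
    }

record = to_alpaca_record(
    "Write a haiku about autumn.",
    "Leaves drift on cool wind...",
)
print(json.dumps(record, ensure_ascii=False))
```

Keeping this schema means the data can be dropped into existing Alpaca fine-tuning pipelines without conversion.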
Quick Start & Requirements
The collected instructions and their API-generated responses are available in the data and data v2 directories of the repository.
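Since the files follow the Alpaca JSON layout (a list of records), loading them is straightforward. A minimal sketch, assuming a list-of-records JSON file; the exact filenames under the data directories are not specified here, so the demo below runs on an inline sample instead.

```python
import json
import tempfile

def load_instructions(path: str) -> list[dict]:
    """Load an Alpaca-format JSON file: a list of records,
    each with instruction / input / output fields."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Sanity-check the expected schema.
    assert all("instruction" in r and "output" in r for r in records)
    return records

# Demo on a tiny sample written to a temporary file
# (substitute the actual path of a file from data/ or data v2/).
sample = [{"instruction": "Name three primes.", "input": "", "output": "2, 3, 5"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)

records = load_instructions(f.name)
print(len(records))
```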
Maintenance & Community
The project is maintained by Jinjie Ni, Fuzhao Xue, Kabir Jain, Mahir Hitesh Shah, Zangwei Zheng, and Prof. Yang You. Further suggestions are acknowledged from Prof. Aixin Sun and Dr. Tom Young.
Licensing & Compatibility
The repository does not explicitly state a license. Users are requested to cite the repository if data or code is used.
Limitations & Caveats
Models fine-tuned on the dataset may exhibit limitations in counting, logical reasoning, multilingual performance, summarization, multi-turn chat, role-playing, and self-recognition. They may also produce fabricated explanations when presented with false premises.