Cleaned dataset for Alpaca LLM training
This repository provides a cleaned and curated version of the Alpaca dataset originally used to train Stanford's Alpaca LLM. It addresses issues found in the original data, such as hallucination-prompting instructions, merged instructions, empty outputs, and incorrect answers, with the goal of improving performance and reducing hallucinations in fine-tuned language models. The target audience is researchers and developers working with LLMs who need a higher-quality instruction-following dataset.
How It Works
The project cleans and curates the original Alpaca dataset by identifying and rectifying several common data quality issues. This includes removing instructions that prompt hallucinations (e.g., referencing external URLs), merging disparate instructions, filling in empty outputs, correcting wrong answers (especially in math problems), and clarifying nonsensical instructions. The cleaned dataset is intended to yield better results in fine-tuning LLMs compared to the original, noisy dataset.
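A minimal sketch of two of the filters described above, assuming the standard Alpaca JSON schema (instruction / input / output fields); the toy records and filter logic here are illustrative, not the project's actual cleaning pipeline:

```python
import re

# Toy records in the Alpaca schema: instruction / input / output.
records = [
    {"instruction": "Summarize the article at https://example.com/post",
     "input": "", "output": "The article argues..."},
    {"instruction": "Add 2 and 3.", "input": "", "output": ""},
    {"instruction": "What is 7 * 6?", "input": "", "output": "42"},
]

URL_RE = re.compile(r"https?://\S+")

def is_clean(rec):
    """Drop entries that prompt hallucinations or have empty outputs."""
    if URL_RE.search(rec["instruction"]) or URL_RE.search(rec["input"]):
        return False  # the model cannot browse, so URL references invite hallucination
    if not rec["output"].strip():
        return False  # an empty output teaches the model nothing
    return True

cleaned = [r for r in records if is_clean(r)]
```

Only the third record survives both filters; the real dataset applies further corrections (wrong math answers, merged or nonsensical instructions) that require manual review rather than a simple predicate.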
Quick Start & Requirements
A max_prompt_length of 512 or higher is recommended for fine-tuning, because the cleaned dataset's average prompt length is longer than the original's.
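To illustrate why the length limit matters, here is a sketch that builds a prompt in the Alpaca style and checks it against the limit. The template wording follows Stanford Alpaca's no-input variant as commonly reproduced, and the whitespace split is a crude stand-in for a real tokenizer, so treat both as assumptions:

```python
# Alpaca-style prompt template (no-input variant); exact wording is an
# assumption based on the commonly reproduced Stanford Alpaca format.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

MAX_PROMPT_LENGTH = 512  # recommended minimum for this dataset

def build_prompt(instruction: str) -> str:
    return PROMPT_TEMPLATE.format(instruction=instruction)

def fits(instruction: str) -> bool:
    # Whitespace split is a crude proxy for token count; real fine-tuning
    # would measure length with the model's own tokenizer.
    return len(build_prompt(instruction).split()) <= MAX_PROMPT_LENGTH

prompt = build_prompt("Name three primary colors.")
```

In practice you would tokenize with the target model's tokenizer and truncate or drop examples that exceed the limit.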
Maintenance & Community
Licensing & Compatibility
The dataset is licensed for non-commercial use only, limiting its application in commercial products.
Limitations & Caveats
While curation is ongoing, some issues may still exist within the ~52k entries.