AlpacaDataCleaned by gururise

Cleaned dataset for Alpaca LLM training

Created 2 years ago

1,581 stars

Top 26.3% on SourcePulse

9 Experts Love This Project

Tobi Lutke, Cofounder of Shopify
Omar Sanseviero, DevRel at Google DeepMind
Junyang Lin, Core Maintainer at Alibaba Qwen
Wing Lian, Founder of Axolotl AI
and 5 more

Project Summary

This repository provides a cleaned and curated version of the Alpaca dataset, originally used to train Stanford's Alpaca LLM. It addresses issues found in the original dataset, such as instructions that induce hallucinations, merged instructions, empty outputs, and incorrect answers, with the aim of improving performance and reducing hallucinations in fine-tuned language models. It is intended for researchers and developers working with LLMs who need a higher-quality instruction-following dataset.

How It Works

The project cleans and curates the original Alpaca dataset by identifying and fixing several recurring data quality issues. The fixes include removing instructions that prompt hallucinations (e.g., those referencing external URLs), separating instructions that were incorrectly merged together, filling in empty outputs, correcting wrong answers (especially in math problems), and clarifying nonsensical instructions. The cleaned dataset is intended to yield better fine-tuning results than the original, noisy dataset.
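As a rough illustration of the kind of checks described above, the following Python sketch flags two of the listed issues: empty outputs and instructions that reference external URLs. It assumes records use the `instruction` / `input` / `output` fields of the Alpaca format. The `flag_issues` helper and the URL pattern are hypothetical and are not taken from the repository's own tooling.

```python
import re

# Hypothetical pattern: treats anything that looks like a web address as a
# reference to external content the model cannot actually read.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def flag_issues(record):
    """Return a list of issue labels for one Alpaca-style record.

    Assumes the record is a dict with 'instruction', 'input' and 'output'
    string fields; missing fields are treated as empty strings.
    """
    issues = []
    instruction = record.get("instruction", "")
    context = record.get("input", "")
    output = record.get("output", "")

    # An output that is empty or whitespace-only gives the model nothing to learn.
    if not output.strip():
        issues.append("empty_output")

    # Instructions that point at a URL invite the model to invent its contents.
    if URL_PATTERN.search(instruction) or URL_PATTERN.search(context):
        issues.append("references_url")

    return issues


sample_records = [
    {"instruction": "Summarize the article at https://example.com/news",
     "input": "", "output": "The article discusses..."},
    {"instruction": "Name three primary colors.", "input": "", "output": "   "},
    {"instruction": "Translate 'hello' to French.", "input": "", "output": "Bonjour"},
]

for record in sample_records:
    print(flag_issues(record), "-", record["instruction"])
```

Checks like these only surface candidates for review. Wrong answers and nonsensical instructions generally need manual or model-assisted inspection.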

Quick Start & Requirements

The dataset is available on the Hugging Face Hub.
Fine-tuned LoRA models (7B and 13B) are available on Hugging Face.
The recommended max_prompt_length for fine-tuning is 512 or higher, because the average prompt length is greater than in the original dataset.
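To make the prompt-length recommendation concrete, the sketch below builds a prompt from one record and checks it against a limit of 512. It assumes the widely used Stanford Alpaca prompt template and the `instruction` / `input` fields. The helper names are hypothetical, and the default whitespace word count is only a crude stand-in for a real tokenizer, which would normally report more tokens than words.

```python
# Assumed templates, following the commonly used Stanford Alpaca prompt format.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record):
    """Format one Alpaca-style record into a training prompt."""
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


def whitespace_token_count(text):
    """Crude proxy for token count; swap in a real tokenizer for actual use."""
    return len(text.split())


def exceeds_limit(record, max_prompt_length=512, count_tokens=whitespace_token_count):
    """Return True if the formatted prompt is longer than max_prompt_length."""
    return count_tokens(build_prompt(record)) > max_prompt_length


short_record = {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
long_record = {"instruction": "word " * 600, "input": "", "output": ""}

print(exceeds_limit(short_record))  # short prompt, well under the limit
print(exceeds_limit(long_record))   # 600+ words, over the limit
```

Passing a tokenizer-based `count_tokens` callable gives a count that matches what the fine-tuning script would actually truncate.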

Highlighted Details

Benchmarks show improved performance on WikiText and MNLI metrics compared to the original Alpaca dataset.
The cleaned dataset shows a lower hallucination rate on the HALTT4LLM benchmark.
Approximately 80% of math problems in the original dataset had incorrect answers; these have been addressed.
The dataset is US-centric, as noted for the original data generation process.

Maintenance & Community

Contributions are encouraged via pull requests to address remaining issues.
Compute resources were donated by Q-Blocks Cloud.

Licensing & Compatibility

Code and tools are licensed under Apache-2.0.
The dataset itself is licensed under CC BY-NC 4.0, which restricts commercial use. Models trained on this dataset should also be used for research purposes only.

Limitations & Caveats

The dataset is licensed for non-commercial use only, limiting its application in commercial products. While curation is ongoing, some issues may still exist within the ~52k entries.

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive

Star History

3 stars in the last 30 days