AlpacaDataCleaned by gururise

Cleaned dataset for Alpaca LLM training

created 2 years ago
1,563 stars

Top 27.3% on sourcepulse

Project Summary

This repository provides a cleaned and curated version of the Alpaca dataset, originally used for training Stanford's Alpaca LLM. It addresses issues like hallucinations, merged instructions, empty outputs, and incorrect answers found in the original dataset, aiming to improve the performance and reduce hallucinations in fine-tuned language models. The target audience includes researchers and developers working with LLMs who need a higher-quality instruction-following dataset.

How It Works

The project cleans and curates the original Alpaca dataset by identifying and fixing several recurring data quality issues. This includes removing instructions that prompt hallucinations (e.g., those referencing external URLs), splitting instructions that had been merged together, filling in empty outputs, correcting wrong answers (especially in math problems), and clarifying nonsensical instructions. The cleaned dataset is intended to yield better fine-tuning results than the original, noisy dataset.
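Two of the checks described above, empty outputs and instructions that reference external URLs, can be sketched as simple heuristics. This is an illustrative sketch only: the function name `find_issues`, the URL pattern, and the sample records are assumptions, not the project's actual cleaning code. It assumes records use the Alpaca field names `instruction`, `input`, and `output`.

```python
import re

# Hypothetical heuristic: the project's real cleaning rules are not shown in this summary.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def find_issues(record):
    """Return a list of issue labels for one Alpaca-style record."""
    issues = []
    instruction = record.get("instruction", "")
    context = record.get("input", "")
    output = record.get("output", "")

    # An empty or whitespace-only output gives the model nothing to learn from.
    if not output.strip():
        issues.append("empty_output")

    # A URL in the prompt asks the model to describe content it cannot see.
    if URL_PATTERN.search(instruction) or URL_PATTERN.search(context):
        issues.append("external_url")

    return issues


# Made-up sample records for illustration.
sample = [
    {"instruction": "Summarize the article at https://example.com/news",
     "input": "", "output": "The article says..."},
    {"instruction": "Name three primary colors.", "input": "", "output": ""},
    {"instruction": "Give a synonym for 'happy'.", "input": "", "output": "Joyful"},
]

flagged = {}
for index, record in enumerate(sample):
    issues = find_issues(record)
    if issues:
        flagged[index] = issues

print(flagged)  # {0: ['external_url'], 1: ['empty_output']}
```

Heuristics like these only flag candidates. Wrong answers and merged instructions generally need manual review or more specific checks.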

Quick Start & Requirements

  • Dataset available on Hugging Face Hub.
  • Fine-tuned LoRA models (7B and 13B) are available on Hugging Face.
  • A max_prompt_length of 512 or higher is recommended for fine-tuning, because the average prompt length has increased.
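To see why the prompt length limit matters, the sketch below estimates how many prompts would be cut off at a given max_prompt_length. It is a rough illustration under stated assumptions: the prompt template in `build_prompt` is hypothetical, whitespace splitting stands in for a real tokenizer (real token counts are usually higher), and the records are made up.

```python
def build_prompt(record):
    """Join instruction and optional input into one prompt (hypothetical template)."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    if context:
        return f"Instruction: {instruction}\nInput: {context}\nResponse:"
    return f"Instruction: {instruction}\nResponse:"


def count_truncated(records, max_prompt_length=512):
    """Count prompts whose approximate length exceeds max_prompt_length.

    Whitespace splitting is a crude stand-in for the model's tokenizer.
    """
    return sum(
        1 for record in records
        if len(build_prompt(record).split()) > max_prompt_length
    )


# Made-up records: one short prompt and one with a 600-word input.
records = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Summarize the text.", "input": "word " * 600, "output": "..."},
]

print(count_truncated(records, max_prompt_length=512))   # 1
print(count_truncated(records, max_prompt_length=1024))  # 0
```

For a real run, swap the whitespace split for the tokenizer of the model being fine-tuned, since the limit applies to tokens rather than words.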

Highlighted Details

  • Benchmarks show improved performance on WikiText and MNLI metrics compared to the original Alpaca dataset.
  • The cleaned dataset demonstrates reduced hallucination rates in the HALTT4LLM benchmark.
  • Approximately 80% of math problems in the original dataset had incorrect answers, which have been addressed.
  • The dataset is US-centric, as noted in the original data generation.

Maintenance & Community

  • Contributions are encouraged via pull requests to address remaining issues.
  • Compute resources were donated by Q-Blocks Cloud.

Licensing & Compatibility

  • Code and tools are licensed under Apache-2.0.
  • The dataset itself is licensed under CC BY-NC 4.0, which restricts commercial use. Models trained on this dataset should also be used for research purposes only.

Limitations & Caveats

The dataset is licensed for non-commercial use only, limiting its application in commercial products. While curation is ongoing, some issues may still exist within the ~52k entries.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 15 stars in the last 90 days
