AlpacaDataCleaned by gururise

Cleaned dataset for Alpaca LLM training

Created 2 years ago
1,571 stars

Top 26.6% on SourcePulse

View on GitHub
Project Summary

This repository provides a cleaned and curated version of the Alpaca dataset originally used to train Stanford's Alpaca LLM. It addresses issues found in the original dataset, such as hallucinations, merged instructions, empty outputs, and incorrect answers, with the goal of improving the performance of fine-tuned language models and reducing their hallucinations. The target audience is researchers and developers working with LLMs who need a higher-quality instruction-following dataset.

How It Works

The project cleans and curates the original Alpaca dataset by identifying and rectifying several common data-quality issues: removing instructions that prompt hallucinations (e.g., those referencing external URLs), splitting merged instructions, filling in empty outputs, correcting wrong answers (especially to math problems), and clarifying nonsensical instructions. The cleaned dataset is intended to yield better fine-tuning results than the original, noisy dataset.
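The cleanup rules above can be sketched with simple heuristics. This is a minimal illustration, not the maintainers' actual tooling; the regex and the specific rules shown are assumptions:

```python
import re

# Alpaca-format records each have "instruction", "input", and "output" fields.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)

def is_clean(record):
    """Heuristic filter mirroring two of the cleanup rules:
    drop instructions that reference external URLs (a common
    hallucination trigger) and drop entries with empty outputs."""
    text = record["instruction"] + " " + record.get("input", "")
    if URL_RE.search(text):
        return False  # the model cannot actually fetch the URL
    if not record["output"].strip():
        return False  # an empty output teaches the model nothing
    return True

sample = [
    {"instruction": "Summarize the article at https://example.com/post",
     "input": "", "output": "..."},
    {"instruction": "Add 2 and 3.", "input": "", "output": ""},
    {"instruction": "Translate 'hello' to French.", "input": "",
     "output": "bonjour"},
]
cleaned = [r for r in sample if is_clean(r)]  # keeps only the third record
```

Other fixes described above (splitting merged instructions, correcting math answers) required manual or semi-automated review rather than a simple filter.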

Quick Start & Requirements

  • Dataset available on Hugging Face Hub.
  • Fine-tuned LoRA models (7B and 13B) are available on Hugging Face.
  • Recommended max_prompt_length for fine-tuning is 512 or higher due to increased average prompt length.
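To see why a larger max_prompt_length matters, here is how an Alpaca-format record is typically expanded into a training prompt. The template below is the widely used Alpaca prompt format (the variant with an input field); character count is used as a rough stand-in for token count, and real fine-tuning should measure length with the model's tokenizer:

```python
# Standard Alpaca prompt template for records that include an "input" field.
# (A shorter no-input variant also exists; it is omitted here.)
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(record):
    """Render one dataset record into the fine-tuning prompt."""
    return ALPACA_TEMPLATE.format(**record)

record = {"instruction": "Rewrite the sentence in the passive voice.",
          "input": "The cat chased the mouse.", "output": ""}
prompt = build_prompt(record)
# The fixed template text alone consumes part of the budget, and the
# cleaned entries are on average longer than the originals, which is
# why a max_prompt_length of at least 512 is recommended.
```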

Highlighted Details

  • Benchmarks show improved performance on WikiText and MNLI metrics compared to the original Alpaca dataset.
  • The cleaned dataset demonstrates reduced hallucination rates in the HALTT4LLM benchmark.
  • Approximately 80% of math problems in the original dataset had incorrect answers, which have been addressed.
  • The dataset is US-centric, as noted in the original data generation.

Maintenance & Community

  • Contributions are encouraged via pull requests to address remaining issues.
  • Compute resources were donated by Q-Blocks Cloud.

Licensing & Compatibility

  • Code and tools are licensed under Apache-2.0.
  • The dataset itself is licensed under CC BY-NC 4.0, which restricts commercial use. Models trained on this dataset should likewise be used for research purposes only.

Limitations & Caveats

The dataset is licensed for non-commercial use only, limiting its application in commercial products. While curation is ongoing, some issues may still exist within the ~52k entries.

Health Check
Last Commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Travis Fischer (Founder of Agentic), and 1 more.

HaluEval by RUCAIBox

0.8%
510
Benchmark dataset for LLM hallucination evaluation
Created 2 years ago
Updated 1 year ago
Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack

0.4%
265
Efficiently train foundation models with PyTorch
Created 1 year ago
Updated 1 month ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

InternEvo by InternLM

0.2%
407
Lightweight training framework for model pre-training
Created 1 year ago
Updated 4 weeks ago