magpie by magpie-align

Synthetic data pipeline for LLM alignment (ICLR 2025 paper)

Created 1 year ago
770 stars

Top 45.3% on SourcePulse

1 Expert Loves This Project
Project Summary

Magpie is an open-source pipeline for synthesizing high-quality alignment data for Large Language Models (LLMs) from scratch. It targets researchers and developers aiming to democratize AI by providing transparent and scalable methods for creating instruction-following and preference datasets, reducing reliance on costly human annotation or proprietary data.

How It Works

Magpie leverages the auto-regressive nature of aligned LLMs to generate both user queries and model responses by simply providing the LLM's pre-query template. This "prompting with nothing" approach bypasses the need for manual prompt engineering or seed questions, enabling efficient, large-scale data generation. The pipeline includes steps for data filtering, tagging, and conversion to standard formats like ShareGPT.

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n magpie python=3.10), activate it (conda activate magpie), and install requirements (pip install -r requirements.txt).
  • Prerequisites: Access to Hugging Face models (requires huggingface-cli login), Python 3.10, and GPUs (an RTX 4090 with 24 GB of VRAM was tested for 8B models; larger models may require multiple A100s).
  • Demo: A toy example is available in demo.ipynb.
  • Batched Generation: Run bash scripts/magpie.sh for single-turn data generation.

Highlighted Details

  • The authors generated 4 million instructions and responses, from which 300K high-quality instances were selected.
  • Fine-tuned Llama-3-8B-Base models with Magpie data show performance comparable to official Llama-3-8B-Instruct on benchmarks like AlpacaEval, ArenaHard, and WildBench.
  • Supports data generation for Llama-3, Qwen2, Phi 3, and Gemma-2 model families.
  • Includes scripts for multi-turn conversation extension, data tagging (quality, difficulty, safety), and preference data generation.

Maintenance & Community

The project is associated with the ICLR 2025 paper "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing." Recent updates include new datasets for Llama-3.3, Qwen2.5, and reasoning tasks, as well as the release of MagpieLM-Chat models.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, the project aims to democratize AI and mentions using Llama-3 models, which have their own specific licenses. Compatibility for commercial use or closed-source linking would require clarification on the project's licensing.

Limitations & Caveats

While Magpie demonstrates strong performance, some models (e.g., Gemma-1.1, Mistral, Yi) are marked as requiring additional logit processors or filters for best results, so compatibility and output quality may need per-family tuning. The README does not detail hardware requirements for all supported models, and generation with larger models may require significant GPU resources.

Health Check

Last Commit: 6 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 20 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Daniel Han (Cofounder of Unsloth), and 1 more.

Explore Similar Projects

synthetic-data-kit by meta-llama

Top 0.8% on SourcePulse
1k stars
Synthetic data CLI tool for LLM fine-tuning
Created 5 months ago
Updated 1 month ago
Starred by Dan Guido (Cofounder of Trail of Bits), Albert Gu (Cofounder of Cartesia; Professor at CMU), and 10 more.

open-llms by eugeneyan

Top 0.1% on SourcePulse
12k stars
Curated list of commercially-usable open LLMs
Created 2 years ago
Updated 7 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), John Yang (Coauthor of SWE-bench, SWE-agent), and 28 more.

stanford_alpaca by tatsu-lab

Top 0.1% on SourcePulse
30k stars
Instruction-following LLaMA model training and data generation
Created 2 years ago
Updated 1 year ago