magpie by magpie-align

Synthetic data pipeline for LLM alignment (ICLR 2025 paper)

created 1 year ago
741 stars

Top 47.7% on sourcepulse

Project Summary

Magpie is an open-source pipeline for synthesizing high-quality alignment data for Large Language Models (LLMs) from scratch. It targets researchers and developers who want transparent, scalable methods for creating instruction-following and preference datasets, reducing reliance on costly human annotation or proprietary data.

How It Works

Magpie leverages the auto-regressive nature of aligned LLMs to generate both user queries and model responses by providing only the LLM's pre-query template, i.e., the chat-template prefix up to the start of a user turn. This "prompting with nothing" approach bypasses manual prompt engineering and seed questions, enabling efficient, large-scale data generation. The pipeline includes steps for data filtering, tagging, and conversion to standard formats such as ShareGPT.
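As a concrete illustration, here is a minimal sketch of the two-stage idea using Hugging Face transformers. The Llama-3-style template strings, model choice, and sampling settings are assumptions for illustration, not the repository's exact configuration:

```python
# Minimal sketch of "prompting with nothing" (Llama-3-style template assumed;
# not the repository's actual code).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any aligned chat model works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Stage 1: feed only the pre-query template -- the chat prefix up to the start
# of a user turn -- and let the aligned model complete it with an instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.0)
instruction = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Stage 2: wrap the sampled instruction in the full chat template and generate
# the paired response.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

Sampling the same prefix many times with nonzero temperature yields a diverse instruction pool, which the pipeline then filters and tags.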

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n magpie python=3.10), activate it (conda activate magpie), and install requirements (pip install -r requirements.txt).
  • Prerequisites: Access to Hugging Face models (requires huggingface-cli login), Python 3.10, and GPUs (an RTX 4090 with 24 GB was tested for 8B models; larger models may require multiple A100s).
  • Demo: A toy example is available in demo.ipynb.
  • Batched Generation: Run bash scripts/magpie.sh for single-turn data generation (a sketch of the underlying batched sampling follows this list).
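As a rough sketch of what batched single-turn sampling looks like, assuming a vLLM backend (the engine choice, model, and parameters here are assumptions; consult scripts/magpie.sh for the actual settings):

```python
# Hedged sketch of batched pre-query sampling with vLLM; not scripts/magpie.sh itself.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # assumption: any aligned chat model
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=128)

# Repeating the same prefix with temperature > 0 yields a batch of distinct
# candidate instructions in a single engine call.
outputs = llm.generate([pre_query] * 64, params)
instructions = [o.outputs[0].text.strip() for o in outputs]
```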

Highlighted Details

  • Generated up to 4 million instruction-response pairs, from which 300K high-quality instances were selected.
  • Fine-tuned Llama-3-8B-Base models with Magpie data show performance comparable to official Llama-3-8B-Instruct on benchmarks like AlpacaEval, ArenaHard, and WildBench.
  • Supports data generation for the Llama-3, Qwen2, Phi-3, and Gemma-2 model families.
  • Includes scripts for multi-turn conversation extension (sketched after this list), data tagging (quality, difficulty, safety), and preference data generation.
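One plausible mechanic for the multi-turn extension, as a hedged sketch (the helper name and Llama-3-style template strings are illustrative, not the repository's API), is to re-open a user turn after a completed exchange so the model samples a follow-up query:

```python
# Hedged sketch of multi-turn extension: re-open a user turn header after a
# completed exchange so the aligned model samples a follow-up query.
# sample_followup is an illustrative helper, not part of the repository's API.
def sample_followup(tokenizer, model, history, max_new_tokens=128):
    # history: prior turns, e.g. [{"role": "user", ...}, {"role": "assistant", ...}]
    prompt = tokenizer.apply_chat_template(history, tokenize=False)
    prompt += "<|start_header_id|>user<|end_header_id|>\n\n"  # re-opened user turn
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         do_sample=True, temperature=1.0)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```

The sampled follow-up can then be answered with the full chat template, and the loop repeated for deeper conversations.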

Maintenance & Community

The project is associated with the ICLR 2025 paper "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing." Recent updates include new datasets for Llama-3.3, Qwen2.5, and reasoning tasks, as well as the release of MagpieLM-Chat models.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, the project aims to democratize AI and builds on Llama-3 models, which carry their own specific licenses. Commercial use or closed-source integration would require clarifying the project's licensing.

Limitations & Caveats

While Magpie demonstrates strong performance, some models (e.g., Gemma-1.1, Mistral, Yi) are marked as requiring additional logits processors or filters for optimal results, indicating potential compatibility or quality-tuning needs. The README does not detail hardware requirements for every supported model, and generation with larger models may require significant GPU resources.
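For context, a logits processor in this setting typically suppresses tokens that would terminate or derail query sampling. Here is a hedged illustration using the transformers LogitsProcessor interface; the token selection is illustrative, not the repository's actual filter:

```python
# Hedged illustration of a query-sampling logits processor: make chosen token
# ids (e.g., end-of-turn or other special tokens) unsampleable so generation
# does not terminate immediately. Not the repository's actual filter.
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class SuppressTokens(LogitsProcessor):
    def __init__(self, token_ids):
        self.token_ids = list(token_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        scores[:, self.token_ids] = float("-inf")  # zero probability after softmax
        return scores

# Usage: model.generate(**inputs,
#                       logits_processor=LogitsProcessorList([SuppressTokens([eos_id])]))
```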

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 57 stars in the last 90 days

