LIMO offers a novel approach to mathematical reasoning in large language models, demonstrating that high-quality, curated data can significantly outperform massive datasets. It targets researchers and developers aiming to improve LLM performance on complex reasoning tasks with greater data efficiency.
How It Works
LIMO challenges the "more data is always better" paradigm: fine-tuning on a small, high-quality dataset of 817 curated samples achieves state-of-the-art results on mathematical reasoning benchmarks. This curated approach prioritizes data quality over quantity, yielding better generalization and greater efficiency.
Quick Start & Requirements
- Hugging Face Transformers: `pip install transformers`
- vLLM: `pip install vllm`
- Model: Available on Hugging Face (`GAIR/LIMO`).
- Backbone: Qwen2.5-32B-Instruct.
- Compatibility: Hugging Face Transformers, vLLM, and TensorRT-LLM.
- Quick Start: Python code examples are provided for both Transformers and vLLM (see the inference sketches after this list).
- Training: Uses the LLaMA-Factory framework; requires data preparation and a training configuration.
- Evaluation: Scripts available for inference and evaluation (rule-based and model-based).
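Below is a minimal inference sketch with Hugging Face Transformers. It assumes the `GAIR/LIMO` checkpoint ships a standard chat template (as its Qwen2.5-32B-Instruct backbone does); the prompt and generation settings are illustrative placeholders, not the repository's exact quick-start code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the LIMO checkpoint from the Hugging Face Hub.
model_id = "GAIR/LIMO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # shard the 32B model across available GPUs
)

# Build a chat-style prompt; the system/user phrasing is illustrative.
messages = [
    {"role": "system", "content": "You are a helpful math assistant. Reason step by step."},
    {"role": "user", "content": "What is the sum of the first 100 positive integers?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generation hyperparameters are placeholders, not tuned values from the paper.
outputs = model.generate(
    inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.95
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

A corresponding sketch for vLLM follows, again with illustrative sampling parameters; `tensor_parallel_size` is an assumption and should match the number of GPUs available.

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size is an assumption; set it to your GPU count.
llm = LLM(model="GAIR/LIMO", tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=2048)

# For a chat-tuned checkpoint, applying the tokenizer's chat template
# (or using llm.chat) is preferable; a raw prompt is used here for brevity.
prompt = "What is the sum of the first 100 positive integers? Reason step by step."
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```

Either path requires multi-GPU or high-memory hardware, as noted under Limitations & Caveats below.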
Highlighted Details
- Achieves SOTA on multiple benchmarks (AIME24, MATH500, AMC23, etc.) with only 817 training samples.
- Demonstrates strong generalization capabilities, outperforming models trained on 800k+ samples in some cases.
- Released high-quality datasets and evaluation tools for reproducibility.
- Recent updates report AIME 2025 results (a score of 44.6 with only 817 training samples).
Maintenance & Community
Licensing & Compatibility
- MIT License. Permissive for commercial use and closed-source linking.
Limitations & Caveats
- The primary model is a 32B parameter model, requiring significant computational resources for inference and training.
- While LIMO performs strongly across most benchmarks, the authors note slightly lower scores on Minerva and GPQA compared to the previous SOTA.