POLARIS by ChenxinAn-fdu

Scaling RL for advanced reasoning models

created 1 month ago
559 stars

Top 57.2% on SourcePulse

Project Summary

POLARIS is an open-source post-training recipe that enhances reasoning capabilities of large language models using reinforcement learning (RL). It targets researchers and developers seeking to improve model performance on complex reasoning tasks, offering significant gains over base models and outperforming leading commercial systems in benchmark evaluations.

How It Works

POLARIS employs a multi-stage RL training process, building upon existing advanced reasoning models like Qwen3. The approach involves careful data filtering and preparation, including a 53K-sample dataset, and fine-tuning with RL to scale performance. This post-training optimization strategy is designed to elevate the reasoning abilities of models without requiring foundational architectural changes.
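The data-preparation stage can be illustrated with a minimal sketch. POLARIS describes its filtering only at a high level, so the approach below (dropping prompts the base model always or never solves, as measured by a pass rate over sampled rollouts) and all thresholds and field names are assumptions, not the project's actual recipe:

```python
# Illustrative difficulty-based filtering for RL post-training data.
# The pass-rate criterion and thresholds are assumptions, not the POLARIS recipe.

def filter_by_difficulty(samples, low=0.1, high=0.9):
    """Keep prompts the base model solves sometimes but not always.

    Each sample is a dict with a 'pass_rate' in [0, 1], e.g. the fraction
    of N sampled rollouts that reached the correct answer.
    """
    return [s for s in samples if low <= s["pass_rate"] <= high]

samples = [
    {"prompt": "trivial", "pass_rate": 1.0},   # too easy: no learning signal
    {"prompt": "medium", "pass_rate": 0.5},    # informative for RL
    {"prompt": "unsolved", "pass_rate": 0.0},  # too hard: reward is always zero
]
kept = filter_by_difficulty(samples)
```

The intuition is that RL reward signals are uninformative at both extremes: always-solved prompts give no gradient toward improvement, and never-solved prompts yield constant zero reward.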

Quick Start & Requirements

  • Install via pip: pip install -e ./verl and pip install -e ./.
  • Prerequisites: transformers==4.51.0, vllm==0.8.4, tensordict==0.6.2. Ensure VLLM_ATTENTION_BACKEND is unset.
  • Demo and evaluation scripts are provided.
  • Training requires substantial GPU resources (e.g., 32 H800 GPUs for 10 days for a 4B model).
  • Official resources: Notion, Hugging Face Models, Hugging Face Dataset.
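The installation bullets above can be collected into one setup fragment. This is a sketch that assumes the repository has already been cloned and is the current working directory; the pinned versions come from the prerequisites listed above:

```shell
# Setup sketch for POLARIS (assumes cwd is the cloned repository root).
pip install transformers==4.51.0 vllm==0.8.4 tensordict==0.6.2  # pinned prerequisites
pip install -e ./verl   # bundled Verl training framework
pip install -e ./.      # POLARIS itself
unset VLLM_ATTENTION_BACKEND  # must be unset, per the project's instructions
```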

Highlighted Details

  • Achieves significant performance improvements on reasoning benchmarks like AIME24 and AIME25.
  • Outperforms commercial models such as Claude-4-Opus and Grok-3-Beta in reported benchmarks.
  • Supports models up to 7B parameters, with plans for a Coder version.
  • Training and evaluation codebase built on Verl, with multi-node training support via Ray.

Maintenance & Community

  • Developed by HKU NLP Group and Bytedance Seed.
  • Open-sourced dataset, code, and training details.
  • Twitter for updates.

Licensing & Compatibility

  • The repository does not explicitly state a license. The underlying models (Qwen3, Deepseek) have their own licenses. Compatibility for commercial use is not specified.

Limitations & Caveats

  • Requires significant computational resources for training.
  • For evaluation, higher temperatures and longer response lengths than the defaults are recommended for optimal performance.
  • The project is presented with "Preview" model releases, indicating potential for ongoing development and changes.
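The evaluation caveat about sampling settings can be made concrete. The numeric values below are illustrative placeholders, not the settings POLARIS recommends; the point is only the direction of the change relative to typical defaults:

```python
# Hypothetical sampling settings for reasoning evaluation.
# Values are placeholders, not the numbers recommended by POLARIS.
DEFAULT_SAMPLING = {"temperature": 0.7, "max_new_tokens": 8192}

# Reasoning-heavy evaluation favors more exploration (higher temperature)
# and room for long chains of thought (longer generations), so both knobs
# are raised relative to the defaults.
EVAL_SAMPLING = {"temperature": 1.4, "max_new_tokens": 32768}
```

With the pinned vllm==0.8.4, settings like these would typically be passed through `SamplingParams(temperature=..., max_tokens=...)`.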

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 18
  • Star History: 143 stars in the last 30 days

Explore Similar Projects

Starred by Ross Taylor (Cofounder of General Reasoning; Creator of Papers with Code), Daniel Han (Cofounder of Unsloth), and 4 more.

open-instruct by allenai

Training codebase for instruction-following language models

  • Top 0.7% · 3k stars
  • created 2 years ago · updated 12 hours ago