Eurus is a suite of open-source Large Language Models (LLMs) and a reward model optimized for complex reasoning tasks. It targets researchers and developers seeking high-performance models for coding, math, and logical problem-solving, offering significant improvements over existing open-source alternatives and even surpassing GPT-3.5 Turbo in certain reasoning benchmarks.
How It Works
Eurus models are fine-tuned using the UltraInteract dataset, a novel alignment dataset designed for complex reasoning. UltraInteract structures data as preference trees, capturing step-by-step reasoning chains, multi-turn interactions with critiques, and pairwise preference data. This approach allows for both Supervised Fine-Tuning (SFT) on correct reasoning paths and Preference Learning (PL) on comparative data, leading to enhanced reasoning capabilities and instruction following.
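The split between SFT and preference data can be pictured as flattening each preference-tree node into two kinds of training examples. A minimal sketch (the field names are hypothetical illustrations, not UltraInteract's actual schema):

```python
# Sketch: flatten one preference-tree node into an SFT example and a
# preference pair. Field names are hypothetical, not UltraInteract's schema.

def flatten_node(instruction, correct_action, incorrect_action):
    """Return one SFT example (correct path only) and one preference pair."""
    sft_example = {"prompt": instruction, "completion": correct_action}
    preference_pair = {
        "prompt": instruction,
        "chosen": correct_action,      # preferred response for preference learning
        "rejected": incorrect_action,  # incorrect sibling from the same node
    }
    return sft_example, preference_pair

sft, pair = flatten_node(
    "Compute 2 + 2 step by step.",
    "2 + 2 = 4, so the answer is 4.",
    "2 + 2 = 5, so the answer is 5.",
)
print(sft["completion"])  # only the correct reasoning path is used for SFT
print(pair["rejected"])   # the incorrect sibling becomes the rejected response
```

This mirrors the `prompt`/`chosen`/`rejected` layout that common preference-learning toolkits expect, which is one reason pairwise tree data plugs into both SFT and PL pipelines.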
Quick Start & Requirements
- Installation: Models are available via Hugging Face Transformers.
- Dependencies: Requires PyTorch and the Hugging Face Transformers library. The larger variants (e.g., Eurus-70B, Eurux-8x22B) have substantial VRAM requirements.
- Resources: Access to substantial GPU resources is recommended for running larger models.
- Links: Eurus Collection, UltraInteract Dataset
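When prompting the chat variants, Eurus models follow their base models' chat templates. A minimal sketch, assuming the Mistral-style `[INST]` wrapper; verify the exact template on the model card of the variant you use:

```python
# Hedged sketch: wrap a user instruction in the [INST] chat template used by
# Mistral-derived variants. Check the model card for your specific Eurus model.

def format_prompt(instruction: str) -> str:
    return f"[INST] {instruction} [/INST]"

prompt = format_prompt("Solve x^2 - 5x + 6 = 0.")
print(prompt)
```

In practice, prefer the tokenizer's built-in chat template (e.g., `tokenizer.apply_chat_template`) when one is shipped with the checkpoint, since it encodes the exact special tokens the model was trained with.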
Highlighted Details
- Eurus-70B achieves 33.3% pass@1 on LeetCode and 32.6% on TheoremQA, outperforming existing open-source models by margins of over 13.3% on these benchmarks.
- Eurux-8x22B variants demonstrate strong reasoning, chat, and instruction-following capabilities.
- Eurus-RM-7B shows strong preference modeling performance, outperforming GPT-4 on certain reasoning tasks.
- The UltraInteract dataset includes 86k instructions, 286k correct answers, and 219k preference pairs.
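Reward models such as Eurus-RM-7B are commonly trained with a Bradley-Terry-style pairwise objective over chosen/rejected pairs (the Eurus paper augments this with additional terms; the following is a generic sketch, not the project's exact loss):

```python
import math

def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): low when chosen outscores rejected."""
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good = bradley_terry_loss(2.0, -1.0)  # chosen clearly preferred -> low loss
bad = bradley_terry_loss(-1.0, 2.0)   # ranking inverted -> high loss
print(good < bad)  # True
```

Minimizing this loss pushes the reward model to assign higher scores to preferred responses, which is what makes pairwise data like UltraInteract's preference pairs directly usable for reward modeling.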
Maintenance & Community
- The project is associated with OpenBMB and has contributions from multiple researchers.
- Updates are provided via news releases on the project page.
Licensing & Compatibility
- The models are released under a permissive license (reported as Apache 2.0), but verify the license for specific model weights, which may inherit terms from their base models.
- Compatible with standard LLM inference frameworks.
Limitations & Caveats
- While performance is strong, the specific benchmark results should be independently verified.
- Running the larger 70B and 8x22B models requires significant computational resources.