Minimal reproduction of DeepSeek R1 Zero for countdown/multiplication tasks
TinyZero aims to reproduce the reasoning capabilities of DeepSeek R1-Zero, specifically for countdown and multiplication tasks. It targets researchers and developers interested in enhancing large language models (LLMs) with self-verification and search abilities through reinforcement learning (RL), offering a cost-effective path to advanced reasoning.
How It Works
TinyZero builds upon the veRL framework, leveraging RL to imbue a base LLM with emergent self-verification and search skills. This approach allows the model to develop sophisticated reasoning without explicit programming, potentially leading to more robust and generalizable problem-solving capabilities.
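A key ingredient in this setup is a rule-based reward: the model only scores when its final answer actually solves the task, which is what lets self-verification emerge without explicit programming. The README does not show the reward code, so the following is a minimal sketch for the countdown task (the function name `countdown_reward` and the binary 0/1 scoring are illustrative assumptions, not TinyZero's actual implementation):

```python
import ast
import operator

# Allowed binary operators for countdown expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    # Recursively evaluate a restricted arithmetic AST: numbers and + - * / only.
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("disallowed expression")

def countdown_reward(expression: str, numbers: list, target: int) -> float:
    """Hypothetical rule-based reward: 1.0 if the expression uses exactly the
    given numbers (each once) and evaluates to the target, else 0.0."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = sorted(int(n.value) for n in ast.walk(tree)
                      if isinstance(n, ast.Constant))
        if used != sorted(numbers):
            return 0.0  # must use exactly the provided numbers
        return 1.0 if abs(_eval(tree.body) - target) < 1e-6 else 0.0
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # malformed or unsafe output earns nothing

print(countdown_reward("(25 - 5) * 3 + 4", [25, 5, 3, 4], 64))  # → 1.0
```

Because the reward checks only the final expression, the RL loop is free to reinforce whatever intermediate reasoning (search, backtracking, self-checks) leads to correct answers.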
Quick Start & Requirements
```shell
conda create -n zero python=3.9
pip install torch
pip3 install vllm==0.6.3
pip3 install ray verl
pip install -e .
pip3 install flash-attn --no-build-isolation
pip install wandb IPython matplotlib
python ./examples/data_preprocess/countdown.py --local_dir {path_to_your_dataset}
./scripts/train_tiny_zero.sh
```

The training script is configured with environment variables for GPU count, model path, data directory, and experiment name. For larger models, gradient checkpointing can be enabled with `critic.model.enable_gradient_checkpointing=True` to reduce VRAM usage.
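The preprocessing script builds the countdown dataset. As an illustration of the task itself (not TinyZero's actual preprocessing code), here is a sketch of generating guaranteed-solvable instances; the function names and the `{"numbers", "target"}` record shape are assumptions for illustration:

```python
import itertools
import random

# Operators used to combine numbers in this sketch (+, -, *).
OPS = (lambda a, b: a + b, lambda a, b: a - b, lambda a, b: a * b)

def reachable(nums):
    """All values obtainable by combining every number exactly once with + - *."""
    if len(nums) == 1:
        return {nums[0]}
    out = set()
    for i, j in itertools.combinations(range(len(nums)), 2):
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        for op in OPS:
            # Try both operand orders, since - is not commutative.
            out |= reachable(rest + [op(nums[i], nums[j])])
            out |= reachable(rest + [op(nums[j], nums[i])])
    return out

def make_countdown_instance(k=3, lo=1, hi=25, seed=0):
    """Sample k source numbers, then pick a positive reachable target,
    so every generated instance is solvable by construction."""
    rng = random.Random(seed)
    nums = [rng.randint(lo, hi) for _ in range(k)]
    target = rng.choice(sorted(t for t in reachable(nums) if t > 0))
    return {"numbers": nums, "target": target}

print(make_countdown_instance())
```

Checking the target against the reachable set is what makes a binary, rule-based reward workable: the model is never penalized for an unsolvable prompt.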
Maintenance & Community
The project is maintained by Jiayi Pan and collaborators. Further details on community channels or roadmaps are not explicitly provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README notes that Qwen2.5-0.5B base models may fail to learn reasoning. For larger models, multi-GPU setups are recommended, and gradient checkpointing might be necessary to manage VRAM. The project appears to be research-oriented, and stability for production use is not guaranteed.