NVlabs/QeRL: Efficient RL for large language models on single GPUs
Quantization-enhanced Reinforcement Learning (QeRL) addresses the significant computational demands of applying reinforcement learning to large language models (LLMs). It enables the training of up to 32B parameter LLMs on a single H100 GPU, offering a low-cost and efficient alternative for researchers and engineers. QeRL accelerates RL training, improves exploration, and achieves performance comparable to full-parameter fine-tuning.
How It Works
QeRL integrates NVFP4 quantization with Low-Rank Adaptation (LoRA) to drastically reduce memory overhead and speed up the rollout phase of RL training. A key insight is that quantization noise inherently increases policy entropy, which enhances exploration during RL training, leading to the discovery of better strategies. The framework further optimizes this with an Adaptive Quantization Noise (AQN) mechanism that dynamically adjusts noise levels. This approach yields over 1.5x speedup in rollouts and enables training of larger models on constrained hardware.
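As a rough illustration of the entropy argument (not QeRL's actual implementation), the sketch below models quantization noise as an additive Gaussian perturbation on the logits of an already confident policy, with an assumed exponentially decaying schedule standing in for AQN; the function names, schedule, and constants are all hypothetical.

```python
import torch

def policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the categorical policy defined by logits."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

def aqn_like_sigma(step: int, total_steps: int,
                   start: float = 1.0, end: float = 0.05) -> float:
    """Assumed AQN-style schedule: noise decays exponentially over training."""
    t = step / max(total_steps - 1, 1)
    return start * (end / start) ** t

torch.manual_seed(0)
vocab, batch = 1024, 64
# A confident ("trained") toy policy: one strongly dominant token per position.
logits = torch.randn(batch, vocab)
logits[torch.arange(batch), torch.randint(vocab, (batch,))] += 10.0

sigma = aqn_like_sigma(step=0, total_steps=1000)  # early training: large noise
noisy_entropy = torch.stack([
    policy_entropy(logits + sigma * torch.randn_like(logits))
    for _ in range(50)
]).mean()

# For a peaked policy, the noise-averaged entropy is typically higher than the
# noiseless one, i.e. the perturbed policy explores more.
print(f"entropy: {policy_entropy(logits).item():.3f} -> {noisy_entropy.item():.3f} with noise")
```

In QeRL itself the noise comes from the NVFP4 representation of the weights, with AQN adjusting its level over training; the sketch only shows why such noise tends to flatten a peaked policy and thereby encourage exploration.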
Quick Start & Requirements
Create a conda environment (conda create -n qerl python=3.10 -y, then conda activate qerl), install CUDA 12.4 (conda install nvidia/label/cuda-12.4.1::cuda and conda install -c nvidia/label/cuda-12.4.1 cudatoolkit), and run sh setup_env.sh. A separate environment for quantization (llmcompressor) requires Python 3.12.
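The same setup, written out as a shell session. The first block repeats the commands above; the last line, which creates the separate Python 3.12 environment for llmcompressor, is an assumed concretization of the stated requirement and may differ from the repository's actual instructions.

```bash
# Training environment (Python 3.10) and CUDA 12.4 toolkit
conda create -n qerl python=3.10 -y
conda activate qerl
conda install nvidia/label/cuda-12.4.1::cuda
conda install -c nvidia/label/cuda-12.4.1 cudatoolkit
sh setup_env.sh

# Separate quantization environment (llmcompressor); creation command is an
# assumption based on the stated Python 3.12 requirement.
conda create -n llmcompressor python=3.12 -y
```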
Maintenance & Community
No specific details on maintainers, community channels (like Discord/Slack), or roadmaps were found in the provided README.
Licensing & Compatibility
The QeRL code is released under the Apache 2.0 License, which is permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
The README notes that hardware setups other than the tested ones might work but have not been verified. Additionally, prefill logits computation currently requires dequantization, which is identified as an area for future optimization.