Compressed language models via pruning/distillation
Minitron is a family of compressed small language models (SLMs) derived from larger models through pruning and knowledge distillation. It targets researchers and developers who need efficient, high-performance language models with reduced computational requirements, and the models offer state-of-the-art accuracy for their size.
How It Works
Minitron models are created by pruning a base model along its embedding dimension, attention heads, and MLP intermediate dimension, then continuing training with knowledge distillation from the original model. This approach cuts training costs substantially (up to 40x fewer training tokens than training a comparable model from scratch) and yields models that outperform other compression techniques while remaining competitive with larger, uncompressed models.
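As a rough illustration of the distillation step, the pruned student can be trained to match the teacher's output distribution. The sketch below is a generic logit-distillation loss in PyTorch, not Minitron's exact recipe; the `temperature` parameter and the KL-only objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay consistent across temperatures.
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2
```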
Quick Start & Requirements
- Models are available on Hugging Face (e.g., Mistral-NeMo-Minitron-8B-Base); usage instructions are in the model cards.
- NeMo checkpoints run inside the nvcr.io/nvidia/nemo:24.05 container, which requires mounting the TensorRT-Model-Optimizer and model directories.
- Dependencies: transformers, nvidia-modelopt, and TensorRT-LLM; a GPU with CUDA is recommended for inference and export.
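For a quick check with `transformers`, a minimal inference sketch is below. It assumes the Hugging Face model ID `nvidia/Mistral-NeMo-Minitron-8B-Base`; confirm the exact ID and recommended settings in the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model ID; see the model card for the canonical name.
model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory; an 8B model still needs a large GPU
    device_map="auto",           # places weights on available GPU(s)
)

prompt = "Pruning and distillation can"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```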
Highlighted Details

- Models ship in the .nemo checkpoint format for NeMo and can be exported to TensorRT-LLM.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats