NVlabs/Minitron
Compressed language models via pruning/distillation
Top 78.6% on SourcePulse
Minitron is a family of compressed small language models (SLMs) derived from larger models through pruning and knowledge distillation. It targets researchers and developers seeking efficient, high-performance language models with reduced computational requirements, offering state-of-the-art accuracy for their size.
How It Works
Minitron models are created by pruning the embedding size, attention heads, and MLP intermediate dimensions of a larger base model, then continuing training with knowledge distillation. This approach cuts training cost substantially (up to 40x fewer training tokens than training from scratch) and yields models that outperform other compression techniques while achieving competitive accuracy against larger, uncompressed models.
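For illustration, here is a minimal sketch of a standard knowledge-distillation objective of the kind used during continued training, assuming PyTorch; the temperature and exact loss form are illustrative, not the repository's precise recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    # Soften both distributions, then compare student log-probs to teacher probs.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as in standard distillation.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy example: vocabulary of 8, batch of 4 token positions.
teacher_logits = torch.randn(4, 8)                       # frozen teacher outputs
student_logits = torch.randn(4, 8, requires_grad=True)   # pruned student outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```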
Quick Start & Requirements
Pretrained checkpoints are published on Hugging Face (e.g., Mistral-NeMo-Minitron-8B-Base); usage instructions are in the model cards (see the loading sketch below).
NeMo workflows use the nvcr.io/nvidia/nemo:24.05 container. Requires mounting TensorRT-Model-Optimizer and model directories.
Dependencies: transformers, nvidia-modelopt, TensorRT-LLM. A GPU with CUDA is recommended for inference and export.
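A minimal loading sketch with transformers, assuming the Hugging Face model ID nvidia/Mistral-NeMo-Minitron-8B-Base and an installed accelerate package for device placement (check the model card for the exact ID and requirements):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model ID; substitute the Minitron checkpoint named in the model card.
model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the checkpoint's native precision
    device_map="auto",   # place weights on available GPU(s); requires accelerate
)

inputs = tokenizer("Minitron models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```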
Highlighted Details
Models are provided in the .nemo checkpoint format for NeMo and can be exported to TensorRT-LLM.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The repository's last commit was 11 months ago and it is currently marked inactive.