Minitron by NVlabs

Compressed language models via pruning/distillation

Created 1 year ago
352 stars

Top 79.1% on SourcePulse

View on GitHub
Project Summary

Minitron is a family of compressed small language models (SLMs) derived from larger models through pruning and knowledge distillation. It targets researchers and developers seeking efficient, high-performance language models with reduced computational requirements, offering state-of-the-art accuracy for their size.

How It Works

Minitron models are created by pruning a base model along three axes: embedding dimension, number of attention heads, and MLP intermediate dimension. The pruned model is then retrained with knowledge distillation from the original. Compared with training an equivalent model from scratch, this approach significantly reduces training cost (up to 40x fewer tokens) and yields models that outperform other compression techniques while remaining competitive with larger, uncompressed models.
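The distillation step above can be sketched as minimizing the KL divergence between the teacher's and student's temperature-softened output distributions. The snippet below is a minimal plain-Python illustration of that objective, not Minitron's actual training code (which runs in NeMo over full logit tensors):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    # Higher temperature softens the distribution, exposing
    # more of the teacher's "dark knowledge" to the student.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    # Forward KL(teacher || student) on softened distributions --
    # the classic knowledge-distillation objective.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]
# A student that matches the teacher exactly incurs zero loss;
# a mismatched (uniform) student incurs a positive loss.
print(kd_loss(teacher, teacher))              # 0.0
print(kd_loss(teacher, [0.0, 0.0, 0.0]) > 0)  # True
```

During continued training, this loss (summed over vocabulary logits at each token position) replaces or supplements the usual cross-entropy against ground-truth labels, which is what lets the pruned student recover accuracy with far fewer tokens.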

Quick Start & Requirements

  • Hugging Face: Models are available on Hugging Face (e.g., Mistral-NeMo-Minitron-8B-Base). Usage instructions are in model cards.
  • NeMo Container: For TensorRT-LLM export, use the nvcr.io/nvidia/nemo:24.05 container. Requires mounting TensorRT-Model-Optimizer and model directories.
  • Dependencies: Python, PyTorch, Hugging Face transformers, nvidia-modelopt, TensorRT-LLM. GPU with CUDA is recommended for inference and export.
  • Resources: Exporting to TensorRT-LLM involves Docker builds and model conversion, requiring significant disk space and compute.
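The container workflow above might look like the following. This is an illustrative sketch only: the host paths are placeholders, and the in-container mount points are assumptions to be checked against the repo's export instructions.

```shell
# Pull the NeMo container referenced above (tag from the repo instructions).
docker pull nvcr.io/nvidia/nemo:24.05

# Run it with the TensorRT-Model-Optimizer repo and a model directory
# mounted inside (adjust host paths and mount points to your layout).
docker run --gpus all -it --rm \
  -v /path/to/TensorRT-Model-Optimizer:/workspace/TensorRT-Model-Optimizer \
  -v /path/to/models:/workspace/models \
  nvcr.io/nvidia/nemo:24.05
```

From inside the container, the .nemo checkpoint can then be converted and exported to TensorRT-LLM per the model cards; expect the pulled image and converted engines to consume substantial disk space.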

Highlighted Details

  • Achieves SOTA 8B model performance using only 400B tokens.
  • Offers up to 16% MMLU improvement over training from scratch.
  • Models are available in .nemo checkpoint format for NeMo and can be exported to TensorRT-LLM.
  • Supports fine-tuning with frameworks like LMFlow, including LoRA and LISA.

Maintenance & Community

  • Developed by NVIDIA (NVlabs).
  • Models are available on Hugging Face, with community-quantized FP8 versions also provided.
  • Technical report and blog posts detail the methodology and results.

Licensing & Compatibility

  • Released under the NVIDIA Open Model License Agreement. Specific terms should be reviewed for commercial use.

Limitations & Caveats

  • The NVIDIA Open Model License Agreement may have restrictions on commercial use.
  • Exporting to TensorRT-LLM requires specific NVIDIA container environments and can be complex.
Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 1 more.

awesome-knowledge-distillation by dkozlov

  • Collection of knowledge distillation resources
  • Top 0.1% on SourcePulse · 4k stars
  • Created 8 years ago · Updated 3 months ago