Minitron by NVlabs

Compressed language models via pruning/distillation

created 1 year ago · 347 stars · Top 81.1% on sourcepulse

Project Summary

Minitron is a family of compressed small language models (SLMs) derived from larger models through pruning and knowledge distillation. It targets researchers and developers seeking efficient, high-performance language models with reduced computational requirements, offering state-of-the-art accuracy for their size.

How It Works

Minitron models are created by first pruning the embedding dimension, attention heads, and MLP intermediate dimension of a base model, then continuing training with knowledge distillation from the original model. This approach cuts training cost substantially (up to 40x fewer training tokens than training from scratch) and yields models that outperform those produced by other compression techniques while remaining competitive with larger, uncompressed models.
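
As a rough illustration (not NVIDIA's training code), the distillation step can be read as minimizing the KL divergence between the teacher's and the pruned student's next-token distributions. A minimal PyTorch sketch, with the temperature value as an assumption:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Soften both distributions, then penalize the student for
        # diverging from the teacher across the vocabulary dimension.
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2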

Quick Start & Requirements

  • Hugging Face: Models are available on Hugging Face (e.g., Mistral-NeMo-Minitron-8B-Base); usage instructions are in the model cards, and a minimal loading example follows this list.
  • NeMo Container: For TensorRT-LLM export, use the nvcr.io/nvidia/nemo:24.05 container. Requires mounting TensorRT-Model-Optimizer and model directories.
  • Dependencies: Python, PyTorch, Hugging Face transformers, nvidia-modelopt, TensorRT-LLM. GPU with CUDA is recommended for inference and export.
  • Resources: Exporting to TensorRT-LLM involves Docker builds and model conversion, requiring significant disk space and compute.
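
A minimal loading sketch using Hugging Face transformers; the repository ID nvidia/Mistral-NeMo-Minitron-8B-Base and the generation settings are assumptions, so defer to the model card for authoritative usage:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"  # assumed Hugging Face repo ID
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    prompt = "Compressed language models are"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))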

Highlighted Details

  • Achieves state-of-the-art accuracy for an 8B model using only 400B training tokens.
  • Offers up to 16% MMLU improvement over training from scratch.
  • Models are available in .nemo checkpoint format for NeMo and can be exported to TensorRT-LLM.
  • Supports fine-tuning with frameworks like LMFlow, including LoRA and LISA (an illustrative LoRA sketch follows).
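
As an illustration only (the repository points to LMFlow; the Hugging Face PEFT route below is a common alternative, not the documented workflow), attaching LoRA adapters might look like:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("nvidia/Mistral-NeMo-Minitron-8B-Base")
    lora_cfg = LoraConfig(
        r=16,            # adapter rank; value chosen for illustration
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # only the adapter weights remain trainable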

Maintenance & Community

  • Developed by NVIDIA (NVlabs).
  • Models are available on Hugging Face, with community-quantized FP8 versions also provided.
  • Technical report and blog posts detail the methodology and results.

Licensing & Compatibility

  • Released under the NVIDIA Open Model License Agreement. Specific terms should be reviewed for commercial use.

Limitations & Caveats

  • The NVIDIA Open Model License Agreement may have restrictions on commercial use.
  • Exporting to TensorRT-LLM requires specific NVIDIA container environments and can be complex.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; AI researcher at UC Berkeley), and 5 more.

Explore Similar Projects

Liger-Kernel by linkedin

Top 0.6% · 5k stars · Triton kernels for efficient LLM training
created 1 year ago · updated 2 days ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 10 more.

qlora by artidoro

Top 0.2% · 11k stars · Finetuning tool for quantized LLMs
created 2 years ago · updated 1 year ago
Starred by Nat Friedman (former CEO of GitHub), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA

Top 0.2% · 6k stars · Optimized transformer library for inference
created 4 years ago · updated 1 year ago
Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

Top 0.6% · 11k stars · LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago · updated 22 hours ago