This repository provides the official implementations for MiniMax-Text-01 and MiniMax-VL-01, large-scale language and vision-language models. These models are designed for researchers and developers seeking state-of-the-art performance in long-context understanding and multimodal tasks, offering advanced architectures and competitive benchmark results.
How It Works
MiniMax-Text-01 combines a hybrid attention design, mixing Lightning Attention (a linear-attention variant) with standard softmax attention, together with a Mixture-of-Experts (MoE) feed-forward layer, reaching a 1-million-token context length during training and up to 4 million tokens at inference. Training relies on parallelism strategies such as LASP+ and ETP for efficient scaling. MiniMax-VL-01 builds on this backbone by integrating a Vision Transformer (ViT) and a dynamic-resolution mechanism, processing images at resolutions up to 2016x2016 while keeping a 336x336 thumbnail for efficient multimodal understanding.
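To make the hybrid layout concrete, below is a minimal, self-contained PyTorch sketch (not the official implementation): most decoder blocks use a simple linear-attention stand-in for Lightning Attention, periodic blocks use standard softmax attention, and every block ends with a toy top-1 MoE feed-forward. The class names, the 1-in-8 softmax ratio, and all dimensions are illustrative assumptions.

```python
# Illustrative hybrid-attention + MoE decoder sketch (assumptions only, not MiniMax code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """O(n) kernelized linear attention (non-causal simplification), standing in for Lightning Attention."""
    def __init__(self, d):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.out = nn.Linear(d, d)
    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1                 # positive feature map
        kv = torch.einsum("bnd,bne->bde", k, v)           # accumulate key-value outer products
        z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return self.out(torch.einsum("bnd,bde,bn->bne", q, kv, z))

class SoftmaxAttention(nn.Module):
    """Standard multi-head softmax attention."""
    def __init__(self, d, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)
    def forward(self, x):
        return self.mha(x, x, x, need_weights=False)[0]

class MoEFeedForward(nn.Module):
    """Toy top-1 MoE: each token is routed to one small expert MLP."""
    def __init__(self, d, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
    def forward(self, x):
        idx = self.router(x).argmax(dim=-1)               # top-1 expert index per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class HybridBlock(nn.Module):
    def __init__(self, d, use_softmax):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = SoftmaxAttention(d) if use_softmax else LinearAttention(d)
        self.ffn = MoEFeedForward(d)
    def forward(self, x):
        x = x + self.attn(self.norm1(x))                  # attention sub-layer with residual
        return x + self.ffn(self.norm2(x))                # MoE feed-forward sub-layer with residual

class HybridDecoder(nn.Module):
    def __init__(self, d=256, n_layers=8, softmax_every=8):
        super().__init__()
        # Assumed pattern: one softmax-attention block per `softmax_every` layers, the rest linear.
        self.blocks = nn.ModuleList(
            HybridBlock(d, use_softmax=((i + 1) % softmax_every == 0))
            for i in range(n_layers))
    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

x = torch.randn(2, 128, 256)          # (batch, sequence, hidden)
print(HybridDecoder()(x).shape)       # torch.Size([2, 128, 256])
```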
Quick Start & Requirements
- Installation: model weights are distributed on Hugging Face and loaded through the Transformers library (see the loading sketch after this list).
- Hardware: Requires multiple GPUs (e.g., 8 GPUs for the provided examples).
- Dependencies: PyTorch and Hugging Face Transformers; vLLM is suggested for deployment.
- Resources: Significant GPU memory is needed due to the large parameter counts.
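A minimal loading sketch with Hugging Face Transformers is shown below. The repo id under the MiniMaxAI organization, the dtype, and the device map are assumptions; verify them against the official model card.

```python
# Hedged loading sketch; repo id and settings are assumptions, not official guidance.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-Text-01"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,        # custom architecture code lives in the hub repo
    device_map="auto",             # shard across all visible GPUs (e.g., 8x)
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer("Hello, MiniMax!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```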
Highlighted Details
- MiniMax-Text-01: 456B total parameters, 45.9B activated per token, 1M context training, 4M inference context.
- MiniMax-VL-01: Integrates a 303M ViT with MiniMax-Text-01, supports dynamic image resolutions up to 2016x2016.
- Strong performance across standard academic benchmarks (MMLU, GSM8K, HumanEval) and long-context evaluations (Needle In A Haystack, LongBench).
- Supports INT8 quantization for reduced memory footprint.
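As a hedged illustration of the INT8 option, the snippet below uses Transformers' QuantoConfig (which requires the optimum-quanto package); whether the official repo recommends this exact quantization path, and which modules it excludes from quantization, should be checked against the model card.

```python
# Hedged INT8 loading sketch via Transformers quantization configs (assumed path).
import torch
from transformers import AutoModelForCausalLM, QuantoConfig

model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-Text-01",                        # assumed repo id
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=QuantoConfig(weights="int8"),   # quantize weights to INT8
)
```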
Maintenance & Community
- Official repository from MiniMax.
- Contact: model@minimaxi.com for API and server inquiries.
Licensing & Compatibility
- The specific license is not explicitly stated in the README. The models are distributed via Hugging Face, where each model card carries its own license terms; verify those terms, particularly for commercial use, before deploying.
Limitations & Caveats
- The provided quick-start examples assume a distributed setup across multiple GPUs, indicating significant hardware requirements for effective use.
- The README does not detail specific installation steps beyond Hugging Face model loading, and deployment guidance points to external tools like vLLM.
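For deployment, a hedged sketch of offline inference through vLLM's Python API follows; it assumes a vLLM build that supports this architecture, and the repo id, tensor-parallel degree, and context length are placeholders to adjust per the official deployment guide.

```python
# Hedged vLLM offline-inference sketch; model id and sizes are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-Text-01",  # assumed Hugging Face repo id
    trust_remote_code=True,             # load custom model code from the hub repo
    tensor_parallel_size=8,             # shard across 8 GPUs
    max_model_len=131072,               # reduce if GPU memory is limited
)

outputs = llm.generate(
    ["Summarize the benefits of linear attention in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```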