MindPipe by MAC-AutoML

LLM/LVLM compression and evaluation framework

Created 5 months ago

1,012 stars

Top 36.2% on SourcePulse

Project Summary

MindPipe is a unified framework for compressing Large Language Models (LLMs) and Large Vision-Language Models (LVLMs). A single command-line interface covers post-training quantization, quantization-aware training, several pruning strategies, and evaluation. It targets researchers and engineers who need reproducible experiments and deployment across different hardware, notably NVIDIA GPUs and Huawei Ascend NPUs, and it provides a consistent abstraction layer over both.

How It Works

The framework is built around a single main.py entrypoint that drives the quantization, pruning, and evaluation pipelines. A device abstraction layer handles cache management, synchronization, and dtype policies uniformly across GPUs and NPUs. MindPipe integrates 11 quantization methods (PTQ and QAT) and 7 pruning methods, and it supports both text-only and multimodal architectures through a shared model adapter. Results are serialized to JSON for aggregation and analysis.
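
The JSON serialization and aggregation step can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the file naming scheme, the record fields (method, model, metrics), and the helper names save_result and aggregate are hypothetical and are not taken from MindPipe. The metric values are placeholders, not measurements.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def save_result(out_dir, method, model, metrics):
    """Write one run's metrics to <out_dir>/<model>__<method>.json.

    The file layout and record fields are hypothetical, chosen for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{model}__{method}.json"
    record = {"method": method, "model": model, "metrics": metrics}
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


def aggregate(out_dir):
    """Collect every result file under out_dir into {model: {method: metrics}}."""
    table = defaultdict(dict)
    for path in sorted(Path(out_dir).glob("*.json")):
        record = json.loads(path.read_text())
        table[record["model"]][record["method"]] = record["metrics"]
    return dict(table)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder values only; these are not benchmark results.
        save_result(tmp, "method-a", "demo-model", {"score": 1.0})
        save_result(tmp, "method-b", "demo-model", {"score": 2.0})
        print(json.dumps(aggregate(tmp), indent=2))
```

Keeping one file per run means a failed or repeated run touches only its own record, and aggregation stays a simple directory scan.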

Quick Start & Requirements

Installation: Activate the mindpipe conda environment (conda activate mindpipe), initialize the submodules (git submodule update --init --recursive), then install the dependencies (python -m pip install -r requirements.txt).
Prerequisites: An NVIDIA GPU or a Huawei Ascend NPU is required. VLMEvalKit integration requires either initializing its submodule or setting the VLMEVALKIT_ROOT environment variable.
Usage: main.py is the primary interface; command-line arguments select the compression task and the evaluation to run. The README gives example commands for full-precision evaluation, quantization, and pruning.
Links: No external quick-start or demo links are provided; the README is the primary resource.
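
The two ways of locating VLMEvalKit named under Prerequisites (the VLMEVALKIT_ROOT environment variable or an initialized submodule) could be resolved along these lines. This is a sketch under assumptions, not MindPipe's code: the function name resolve_vlmevalkit_root, the precedence of the environment variable over the submodule, and the submodule path third_party/VLMEvalKit are all hypothetical.

```python
import os
from pathlib import Path


def resolve_vlmevalkit_root(repo_root, env=None,
                            submodule_dir="third_party/VLMEvalKit"):
    """Return the VLMEvalKit directory, preferring VLMEVALKIT_ROOT if set.

    submodule_dir is a hypothetical location; check the repository's
    .gitmodules file for the real submodule path.
    """
    env = os.environ if env is None else env
    override = env.get("VLMEVALKIT_ROOT")
    if override:
        candidate = Path(override).expanduser()
        source = "VLMEVALKIT_ROOT"
    else:
        candidate = Path(repo_root) / submodule_dir
        source = "git submodule"

    # A submodule that was never initialized is an empty directory,
    # so an existence check alone is not enough.
    if not candidate.is_dir() or not any(candidate.iterdir()):
        raise FileNotFoundError(
            f"VLMEvalKit not found via {source} at {candidate}; run "
            "'git submodule update --init --recursive' or set VLMEVALKIT_ROOT"
        )
    return candidate


if __name__ == "__main__":
    try:
        print(resolve_vlmevalkit_root(Path.cwd()))
    except FileNotFoundError as err:
        print(err)
```

Failing early with a message that names both remedies avoids a less obvious import error later in a multimodal evaluation run.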

Highlighted Details

Implements 11 quantization methods (e.g., AWQ, GPTQ, FlatQuant, QLoRA) and 7 pruning methods (e.g., Wanda, SparseGPT, LLM-Pruner).
Supports a wide range of models, including LLaMA-family, Qwen (text/VL), MiniCPM-V, LLaVA, and InternVL.
Integrates VLMEvalKit for multimodal evaluation; AWQ W4A16 was recently validated on key VLM benchmarks.
Ensures reproducibility through shared utilities for model loading, dataset handling, and result serialization.

Maintenance & Community

The provided README does not contain information regarding specific maintainers, community channels (e.g., Discord, Slack), or project sponsorships.

Licensing & Compatibility

The README does not specify the software license or provide details on compatibility for commercial use or integration with closed-source projects.

Limitations & Caveats

Some algorithms, such as QuaRot, SpinQuant, and MQuant, are not yet marked as NPU-ready. QA-LoRA is a CUDA-only implementation and does not produce AutoGPTQ packed checkpoints. QLoRA's NPU support relies on an in-tree fake-quant fallback. Whether a model can be reloaded after custom runtime wrappers have been applied depends on the method.
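
The caveats above can be turned into a pre-flight check that runs before a long job is launched. The sketch below only transcribes the statements in this section; the table layout, the status labels, the device strings "cuda" and "npu", and the function name check_method are hypothetical and do not describe MindPipe's own interface. A method with no caveat listed here is reported as such, which is not a claim that it is verified on that device.

```python
# Caveats transcribed from the paragraph above; names and layout are hypothetical.
NOT_MARKED_NPU_READY = {"quarot", "spinquant", "mquant"}
CUDA_ONLY = {"qa-lora"}
NPU_FALLBACK = {"qlora": "uses the in-tree fake-quant fallback on NPU"}


def check_method(method, device):
    """Return (status, note) for running `method` on `device`.

    status is one of "unsupported", "unverified", "fallback",
    or "no-caveat-listed".
    """
    if device not in ("cuda", "npu"):
        raise ValueError(f"unknown device: {device!r}")
    name = method.lower()
    if device == "npu":
        if name in CUDA_ONLY:
            return "unsupported", f"{method} is a CUDA-only implementation"
        if name in NOT_MARKED_NPU_READY:
            return "unverified", f"{method} is not yet marked as NPU-ready"
        if name in NPU_FALLBACK:
            return "fallback", f"{method} {NPU_FALLBACK[name]}"
    return "no-caveat-listed", ""


if __name__ == "__main__":
    for method in ("QA-LoRA", "QuaRot", "QLoRA"):
        status, note = check_method(method, "npu")
        print(f"{method}: {status} {note}".rstrip())
```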

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Star History

4 stars in the last 30 days