QuIP (Cornell-RelaxML): Code for LLM quantization research
QuIP provides code for 2-bit quantization of large language models (LLMs) using an "incoherence processing" technique, enabling significant model compression with minimal performance degradation. It's targeted at researchers and engineers working with LLMs who need to reduce memory footprint and inference costs. The primary benefit is achieving near FP16 performance at 2-bit precision.
How It Works
QuIP builds upon the OPTQ repository, introducing "incoherence processing", which adds specific pre- and post-processing steps (the --incoh_processing meta-argument). This approach, detailed in the accompanying paper, keeps quantization stable down to 2 bits by controlling quantization error. The repository also implements several quantization algorithms, including LDLQ, LDLQ_RG, and GPTQ, with a focus on theoretical analysis and empirical verification of their equivalence.
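To make the idea concrete, here is a minimal numpy sketch of incoherence processing: rotate the weight matrix with random orthogonal matrices so no single entry or direction dominates, quantize in the rotated basis, then rotate back. The helper names and the plain round-to-nearest quantizer are illustrative stand-ins, not the repository's code; QuIP's actual pipeline uses adaptive rounding (LDLQ) driven by second-order (Hessian) information.

```python
import numpy as np

def random_orthogonal(n, rng):
    # QR of a Gaussian matrix yields a Haar-random orthogonal matrix
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix for uniform distribution

def quantize_nearest(w, bits=2):
    # Uniform round-to-nearest on a symmetric grid (illustrative only;
    # QuIP's LDLQ rounds adaptively rather than entrywise like this).
    n_levels = 2 ** bits
    scale = 2 * np.abs(w).max() / (n_levels - 1)
    codes = np.clip(np.round(w / scale), -(n_levels // 2), n_levels // 2 - 1)
    return codes * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # toy weight matrix

# Incoherence pre-processing: W' = U W V^T with random orthogonal U, V
U = random_orthogonal(8, rng)
V = random_orthogonal(16, rng)
W_rot = U @ W @ V.T

W_q = quantize_nearest(W_rot)      # quantize in the incoherent basis

# Post-processing: undo the rotations to recover the usable weights
W_hat = U.T @ W_q @ V
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The rotations cost nothing in expressiveness (they are invertible) but spread each weight's magnitude across many coordinates, which is what makes aggressive 2-bit grids tolerable.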
Quick Start & Requirements
- Quantization is run through per-model entry scripts (opt.py, main.py).
- Models are specified by their Hugging Face identifier (e.g., facebook/opt-125m).
- Use --lazy_batch for memory efficiency.
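As an illustration, a plausible invocation under the OPTQ-style CLI the repo inherits might look like the following; the calibration set (c4) and the --wbits flag are assumptions carried over from the upstream OPTQ interface, not confirmed by this summary:

```
python opt.py facebook/opt-125m c4 --wbits 2 --incoh_processing --lazy_batch
```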
Maintenance & Community
The project is associated with Cornell-RelaxML. The repository is marked inactive, with its last update about a year ago; further community or maintenance details are not provided in the README.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Quantization algorithms can be slow on larger models due to low compute-to-memory-access ratios. The README mentions evaluation with a fixed context length (2048) for Llama-2, which may need adjustment.
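For context, fixed-context evaluation in OPTQ-style harnesses typically chops the test set into non-overlapping windows of a set length. The sketch below uses illustrative names (eval_ppl is not the repo's API) to show where the 2048 context length enters and what would need changing for other models:

```python
import torch

@torch.no_grad()
def eval_ppl(model, token_ids, seqlen=2048, device="cuda"):
    # Perplexity over non-overlapping fixed-length windows; token_ids is
    # a (1, N) tensor of the full test set. seqlen is the knob to adjust.
    nlls = []
    n_windows = token_ids.numel() // seqlen
    for i in range(n_windows):
        chunk = token_ids[:, i * seqlen:(i + 1) * seqlen].to(device)
        out = model(chunk, labels=chunk)   # HF causal LM returns mean NLL
        nlls.append(out.loss * seqlen)     # approximate window total
    return torch.exp(torch.stack(nlls).sum() / (n_windows * seqlen))
```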