ipex-llm by Intel

LLM acceleration library for Intel XPU (GPU, NPU, CPU)

Created 9 years ago
8,431 stars

Top 6.1% on SourcePulse

Project Summary

This library accelerates local LLM inference and fine-tuning on Intel hardware, targeting developers and researchers seeking to leverage Intel GPUs (iGPU, Arc, Flex, Max), NPUs, and CPUs. It offers seamless integration with popular LLM frameworks and tools, enabling efficient deployment of a wide range of LLMs with advanced optimizations and low-bit quantization.

How It Works

IPEX-LLM leverages the Intel Extension for PyTorch (IPEX) to optimize LLM operations for Intel's XPU architecture. It implements state-of-the-art LLM optimizations, including low-bit quantization (INT4, FP8, FP6) and techniques such as Self-Speculative Decoding, to boost inference speed and reduce memory footprint. The library also supports distributed inference strategies such as pipeline parallelism for running larger models across multiple Intel GPUs.
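
As a rough illustration of the low-bit path, the sketch below loads a model through ipex-llm's HuggingFace-style AutoModelForCausalLM with 4-bit weights and runs generation on an Intel GPU. The import path, the load_in_4bit flag, and the "xpu" device string follow the project's documented drop-in API, but exact names may differ between releases, so treat this as a sketch rather than a verified recipe.

# Minimal sketch: INT4 inference on an Intel GPU via ipex-llm's
# HuggingFace-compatible API (parameter names assumed; check the quickstart guides).
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers

model_id = "meta-llama/Llama-2-7b-chat-hf"  # any of the verified models

# Weights are quantized to 4-bit at load time, cutting the memory footprint.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model = model.to("xpu")  # move the quantized model onto the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("What is speculative decoding?", return_tensors="pt").to("xpu")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))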

Quick Start & Requirements

  • Installation: Typically via pip install ipex-llm; dedicated setup guides for Windows GPU, Linux GPU, and NPU are available.
  • Prerequisites: Python and PyTorch. Intel hardware (GPU/NPU) is required for hardware acceleration; CUDA is not required. A quick environment check is sketched after this list.
  • Resources: Setup time varies; running LLMs requires significant VRAM/RAM depending on the model size and quantization.
  • Links: Quickstart Guides, Verified Models
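
Before loading models, it can help to confirm that PyTorch actually sees an Intel GPU. The check below assumes an XPU-enabled install where importing intel_extension_for_pytorch registers the torch.xpu device (recent PyTorch releases also expose torch.xpu natively); the calls mirror the familiar CUDA device-query API but should be treated as a sketch.

# Quick environment check (sketch): verify that an Intel GPU is visible.
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    # Mirrors the CUDA-style device-query API.
    print("Device name:", torch.xpu.get_device_name(0))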

Highlighted Details

  • Supports over 70 LLM models, including Llama, Mistral, Mixtral, Gemma, and Qwen.
  • Offers low-bit quantization (INT4, FP8, FP6, INT2) for reduced memory and faster inference.
  • Provides seamless integration with llama.cpp, Ollama, HuggingFace Transformers, LangChain, LlamaIndex, vLLM, and DeepSpeed.
  • Includes support for fine-tuning techniques like LoRA, QLoRA, DPO, QA-LoRA, and ReLoRA on Intel GPUs (a minimal sketch follows this list).
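
As a rough sketch of the fine-tuning path mentioned above, the example below attaches LoRA adapters to a 4-bit base model on an Intel GPU. The ipex_llm.transformers.qlora helpers and the load_in_low_bit="nf4" argument are modeled on the project's published fine-tuning examples; treat the exact import paths and argument names as assumptions and confirm them against the current docs.

# QLoRA-style fine-tuning sketch on an Intel GPU (import paths assumed;
# see the ipex-llm fine-tuning examples for the exact, current API).
import torch
from transformers import AutoTokenizer
from peft import LoraConfig
from ipex_llm.transformers import AutoModelForCausalLM
from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"

# Load the base model with 4-bit (NF4) weights so fine-tuning fits in limited VRAM.
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_low_bit="nf4",
                                             torch_dtype=torch.bfloat16)
model = prepare_model_for_kbit_training(model)
model = model.to("xpu")

# Only the small LoRA adapter matrices are trained; the base weights stay frozen.
lora_config = LoraConfig(r=8, lora_alpha=16,
                         target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, bias="none",
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# From here, a standard transformers Trainer / TRL SFTTrainer loop applies.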

Maintenance & Community

  • Actively developed by Intel.
  • Supersedes bigdl-llm; migration notes from bigdl-llm are provided.
  • Support channels via GitHub Issues.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Performance optimizations are primarily targeted at Intel hardware; performance on non-Intel products may vary.
  • Experimental NPU support is available for Intel Core Ultra processors.
  • Some advanced features like INT2 quantization are based on specific llama.cpp mechanisms.
Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 12

Star History

  • 79 stars in the last 30 days

Explore Similar Projects

fastllm by ztxz16
  • High-performance C++ LLM inference library
  • 0.4% · 4k stars · Created 2 years ago · Updated 1 week ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

MiniCPM by OpenBMB
  • Ultra-efficient LLMs for end devices, achieving 5x+ speedup
  • 0.1% · 8k stars · Created 1 year ago · Updated 3 weeks ago
  • Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

unsloth by unslothai
  • Finetuning tool for LLMs, targeting speed and memory efficiency
  • 0.6% · 48k stars · Created 1 year ago · Updated 5 hours ago
  • Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 40 more.