ipex-llm by intel

LLM acceleration library for Intel XPU (GPU, NPU, CPU)

created 9 years ago
8,166 stars

Top 6.4% on sourcepulse

View on GitHub
Project Summary

This library accelerates local LLM inference and fine-tuning on Intel hardware, targeting developers and researchers seeking to leverage Intel GPUs (iGPU, Arc, Flex, Max), NPUs, and CPUs. It offers seamless integration with popular LLM frameworks and tools, enabling efficient deployment of a wide range of LLMs with advanced optimizations and low-bit quantization.

How It Works

IPEX-LLM leverages Intel Extension for PyTorch (IPEX) to optimize LLM operations for Intel's XPU architecture. It implements state-of-the-art LLM optimizations, including low-bit quantization (INT4, FP8, FP6) and techniques such as Self-Speculative Decoding, to significantly boost inference speed and reduce memory footprint. The library also supports distributed inference strategies such as pipeline parallelism for running larger models across multiple Intel GPUs.
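
As a concrete illustration of the low-bit path, the sketch below loads a checkpoint with INT4 weights through the library's HuggingFace-style wrapper. The module path and the load_in_low_bit value follow the project's documented API but should be verified against the current docs; the model name is only a placeholder.

    # Minimal low-bit loading sketch (names per ipex-llm docs; verify before use).
    from ipex_llm.transformers import AutoModelForCausalLM

    # Weights are quantized on the fly at load time: "sym_int4" requests symmetric
    # INT4; other documented low-bit formats (e.g. FP8) use other strings.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",   # placeholder checkpoint
        load_in_low_bit="sym_int4",
        trust_remote_code=True,
    )
    model = model.to("xpu")  # move the quantized model to the Intel GPU device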

Quick Start & Requirements

  • Installation: typically via pip install ipex-llm, with platform-specific extras and index URLs covered in the dedicated Windows GPU, Linux GPU, and NPU guides (see the quick-start sketch after this list).
  • Prerequisites: Python and PyTorch; Intel GPU or NPU hardware is required for hardware acceleration. CUDA is not required.
  • Resources: Setup time varies; running LLMs requires significant VRAM/RAM depending on the model size and quantization.
  • Links: Quickstart Guides, Verified Models
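
A minimal end-to-end quick-start sketch for an Intel GPU is shown below. The ipex-llm[xpu] extra, the example model, and the prompt are illustrative assumptions; the exact install command (extras, extra index URLs) differs per platform, so follow the linked quickstart guides.

    # Hypothetical quick-start for Intel GPU (follow the official guides for the
    # exact install command, e.g.: pip install --pre --upgrade "ipex-llm[xpu]").
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM

    model_path = "Qwen/Qwen2-1.5B-Instruct"  # illustrative; see the verified-models list
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, load_in_4bit=True, trust_remote_code=True
    ).to("xpu")

    inputs = tokenizer("What does ipex-llm accelerate?", return_tensors="pt").to("xpu")
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))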

Highlighted Details

  • Supports over 70 LLMs, including Llama, Mistral, Mixtral, Gemma, and Qwen.
  • Offers low-bit quantization (INT4, FP8, FP6, INT2) for reduced memory and faster inference.
  • Provides seamless integration with llama.cpp, Ollama, HuggingFace Transformers, LangChain, LlamaIndex, vLLM, and DeepSpeed.
  • Includes support for fine-tuning techniques like LoRA, QLoRA, DPO, QA-LoRA, and ReLoRA on Intel GPUs.

Maintenance & Community

  • Actively developed by Intel.
  • Formerly published as bigdl-llm; migration guidance is provided.
  • Support channels via GitHub Issues.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Performance optimizations are specific to Intel hardware; the library provides no acceleration on non-Intel GPUs.
  • NPU support (Intel Core Ultra processors) is still experimental.
  • Some advanced features, such as INT2 quantization, depend on llama.cpp's quantization mechanisms.
Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 11
  • Issues (30d): 20
  • Star history: 377 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

Top 2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago
updated 13 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

Top 0.6% · 11k stars
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago
updated 17 hours ago