intel-extension-for-transformers by intel

Transformer toolkit for GenAI/LLM acceleration on Intel platforms

created 2 years ago
2,169 stars

Top 21.2% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

This toolkit accelerates Transformer-based models, particularly Large Language Models (LLMs), across Intel hardware (Gaudi2, CPUs, GPUs). It targets developers and researchers seeking to optimize LLM performance through advanced compression techniques and provides a customizable chatbot framework, NeuralChat.
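
As a taste of the NeuralChat side, here is a minimal sketch following the repository's documented quickstart (the default model is downloaded on first use; exact defaults may change between releases):

```python
from intel_extension_for_transformers.neural_chat import build_chatbot

# build_chatbot() with no arguments loads the project's default chat model.
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
```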

How It Works

The extension integrates with Hugging Face Transformers, leveraging Intel® Neural Compressor for model compression. It employs advanced software optimizations and custom runtimes, including techniques from published research such as "Fast DistilBERT on CPUs" and "QuaLA-MiniLM," to achieve efficient inference and fine-tuning. It also offers a C/C++ inference engine with weight-only quantization kernels for Intel CPUs and GPUs.
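
Because the integration is a drop-in replacement for the Hugging Face API, the usual transformers workflow applies. A minimal sketch of INT4 weight-only inference, following the repository's example; the model id and prompt are illustrative:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Illustrative model id; any Hugging Face causal LM should work the same way.
model_name = "Intel/neural-chat-7b-v3-1"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, there existed a little girl,",
                   return_tensors="pt").input_ids

# load_in_4bit=True applies weight-only INT4 quantization at load time.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```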

Quick Start & Requirements

  • Install: pip install intel-extension-for-transformers
  • Prerequisites: Specific PyTorch versions (e.g., 2.0.1+cpu, 2.0.1a0 for GPU), Transformers (4.35.2 for CPU, 4.31.0 for Intel GPU), and Intel-specific drivers (Synapse AI, intel-level-zero-gpu) are required. Detailed requirements are in requirements_cpu.txt, requirements_hpu.txt, and requirements_xpu.txt.
  • Validated OS: Ubuntu 20.04/22.04, CentOS 8.
  • Docs: Installation Guide, Examples

Highlighted Details

  • Supports INT4 inference on Intel GPUs (Data Center GPU Max Series, Arc A-Series) and CPUs (including Meteor Lake).
  • Features NeuralChat, a chatbot framework with plugins for retrieval, speech, caching, and security.
  • Offers OpenAI-compatible RESTful APIs for the NeuralChat server (see the sketch after this list).
  • Provides optimized inference for various LLMs (e.g., Llama, Qwen, GPT-J) using weight-only quantization, and supports CPU instruction sets such as AMX, VNNI, AVX512F, and AVX2.
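
Since the NeuralChat server speaks the OpenAI REST dialect, any OpenAI-style client can talk to it. A sketch using plain requests; the port, route, and model name are assumptions rather than values confirmed by this page:

```python
import requests

# Assumes a NeuralChat server is already running locally and exposing the
# OpenAI-compatible chat-completions route; the port and model name below
# are placeholders, not values confirmed by this page.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Intel/neural-chat-7b-v3-1",
        "messages": [{"role": "user",
                      "content": "What is weight-only quantization?"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```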

Maintenance & Community

The last commit landed 9 months ago, and the past 30 days show no pull-request or issue activity (see Health Check below); historical maintainer responsiveness is around 1 day.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Some features, like FP8 fine-tuning on Gaudi2 and INT8 inference on Intel Data Center GPU Max Series, are marked as "WIP" (Work in Progress).
  • GGUF format support is limited to the Q4_0, Q5_0, and Q8_0 quantization variants (see the loading sketch below).
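
Loading a GGUF model within the supported variants looks like the repository's GGUF example; the model id, file name, and tokenizer source below are illustrative assumptions:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Illustrative ids: any GGUF file in a supported variant (Q4_0/Q5_0/Q8_0)
# should work; the tokenizer comes from the original, non-GGUF model.
model_id = "TheBloke/Llama-2-7B-Chat-GGUF"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    model_id, model_file="llama-2-7b-chat.Q4_0.gguf")

inputs = tokenizer("Hello,", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(inputs, max_new_tokens=20)[0]))
```
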
Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

  • 8 stars in the last 90 days

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

Explore Similar Projects

ktransformers by kvcache-ai — framework for LLM inference optimization experimentation. 15k stars (top 0.4% on sourcepulse), created 1 year ago, updated 2 days ago.