intel-extension-for-transformers by intel

Transformer toolkit for GenAI/LLM acceleration on Intel platforms

Created 2 years ago
2,170 stars

Top 20.8% on SourcePulse

Project Summary

This toolkit accelerates Transformer-based models, particularly Large Language Models (LLMs), across Intel hardware (Gaudi2, CPUs, GPUs). It targets developers and researchers seeking to optimize LLM performance through advanced compression techniques and provides a customizable chatbot framework, NeuralChat.

How It Works

The extension integrates with Hugging Face Transformers, leveraging Intel® Neural Compressor for model compression. It applies advanced software optimizations and custom runtimes, drawing on techniques from published research such as "Fast DistilBERT on CPUs" and "QuaLA-MiniLM," to achieve efficient inference and fine-tuning. It also offers a C/C++ inference engine with weight-only quantization kernels for Intel CPUs and GPUs.
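As a hedged sketch of that workflow: the snippet below mirrors the project's Transformers-style drop-in API, loading a causal LM with weight-only INT4 quantization. The model name and prompt are illustrative, and keyword arguments such as load_in_4bit may vary between releases.

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Illustrative checkpoint; any supported Hugging Face causal LM works the same way.
model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit requests weight-only INT4 quantization on supported Intel hardware.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```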

Quick Start & Requirements

  • Install: pip install intel-extension-for-transformers
  • Prerequisites: Specific PyTorch versions (e.g., 2.0.1+cpu, 2.0.1a0 for GPU), Transformers (4.35.2 for CPU, 4.31.0 for Intel GPU), and Intel-specific drivers (Synapse AI, intel-level-zero-gpu) are required. Detailed requirements are in requirements_cpu.txt, requirements_hpu.txt, and requirements_xpu.txt.
  • Validated OS: Ubuntu 20.04/22.04, CentOS 8.
  • Docs: Installation Guide, Examples

Highlighted Details

  • Supports INT4 inference on Intel GPUs (Data Center GPU Max Series, Arc A-Series) and CPUs (including Meteor Lake).
  • Features NeuralChat, a chatbot framework with plugins for retrieval, speech, caching, and security (see the sketch after this list).
  • Offers OpenAI-compatible RESTful APIs for the NeuralChat server.
  • Provides optimized inference for various LLMs (e.g., Llama, Qwen, GPT-J) using weight-only quantization and supports CPU instruction sets like AMX, VNNI, AVX512F, AVX2.
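As a minimal sketch of the NeuralChat Python API (the default configuration and the query string are illustrative; plugins such as retrieval and speech are enabled through an optional configuration object per the project's docs):

```python
from intel_extension_for_transformers.neural_chat import build_chatbot

# Build a chatbot with the default configuration; retrieval, speech, caching,
# and security plugins can be enabled via an optional pipeline config.
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
```

Served as a standalone process, NeuralChat exposes OpenAI-compatible REST endpoints, so existing OpenAI client code can be pointed at the local server instead.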

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Some features, like FP8 fine-tuning on Gaudi2 and INT8 inference on Intel Data Center GPU Max Series, are marked as "WIP" (Work in Progress).
  • GGUF format support is limited to the Q4_0/Q5_0/Q8_0 variants (see the sketch below).
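For the GGUF path, a hedged sketch under the assumption that the extension's AutoModelForCausalLM accepts a model_file argument for GGUF checkpoints (the repository and file names are examples only):

```python
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# The GGUF file must be one of the supported variants: Q4_0, Q5_0, or Q8_0.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",         # illustrative Hugging Face repo
    model_file="llama-2-7b-chat.Q4_0.gguf",  # illustrative Q4_0 file
)
```
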
Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days

Explore Similar Projects

parallelformers by tunib-ai
Toolkit for easy model parallelization
790 stars · Created 4 years ago · Updated 2 years ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

neural-compressor by intel
Python library for model compression (quantization, pruning, distillation, NAS)
2k stars · Created 5 years ago · Updated 15 hours ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

FasterTransformer by NVIDIA
Optimized transformer library for inference
6k stars · Created 4 years ago · Updated 1 year ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.