intel-extension-for-transformers by intel

Transformer toolkit for GenAI/LLM acceleration on Intel platforms

Created 2 years ago
2,170 stars

Top 20.8% on SourcePulse

Project Summary

This toolkit accelerates Transformer-based models, particularly Large Language Models (LLMs), across Intel hardware (Gaudi2, CPUs, GPUs). It targets developers and researchers seeking to optimize LLM performance through advanced compression techniques and provides a customizable chatbot framework, NeuralChat.

How It Works

The extension integrates with Hugging Face Transformers, leveraging Intel® Neural Compressor for model compression. It applies advanced software optimizations and custom runtimes, drawing on techniques from published research such as "Fast DistilBERT on CPUs" and "QuaLA-MiniLM," to achieve efficient inference and fine-tuning. It also offers a C/C++ inference engine with weight-only quantization kernels for Intel CPUs and GPUs.
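As a hedged sketch of that workflow: the snippet below mirrors the project's Transformers-style drop-in API, loading a causal LM with weight-only INT4 quantization. The model name and prompt are illustrative, and keyword arguments such as load_in_4bit may vary between releases.

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Illustrative checkpoint; any supported Hugging Face causal LM works the same way.
model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit requests weight-only INT4 quantization on supported Intel hardware.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```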

Quick Start & Requirements

  • Install: pip install intel-extension-for-transformers
  • Prerequisites: Specific PyTorch versions (e.g., 2.0.1+cpu, 2.0.1a0 for GPU), Transformers (4.35.2 for CPU, 4.31.0 for Intel GPU), and Intel-specific drivers (Synapse AI, intel-level-zero-gpu) are required. Detailed requirements are in requirements_cpu.txt, requirements_hpu.txt, and requirements_xpu.txt.
  • Validated OS: Ubuntu 20.04/22.04, CentOS 8.
  • Docs: Installation Guide, Examples

Highlighted Details

  • Supports INT4 inference on Intel GPUs (Data Center GPU Max Series, Arc A-Series) and CPUs (including Meteor Lake).
  • Features NeuralChat, a chatbot framework with plugins for retrieval, speech, caching, and security (see the sketch after this list).
  • Offers OpenAI-compatible RESTful APIs for the NeuralChat server.
  • Provides optimized inference for various LLMs (e.g., Llama, Qwen, GPT-J) using weight-only quantization and supports CPU instruction sets like AMX, VNNI, AVX512F, AVX2.
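As a minimal sketch of the NeuralChat Python API (the default configuration and the query string are illustrative; plugins such as retrieval and speech are enabled through an optional configuration object per the project's docs):

```python
from intel_extension_for_transformers.neural_chat import build_chatbot

# Build a chatbot with the default configuration; retrieval, speech, caching,
# and security plugins can be enabled via an optional pipeline config.
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
```

Served as a standalone process, NeuralChat exposes OpenAI-compatible REST endpoints, so existing OpenAI client code can be pointed at the local server instead.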

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Some features, like FP8 fine-tuning on Gaudi2 and INT8 inference on Intel Data Center GPU Max Series, are marked as "WIP" (Work in Progress).
  • GGUF format support is limited to the Q4_0/Q5_0/Q8_0 variants (see the sketch below).
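For the GGUF path, a hedged sketch under the assumption that the extension's AutoModelForCausalLM accepts a model_file argument for GGUF checkpoints (the repository and file names are examples only):

```python
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# The GGUF file must be one of the supported variants: Q4_0, Q5_0, or Q8_0.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",         # illustrative Hugging Face repo
    model_file="llama-2-7b-chat.Q4_0.gguf",  # illustrative Q4_0 file
)
```
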
Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days

Explore Similar Projects

parallelformers by tunib-ai
Toolkit for easy model parallelization
790 stars · Created 4 years ago · Updated 2 years ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

neural-compressor by intel
Python library for model compression (quantization, pruning, distillation, NAS)
2k stars · Created 5 years ago · Updated 15 hours ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

FasterTransformer by NVIDIA
Optimized transformer library for inference
6k stars · Created 4 years ago · Updated 1 year ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.