intel-extension-for-transformers by intel

Transformer toolkit for GenAI/LLM acceleration on Intel platforms

Created 3 years ago
2,165 stars

Top 20.8% on SourcePulse

Project Summary

This toolkit accelerates Transformer-based models, particularly Large Language Models (LLMs), across Intel hardware (Gaudi2, CPUs, GPUs). It targets developers and researchers seeking to optimize LLM performance through advanced compression techniques and provides a customizable chatbot framework, NeuralChat.

How It Works

The extension integrates with Hugging Face Transformers and leverages Intel® Neural Compressor for model compression. It combines software optimizations with custom runtimes, drawing on techniques from published research such as "Fast DistilBERT on CPUs" and "QuaLA-MiniLM," to deliver efficient inference and fine-tuning. It also ships a C/C++ inference engine with weight-only quantization kernels for Intel CPUs and GPUs.
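
The snippet below is a minimal sketch of that drop-in integration, showing INT4 (weight-only) inference through the extension's Transformers-style API. The checkpoint name and the load_in_4bit flag follow the project's documented examples; exact APIs may vary between releases.

    # Sketch: 4-bit weight-only inference via the extension's drop-in API.
    from transformers import AutoTokenizer
    from intel_extension_for_transformers.transformers import AutoModelForCausalLM

    model_name = "Intel/neural-chat-7b-v3-1"  # example checkpoint from the docs
    prompt = "Once upon a time, there existed a little girl,"

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # load_in_4bit applies weight-only quantization while loading the model.
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
    outputs = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))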

Quick Start & Requirements

  • Install: pip install intel-extension-for-transformers
  • Prerequisites: specific PyTorch versions (e.g., 2.0.1+cpu for CPU, 2.0.1a0 for GPU), matching Transformers versions (4.35.2 for CPU, 4.31.0 for Intel GPU), and Intel-specific software stacks (SynapseAI for Gaudi2, intel-level-zero-gpu for GPUs). Detailed requirements are in requirements_cpu.txt, requirements_hpu.txt, and requirements_xpu.txt.
  • Validated OS: Ubuntu 20.04/22.04, CentOS 8.
  • Docs: Installation Guide, Examples

Highlighted Details

  • Supports INT4 inference on Intel GPUs (Data Center GPU Max Series, Arc A-Series) and CPUs (including Meteor Lake).
  • Features NeuralChat, a customizable chatbot framework with plugins for retrieval, speech, caching, and security (a usage sketch follows this list).
  • Offers OpenAI-compatible RESTful APIs for the NeuralChat server.
  • Provides optimized inference for various LLMs (e.g., Llama, Qwen, GPT-J) using weight-only quantization and supports CPU instruction sets like AMX, VNNI, AVX512F, AVX2.
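
As a rough sketch of the NeuralChat Python API (names follow the project's quick-start docs and should be treated as assumptions for your installed version):

    # Sketch: build a default NeuralChat chatbot and query it locally.
    from intel_extension_for_transformers.neural_chat import build_chatbot

    chatbot = build_chatbot()  # default config; retrieval/speech/caching plugins are opt-in
    response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
    print(response)

The same framework can also be launched as a server to expose the OpenAI-compatible RESTful endpoints mentioned above; see the NeuralChat documentation for server configuration.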

Maintenance & Community

Activity has stalled: the last commit landed about a year ago, and no pull requests or issues were opened in the past 30 days (see Health Check below).

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Some features, like FP8 fine-tuning on Gaudi2 and INT8 inference on Intel Data Center GPU Max Series, are marked as "WIP" (Work in Progress).
  • GGUF format support is limited to the Q4_0/Q5_0/Q8_0 quantization variants; see the sketch below for loading one.
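
For the supported variants, GGUF checkpoints load through the same Transformers-style entry point; the sketch below uses an illustrative repository and file name, with the model_file argument taken from the project's GGUF example.

    # Sketch: load a Q4_0 GGUF file (one of the supported variants).
    from intel_extension_for_transformers.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-Chat-GGUF",         # illustrative GGUF repository
        model_file="llama-2-7b-chat.Q4_0.gguf",  # must be a Q4_0/Q5_0/Q8_0 file
    )
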
Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days

Explore Similar Projects

parallelformers by tunib-ai — Toolkit for easy model parallelization
  • 790 stars · Top 0% on SourcePulse
  • Created 4 years ago · Updated 2 years ago
  • Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

neural-compressor by intel — Python library for model compression (quantization, pruning, distillation, NAS)
  • 3k stars · Top 0.1% on SourcePulse
  • Created 5 years ago · Updated 5 hours ago
  • Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

FasterTransformer by NVIDIA — Optimized transformer library for inference
  • 6k stars · Top 0.1% on SourcePulse
  • Created 4 years ago · Updated 1 year ago
  • Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.