Fast inference engine for Transformer models
CTranslate2 is a C++ and Python library designed for fast and memory-efficient inference of Transformer models. It targets researchers and production systems needing to deploy models like BERT, GPT, Llama, and Whisper with optimized performance on both CPU and GPU. The library achieves significant speedups and reduced memory footprint through techniques like quantization, layer fusion, and dynamic memory management.
How It Works
CTranslate2 employs a custom runtime that integrates numerous performance optimizations. Key among these are weight quantization (FP16, BF16, INT8, and INT4 via AWQ), layer fusion to reduce kernel launch overhead, padding removal, batch reordering, and in-place operations. It supports multiple CPU architectures (x86-64, AArch64) with optimized backends (MKL, oneDNN, OpenBLAS, Ruy, Accelerate) and automatic runtime dispatching. On GPU, it supports FP16 and INT8 precision and offers tensor parallelism for distributed inference.
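As a rough illustration, the quantization level is chosen when a converted model is loaded, through the compute_type argument. The directory name below ("gpt2_ct2") is only a placeholder for any model already converted to the CTranslate2 format:

import ctranslate2

# Load a converted model with INT8 weights and FP16 activations on GPU.
# Other compute_type values include "int8", "float16", and "bfloat16".
generator = ctranslate2.Generator("gpt2_ct2", device="cuda", compute_type="int8_float16")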
Quick Start & Requirements
pip install ctranslate2
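A model must first be converted to the CTranslate2 format, for example with the bundled Transformers converter:

ct2-transformers-converter --model gpt2 --output_dir gpt2_ct2 --quantization int8

The converted directory can then be loaded for generation. The sketch below assumes the transformers package is installed for tokenization and uses GPT-2 purely as an example:

import ctranslate2
import transformers

# Load the converted model (CPU by default; pass device="cuda" for GPU).
generator = ctranslate2.Generator("gpt2_ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")

# Tokenize a prompt, sample a continuation, and decode it back to text.
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, I am"))
results = generator.generate_batch([start_tokens], max_length=30, sampling_topk=10)
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(results[0].sequences[0])))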
Maintenance & Community
The project is actively maintained by the OpenNMT team. Community support is available via their forum and Gitter channel.
Licensing & Compatibility
CTranslate2 is released under the MIT License, permitting commercial use and integration with closed-source applications.
Limitations & Caveats
Models must be converted to CTranslate2's optimized format before inference. Backward compatibility is guaranteed for the core API, but experimental features may change. Performance gains depend on the specific model architecture and hardware configuration.
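Since the best quantization level varies by hardware, one way to check what the current machine can run is to query the library directly (a small sketch using the public helper; the CUDA call requires a CUDA-enabled build and a visible GPU):

import ctranslate2

# List the compute types usable on this machine; at load time, an unsupported
# compute_type is mapped to the closest supported one.
print(ctranslate2.get_supported_compute_types("cpu"))
print(ctranslate2.get_supported_compute_types("cuda"))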