Fast inference engine for Transformer models
CTranslate2 is a C++ and Python library designed for fast and memory-efficient inference of Transformer models. It targets researchers and production systems needing to deploy models like BERT, GPT, Llama, and Whisper with optimized performance on both CPU and GPU. The library achieves significant speedups and reduced memory footprint through techniques like quantization, layer fusion, and dynamic memory management.
How It Works
CTranslate2 employs a custom runtime that integrates numerous performance optimizations. Key among these are weight quantization (FP16, BF16, INT8, and INT4 via AWQ), layer fusion to reduce kernel launch overhead, padding removal, batch reordering, and in-place operations. It supports multiple CPU architectures (x86-64, AArch64) with optimized backends (MKL, oneDNN, OpenBLAS, Ruy, Accelerate) and automatic runtime dispatching. On GPU, it supports FP16 and INT8 precision and offers tensor parallelism for distributed inference.
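As a rough illustration, the quantization level is chosen when a converted model is loaded, through the compute_type argument. The directory name below ("gpt2_ct2") is only a placeholder for any model already converted to the CTranslate2 format:

import ctranslate2

# Load a converted model with INT8 weights and FP16 activations on GPU.
# Other compute_type values include "int8", "float16", and "bfloat16".
generator = ctranslate2.Generator("gpt2_ct2", device="cuda", compute_type="int8_float16")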
Quick Start & Requirements
pip install ctranslate2
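A model must first be converted to the CTranslate2 format, for example with the bundled Transformers converter:

ct2-transformers-converter --model gpt2 --output_dir gpt2_ct2 --quantization int8

The converted directory can then be loaded for generation. The sketch below assumes the transformers package is installed for tokenization and uses GPT-2 purely as an example:

import ctranslate2
import transformers

# Load the converted model (CPU by default; pass device="cuda" for GPU).
generator = ctranslate2.Generator("gpt2_ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")

# Tokenize a prompt, sample a continuation, and decode it back to text.
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, I am"))
results = generator.generate_batch([start_tokens], max_length=30, sampling_topk=10)
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(results[0].sequences[0])))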
Maintenance & Community
The project is actively maintained by the OpenNMT team. Community support is available via their forum and Gitter channel.
Licensing & Compatibility
CTranslate2 is released under the MIT License, permitting commercial use and integration with closed-source applications.
Limitations & Caveats
Models must be converted to CTranslate2's optimized format before inference. Backward compatibility is guaranteed for the core API, but experimental features may change. Performance gains depend on the specific model architecture and hardware configuration.
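Since the best quantization level varies by hardware, one way to check what the current machine can run is to query the library directly (a small sketch using the public helper; the CUDA call requires a CUDA-enabled build and a visible GPU):

import ctranslate2

# List the compute types usable on this machine; at load time, an unsupported
# compute_type is mapped to the closest supported one.
print(ctranslate2.get_supported_compute_types("cpu"))
print(ctranslate2.get_supported_compute_types("cuda"))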