CTranslate2 by OpenNMT

Fast inference engine for Transformer models

Created 6 years ago
4,496 stars

Top 10.9% on SourcePulse

Project Summary

CTranslate2 is a C++ and Python library designed for fast and memory-efficient inference of Transformer models. It targets researchers and production systems needing to deploy models like BERT, GPT, Llama, and Whisper with optimized performance on both CPU and GPU. The library achieves significant speedups and reduced memory footprint through techniques like quantization, layer fusion, and dynamic memory management.

How It Works

CTranslate2 employs a custom runtime that integrates numerous performance optimizations. Key among these are weight quantization and reduced precision (FP16, BF16, INT8, and INT4 via AWQ), layer fusion to reduce kernel launch overhead, padding removal, batch reordering, and in-place operations. On CPU, it supports x86-64 and AArch64 with optimized backends (Intel MKL, oneDNN, OpenBLAS, Ruy, Apple Accelerate) and automatic runtime dispatching. On GPU, it supports FP16 and INT8 precision, with tensor parallelism available for distributed inference.
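To make the INT8 weight quantization mentioned above concrete, here is a minimal pure-Python sketch of symmetric per-row quantization. It illustrates the general technique only; the function names are hypothetical and this is not CTranslate2's actual kernel code, which may differ in scaling and rounding details.

```python
from typing import List, Tuple


def quantize_row_int8(row: List[float]) -> Tuple[List[int], float]:
    """Quantize one weight row to signed 8-bit integers with a per-row scale."""
    max_abs = max((abs(w) for w in row), default=0.0)
    # Map the largest magnitude in the row to 127; an all-zero row keeps scale 1.0.
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w * scale))) for w in row]
    return quantized, scale


def dequantize_row(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the INT8 values."""
    return [q / scale for q in quantized]


# Hypothetical weight row used only for illustration.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_row_int8(weights)
restored = dequantize_row(q, scale)
print(q, scale, restored)
```

Each stored weight shrinks from 4 bytes (FP32) to 1 byte plus one shared scale per row, and the round-trip error per weight is at most half a quantization step (0.5 / scale).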

Quick Start & Requirements
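This listing gives no steps here, so the following is a hedged sketch of a typical first run rather than the project's official instructions. It assumes the `ctranslate2` pip package (plus `transformers` and `torch` for conversion), the `ct2-transformers-converter` command that ships with it, and an example Hugging Face model name chosen purely for illustration. The helper names `converter_command` and `translate_tokens` are hypothetical.

```python
import shlex
from typing import List, Optional


def converter_command(model: str, output_dir: str,
                      quantization: Optional[str] = None) -> List[str]:
    """Build the argument list for the Hugging Face Transformers converter CLI."""
    cmd = ["ct2-transformers-converter", "--model", model, "--output_dir", output_dir]
    if quantization:
        cmd += ["--quantization", quantization]
    return cmd


def translate_tokens(model_dir: str, tokens: List[str], device: str = "cpu") -> List[str]:
    """Translate one pre-tokenized sentence with an already converted model."""
    import ctranslate2  # lazy import; install with `pip install ctranslate2`

    translator = ctranslate2.Translator(model_dir, device=device)
    results = translator.translate_batch([tokens])
    return results[0].hypotheses[0]  # best hypothesis, as a list of tokens


# Example model name and output directory are assumptions for illustration.
cmd = converter_command("Helsinki-NLP/opus-mt-en-de", "opus-mt-en-de-ct2", "int8")
print(shlex.join(cmd))
```

Run the printed command in a shell to produce the converted directory, then pass that directory and the source tokens (produced by the model's own tokenizer) to `translate_tokens`.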

Highlighted Details

  • Reports up to 10x faster inference and up to 4x lower memory use than general-purpose frameworks such as TensorFlow and PyTorch, on both CPU and GPU.
  • Supports a wide range of Transformer architectures including Encoder-Decoder, Decoder-only, and Encoder-only models.
  • Offers interactive decoding features: autocompleting a partial sequence and returning alternatives at a specific position in the sequence.
  • Includes converters for popular frameworks like OpenNMT-py, Fairseq, Marian, and Hugging Face Transformers.
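The decoding features in the list above can be sketched with the Python `Translator.translate_batch` call, using its `target_prefix`, `num_hypotheses`, and `return_alternatives` arguments. The wrapper names `autocomplete` and `alternatives` are hypothetical, and the `translator` argument is assumed to be a `ctranslate2.Translator` loaded from a converted model.

```python
from typing import Any, List


def autocomplete(translator: Any, source: List[str], prefix: List[str]) -> List[str]:
    """Complete a partial translation: decoding is forced to start with `prefix`."""
    results = translator.translate_batch([source], target_prefix=[prefix])
    return results[0].hypotheses[0]


def alternatives(translator: Any, source: List[str], prefix: List[str],
                 count: int = 3) -> List[List[str]]:
    """Return `count` alternative continuations at the position after `prefix`."""
    results = translator.translate_batch(
        [source],
        target_prefix=[prefix],
        num_hypotheses=count,
        return_alternatives=True,
    )
    return results[0].hypotheses
```

Both helpers take token lists, so tokenization and detokenization stay with whatever tokenizer the original model uses.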

Maintenance & Community

The project is actively maintained by the OpenNMT team. Community support is available via their forum and Gitter channel.

Licensing & Compatibility

CTranslate2 is released under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Models must be converted to CTranslate2's optimized format before inference. While backward compatibility is guaranteed for the core API, experimental features may change. Performance gains are dependent on the specific model architecture and hardware configuration.
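Because unconverted checkpoints cannot be loaded, a quick guard before loading can save a confusing error. The sketch below assumes a converted directory contains a `model.bin` weights file, which is how converter output is typically laid out; `looks_converted` is a hypothetical helper, and where the library is installed its own `ctranslate2.contains_model` check is likely the better choice.

```python
import os
import tempfile


def looks_converted(model_dir: str) -> bool:
    """Return True if the directory contains a model.bin weights file."""
    return os.path.isfile(os.path.join(model_dir, "model.bin"))


# Self-check with a temporary directory standing in for a converted model.
with tempfile.TemporaryDirectory() as tmp:
    before = looks_converted(tmp)  # empty directory
    open(os.path.join(tmp, "model.bin"), "wb").close()
    after = looks_converted(tmp)   # directory now holds model.bin

print(before, after)  # prints "False True"
```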

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 14
  • Issues (30d): 10

Star History
54 stars in the last 30 days
