CTranslate2 by OpenNMT

Fast inference engine for Transformer models

created 5 years ago · 3,934 stars · Top 12.7% on sourcepulse

Project Summary

CTranslate2 is a C++ and Python library designed for fast and memory-efficient inference of Transformer models. It targets researchers and production systems needing to deploy models like BERT, GPT, Llama, and Whisper with optimized performance on both CPU and GPU. The library achieves significant speedups and reduced memory footprint through techniques like quantization, layer fusion, and dynamic memory management.

How It Works

CTranslate2 employs a custom runtime that integrates numerous performance optimizations. Key among these are weight quantization (FP16, BF16, INT8, INT4, AWQ), layer fusion to reduce kernel launch overhead, padding removal, batch reordering, and in-place operations. It supports multiple CPU architectures (x86-64, AArch64) with optimized backends (MKL, oneDNN, OpenBLAS, Ruy, Accelerate) and automatic runtime dispatching. On GPU, it supports FP16 and INT8 precision, with optional tensor parallelism for distributed inference.
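As an illustration of how these options surface in the Python API, here is a minimal sketch of loading a converted model with INT8 quantization; the model directory and tokens are placeholders, and real inputs would come from the model's own tokenizer:

```python
import ctranslate2

# compute_type selects the quantization applied at load time; CTranslate2
# falls back to the closest type supported by the detected hardware.
translator = ctranslate2.Translator(
    "ende_ct2",          # hypothetical directory holding a converted model
    device="cpu",
    compute_type="int8",
)

# Inputs are pre-tokenized; these tokens are illustrative only and would
# normally be produced by the model's tokenizer (e.g., SentencePiece).
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])
```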

Quick Start & Requirements
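
The library is installed from PyPI with pip install ctranslate2. Models must then be converted into CTranslate2's optimized format before inference. The sketch below uses the Hugging Face converter; the model name is illustrative, and any supported architecture works:

```python
# pip install ctranslate2 transformers
import ctranslate2.converters

# Convert a Hugging Face Transformers model to CTranslate2's format,
# quantizing the weights to INT8 in the process. Equivalent CLI:
#   ct2-transformers-converter --model facebook/nllb-200-distilled-600M \
#       --output_dir nllb_ct2 --quantization int8
converter = ctranslate2.converters.TransformersConverter(
    "facebook/nllb-200-distilled-600M"  # illustrative model name
)
converter.convert("nllb_ct2", quantization="int8")
```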

Highlighted Details

  • Achieves up to 10x faster inference and 4x memory reduction compared to standard frameworks like TensorFlow and PyTorch on CPU and GPU.
  • Supports a wide range of Transformer architectures including Encoder-Decoder, Decoder-only, and Encoder-only models.
  • Offers advanced decoding features like autocompletion and alternative sequence generation (see the sketch after this list).
  • Includes converters for popular frameworks like OpenNMT-py, Fairseq, Marian, and Hugging Face Transformers.
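
A hedged sketch of the autocompletion feature mentioned above, combining the target_prefix and return_alternatives options of translate_batch; the model directory and tokens are placeholders:

```python
import ctranslate2

translator = ctranslate2.Translator("ende_ct2")  # hypothetical converted model

# Force the translation to start with a given prefix, then request
# alternative continuations expanded at the end of that prefix.
results = translator.translate_batch(
    [["▁Hello", "▁world", "!"]],   # illustrative source tokens
    target_prefix=[["▁Hallo"]],    # illustrative prefix to autocomplete
    num_hypotheses=5,              # number of alternatives to return
    return_alternatives=True,
)
for hypothesis in results[0].hypotheses:
    print(hypothesis)
```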

Maintenance & Community

The project is actively maintained by the OpenNMT team. Community support is available via their forum and Gitter channel.

Licensing & Compatibility

CTranslate2 is released under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Models must be converted to CTranslate2's optimized format before inference. While backward compatibility is guaranteed for the core API, experimental features may change. Performance gains are dependent on the specific model architecture and hardware configuration.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 8
  • Issues (30d): 6
  • Star History: 173 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech
Top 2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 10 hours ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16
Top 0.4% · 4k stars
High-performance C++ LLM inference library
created 2 years ago · updated 2 weeks ago

Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 18 more.

unsloth by unslothai
Top 1.0% · 43k stars
Finetuning tool for LLMs, targeting speed and memory efficiency
created 1 year ago · updated 4 days ago