CTranslate2 by OpenNMT

Fast inference engine for Transformer models

Created 6 years ago · 4,022 stars · Top 12.2% on SourcePulse

Project Summary

CTranslate2 is a C++ and Python library for fast, memory-efficient inference of Transformer models. It targets researchers and production teams that need to deploy models such as BERT, GPT, Llama, and Whisper with optimized performance on both CPU and GPU. The library achieves significant speedups and a reduced memory footprint through techniques such as quantization, layer fusion, and dynamic memory management.

How It Works

CTranslate2 employs a custom runtime that integrates numerous performance optimizations. Key among these are weight quantization (FP16, BF16, INT8, INT4, AWQ), layer fusion to reduce kernel launch overhead, padding removal, batch reordering, and in-place operations. It supports multiple CPU architectures (x86-64, AArch64) with optimized backends (MKL, oneDNN, OpenBLAS, Ruy, Accelerate) and automatic runtime dispatching. On GPU, it supports FP16 and INT8 precision and offers tensor parallelism for distributed inference.
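
As a concrete illustration of the quantization options, the compute type can be selected when a model is loaded. A minimal sketch, assuming a converted decoder-only model in a placeholder "gpt2_ct2/" directory:

    import ctranslate2

    # compute_type selects the precision used at load time; "auto" picks the
    # fastest type supported by the target device.
    generator = ctranslate2.Generator(
        "gpt2_ct2/",                  # placeholder path to a converted model
        device="cuda",
        compute_type="int8_float16",  # INT8 weights with FP16 activations
    )

    # Decoder-only generation from an illustrative start token.
    results = generator.generate_batch(
        [["<|endoftext|>"]], max_length=30, sampling_topk=10
    )
    print(results[0].sequences[0])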

Quick Start & Requirements

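The package is published on PyPI as ctranslate2. Below is a minimal usage sketch based on the library's Python API, assuming a model already converted to the CTranslate2 format; the "ende_ctranslate2/" directory and SentencePiece-style tokens are placeholders.

    # Install: pip install ctranslate2
    import ctranslate2

    # Load a converted model directory ("ende_ctranslate2/" is a placeholder).
    translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")

    # Inputs are pre-tokenized; these tokens are illustrative.
    results = translator.translate_batch([["▁Hello", "▁world", "!"]])
    print(results[0].hypotheses[0])
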
Highlighted Details

  • Achieves up to 10x faster inference and 4x memory reduction compared to standard frameworks like TensorFlow and PyTorch on CPU and GPU.
  • Supports a wide range of Transformer architectures including Encoder-Decoder, Decoder-only, and Encoder-only models.
  • Offers advanced decoding features such as autocompletion via target prefixes and alternative sequence generation (see the sketch after this list).
  • Includes converters for popular frameworks such as OpenNMT-py, Fairseq, Marian, and Hugging Face Transformers.
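
A combined sketch of the last two points, using the Python converter API (the ct2-transformers-converter CLI is the equivalent command-line route) and the decoding options; the model name, output directory, and tokens are illustrative placeholders:

    import ctranslate2
    from ctranslate2.converters import TransformersConverter

    # Convert a Hugging Face model to the CTranslate2 format with INT8 weights
    # (requires the transformers package; the model name is only an example).
    TransformersConverter("Helsinki-NLP/opus-mt-en-de").convert(
        "opus_mt_ende_ct2/", quantization="int8"
    )

    translator = ctranslate2.Translator("opus_mt_ende_ct2/")

    # target_prefix constrains the start of the output (autocompletion);
    # return_alternatives expands hypotheses at the first unconstrained position.
    results = translator.translate_batch(
        [["▁Hello", "▁world"]],
        target_prefix=[["▁Hallo"]],
        beam_size=5,
        num_hypotheses=5,
        return_alternatives=True,
    )
    for hypothesis in results[0].hypotheses:
        print(hypothesis)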

Maintenance & Community

The project is actively maintained by the OpenNMT team. Community support is available via their forum and Gitter channel.

Licensing & Compatibility

CTranslate2 is released under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Models must be converted to CTranslate2's optimized format before inference. While backward compatibility is guaranteed for the core API, experimental features may change. Performance gains are dependent on the specific model architecture and hardware configuration.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 60 stars in the last 30 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Omar Sanseviero (DevRel at Google DeepMind), and 2 more.

local-gemma by huggingface
  376 stars · 0.3%
  CLI tool for local Gemma-2 inference
  Created 1 year ago · Updated 1 year ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai
  790 stars · 0%
  Toolkit for easy model parallelization
  Created 4 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16
  4k stars · 0.4%
  High-performance C++ LLM inference library
  Created 2 years ago · Updated 1 week ago
Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

ktransformers by kvcache-ai
  15k stars · 0.3%
  Framework for LLM inference optimization experimentation
  Created 1 year ago · Updated 2 days ago