torchless by ryanssenn

Custom C++ LLM inference engine for local text completion

Created 7 months ago · 278 stars · Top 93.5% on SourcePulse

View on GitHub
Project Summary

Torchless is an LLM inference engine written from scratch in C/C++. It targets engineers and power users who want a lightweight, fast, and transparent runtime for large language models, and it currently demonstrates local text completion with Mistral 7B on CPU. Its primary benefit is a foundational, hand-coded engine for understanding and optimizing LLM inference performance without external dependencies such as PyTorch.

How It Works

Torchless employs a ground-up approach, starting with a Python script (export_mistral.py) to convert Hugging Face model weights into a single, standardized binary file. This binary is then memory-mapped by the C++ engine for efficient loading. The inference process involves BPE tokenization of input prompts into integer IDs, followed by a transformer loop. This loop processes token IDs through embedding, 32 layers incorporating RMSNorm, Grouped-Query Attention (GQA) with Rotary Positional Embeddings (RoPE) and a KV cache, and a SwiGLU feed-forward network. Finally, an LM head projects the output to predict the next token ID, which is decoded back to text. This architecture prioritizes speed and minimal overhead by avoiding high-level frameworks.
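
As a concrete illustration of the memory-mapped loading step, below is a minimal C++ sketch that maps a weights file and views it as a flat float32 array. The header layout and tensor ordering of the actual torchless binary are not reproduced here; this only shows the mmap mechanics.

    // Minimal sketch: memory-map a weights file and view it as float32.
    // The real torchless format has its own header and tensor layout,
    // which this example deliberately ignores.
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <weights.bin>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

        // Pages are faulted in on demand; nothing is copied up front,
        // and the OS page cache keeps the weights warm across runs.
        void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { std::perror("mmap"); return 1; }

        const float* weights = static_cast<const float*>(data);
        std::printf("mapped %lld bytes; first value: %f\n",
                    (long long)st.st_size, weights[0]);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }

Mapping rather than reading keeps startup cheap even for a 7B model, since only the pages actually touched during inference are loaded.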

Quick Start & Requirements

  • Primary install/run: Compile the C++ project using CMake, then execute the compiled binary.
  • Prerequisites:
    • Mistral 7B v0.1 model weights (downloaded from Hugging Face).
    • nlohmann JSON library (downloaded via curl).
    • Python 3 for the export script and dependency management.
    • git, cmake, curl.
  • Setup: Clone ryanssenn/torchless and the Mistral model repository. Download json.hpp. Optionally create a Python virtual environment and install requirements (pip install -r requirements.txt). Export the model with python3 export_mistral.py, then compile the C++ code with cmake followed by cmake --build; a consolidated command sketch follows this list.
  • Run: Execute ./torchless <path_to_mistral.bin> "<your_prompt>".
  • Links:
    • Mistral 7B Model: https://huggingface.co/mistralai/Mistral-7B-v0.1
    • Torchless GitHub: https://github.com/ryanssenn/torchless
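
Put together, the steps above look roughly like the following. The json.hpp URL, the exported filename, and the binary's location after the build are assumptions; defer to the README where they differ.

    # Hypothetical end-to-end sequence; exact paths and filenames may differ.
    git clone https://github.com/ryanssenn/torchless && cd torchless
    git clone https://huggingface.co/mistralai/Mistral-7B-v0.1    # model weights (large download)
    curl -LO https://raw.githubusercontent.com/nlohmann/json/develop/single_include/nlohmann/json.hpp
    python3 -m venv .venv && source .venv/bin/activate            # optional virtual environment
    pip install -r requirements.txt
    python3 export_mistral.py                                     # assumed to write mistral.bin
    cmake -B build && cmake --build build
    ./torchless mistral.bin "Once upon a time"                    # binary path depends on build layout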

Highlighted Details

  • Fully custom-built LLM inference engine in C/C++, developed entirely from scratch.
  • Supports Mistral 7B model inference on CPU with text completion.
  • Includes a model export script (export_mistral.py) that packs the weights into a single binary, currently supporting f32 precision.
  • Implements core LLM components: BPE tokenizer, RMSNorm, RoPE, KV cache, GQA, SwiGLU MLP, and basic text generation (greedy decoding and multinomial sampling); an RMSNorm sketch follows this list.
  • Features parity tests to validate inference components against Hugging Face implementations.
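
To make one of those components concrete, here is a minimal RMSNorm sketch in C++. It follows the standard Llama/Mistral-style formulation (scale by the reciprocal root-mean-square, then by a learned gain) rather than the torchless source, and the epsilon default is an assumption.

    // Minimal RMSNorm sketch: y[i] = w[i] * x[i] / sqrt(mean(x^2) + eps).
    // Standard formulation; not taken from the torchless source.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    void rmsnorm(float* out, const float* x, const float* w, int n,
                 float eps = 1e-5f) {                   // eps value assumed
        float ss = 0.0f;
        for (int i = 0; i < n; i++) ss += x[i] * x[i];  // sum of squares
        float scale = 1.0f / std::sqrt(ss / n + eps);   // 1 / RMS
        for (int i = 0; i < n; i++) out[i] = w[i] * (x[i] * scale);
    }

    int main() {
        std::vector<float> x = {1.0f, -2.0f, 3.0f, -4.0f};
        std::vector<float> w(x.size(), 1.0f);  // identity gain
        std::vector<float> y(x.size());
        rmsnorm(y.data(), x.data(), w.data(), (int)x.size());
        for (float v : y) std::printf("%f ", v);
        std::printf("\n");
        return 0;
    }

Compared with LayerNorm, RMSNorm drops the mean-subtraction and bias term, saving a pass over the activations in each of the model's 32 layers.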

Maintenance & Community

The project encourages users to open GitHub issues for environment-specific problems. No other community channels (like Discord/Slack) or explicit roadmap details beyond current development status are mentioned.

Licensing & Compatibility

The repository's README does not explicitly state a software license. This omission requires further investigation for any commercial or derivative use.

Limitations & Caveats

The project is currently CPU-focused, with CUDA kernel support listed as a future goal. SIMD optimizations and CPU multithreading are marked as "Todo." Advanced quantization like fp8 is planned but not yet implemented. A terminal chat interface is also a future development item. The project is actively under development, with ongoing work on performance optimizations.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days

Explore Similar Projects

ArcticInference by snowflakedb

  • Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (Cofounder of Lightning AI), and 1 more.
  • 367 stars · top 1.7% on SourcePulse
  • vLLM plugin for high-throughput, low-latency LLM and embedding inference
  • Created 9 months ago · updated 5 days ago

EAGLE by SafeAILab

  • Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.
  • 2k stars · top 0.9% on SourcePulse
  • Speculative decoding research paper for faster LLM inference
  • Created 2 years ago · updated 3 weeks ago