torchless by ryanssenn

Custom C++ LLM inference engine for local text completion

Created 7 months ago · 278 stars · Top 93.5% on SourcePulse

View on GitHub
Project Summary

Torchless is an LLM inference engine written from scratch in C/C++. It targets engineers and power users who want a lightweight, fast, and transparent runtime for large language models, and it currently demonstrates local text completion with Mistral 7B on CPU. Its primary benefit is a foundational, hand-coded engine for understanding and optimizing LLM inference performance without external dependencies such as PyTorch.

How It Works

Torchless employs a ground-up approach, starting with a Python script (export_mistral.py) to convert Hugging Face model weights into a single, standardized binary file. This binary is then memory-mapped by the C++ engine for efficient loading. The inference process involves BPE tokenization of input prompts into integer IDs, followed by a transformer loop. This loop processes token IDs through embedding, 32 layers incorporating RMSNorm, Grouped-Query Attention (GQA) with Rotary Positional Embeddings (RoPE) and a KV cache, and a SwiGLU feed-forward network. Finally, an LM head projects the output to predict the next token ID, which is decoded back to text. This architecture prioritizes speed and minimal overhead by avoiding high-level frameworks.
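
As a concrete illustration of the memory-mapped loading step, below is a minimal C++ sketch that maps a weights file and views it as a flat float32 array. The header layout and tensor ordering of the actual torchless binary are not reproduced here; this only shows the mmap mechanics.

    // Minimal sketch: memory-map a weights file and view it as float32.
    // The real torchless format has its own header and tensor layout,
    // which this example deliberately ignores.
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <weights.bin>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

        // Pages are faulted in on demand; nothing is copied up front,
        // and the OS page cache keeps the weights warm across runs.
        void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { std::perror("mmap"); return 1; }

        const float* weights = static_cast<const float*>(data);
        std::printf("mapped %lld bytes; first value: %f\n",
                    (long long)st.st_size, weights[0]);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }

Mapping rather than reading keeps startup cheap even for a 7B model, since only the pages actually touched during inference are loaded.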

Quick Start & Requirements

  • Primary install/run: Compile the C++ project using CMake, then execute the compiled binary.
  • Prerequisites:
    • Mistral 7B v0.1 model weights (downloaded from Hugging Face).
    • nlohmann JSON library (downloaded via curl).
    • Python 3 for the export script and dependency management.
    • git, cmake, curl.
  • Setup: Clone ryanssenn/torchless and the Mistral model repository. Download json.hpp. Optionally create a Python virtual environment and install requirements (pip install -r requirements.txt). Export the model with python3 export_mistral.py, then compile the C++ code with cmake followed by cmake --build; a consolidated command sketch follows this list.
  • Run: Execute ./torchless <path_to_mistral.bin> "<your_prompt>".
  • Links:
    • Mistral 7B Model: https://huggingface.co/mistralai/Mistral-7B-v0.1
    • Torchless GitHub: https://github.com/ryanssenn/torchless
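
Put together, the steps above look roughly like the following. The json.hpp URL, the exported filename, and the binary's location after the build are assumptions; defer to the README where they differ.

    # Hypothetical end-to-end sequence; exact paths and filenames may differ.
    git clone https://github.com/ryanssenn/torchless && cd torchless
    git clone https://huggingface.co/mistralai/Mistral-7B-v0.1    # model weights (large download)
    curl -LO https://raw.githubusercontent.com/nlohmann/json/develop/single_include/nlohmann/json.hpp
    python3 -m venv .venv && source .venv/bin/activate            # optional virtual environment
    pip install -r requirements.txt
    python3 export_mistral.py                                     # assumed to write mistral.bin
    cmake -B build && cmake --build build
    ./torchless mistral.bin "Once upon a time"                    # binary path depends on build layout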

Highlighted Details

  • Fully custom-built LLM inference engine in C/C++, developed entirely from scratch.
  • Supports Mistral 7B model inference on CPU with text completion.
  • Includes a model export script (export_mistral.py) that packs the weights into a single binary, currently supporting f32 precision.
  • Implements core LLM components: BPE tokenizer, RMSNorm, RoPE, KV cache, GQA, SwiGLU MLP, and basic text generation (greedy decoding and multinomial sampling); an RMSNorm sketch follows this list.
  • Features parity tests to validate inference components against Hugging Face implementations.
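
To make one of those components concrete, here is a minimal RMSNorm sketch in C++. It follows the standard Llama/Mistral-style formulation (scale by the reciprocal root-mean-square, then by a learned gain) rather than the torchless source, and the epsilon default is an assumption.

    // Minimal RMSNorm sketch: y[i] = w[i] * x[i] / sqrt(mean(x^2) + eps).
    // Standard formulation; not taken from the torchless source.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    void rmsnorm(float* out, const float* x, const float* w, int n,
                 float eps = 1e-5f) {                   // eps value assumed
        float ss = 0.0f;
        for (int i = 0; i < n; i++) ss += x[i] * x[i];  // sum of squares
        float scale = 1.0f / std::sqrt(ss / n + eps);   // 1 / RMS
        for (int i = 0; i < n; i++) out[i] = w[i] * (x[i] * scale);
    }

    int main() {
        std::vector<float> x = {1.0f, -2.0f, 3.0f, -4.0f};
        std::vector<float> w(x.size(), 1.0f);  // identity gain
        std::vector<float> y(x.size());
        rmsnorm(y.data(), x.data(), w.data(), (int)x.size());
        for (float v : y) std::printf("%f ", v);
        std::printf("\n");
        return 0;
    }

Compared with LayerNorm, RMSNorm drops the mean-subtraction and bias term, saving a pass over the activations in each of the model's 32 layers.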

Maintenance & Community

The project encourages users to open GitHub issues for environment-specific problems. No other community channels (like Discord/Slack) or explicit roadmap details beyond current development status are mentioned.

Licensing & Compatibility

The repository's README does not explicitly state a software license. This omission requires further investigation for any commercial or derivative use.

Limitations & Caveats

The project is currently CPU-focused, with CUDA kernel support listed as a future goal. SIMD optimizations and CPU multithreading are marked as "Todo." Advanced quantization like fp8 is planned but not yet implemented. A terminal chat interface is also a future development item. The project is actively under development, with ongoing work on performance optimizations.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days

Explore Similar Projects

ArcticInference by snowflakedb

  • Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (Cofounder of Lightning AI), and 1 more.
  • 367 stars · top 1.7% on SourcePulse
  • vLLM plugin for high-throughput, low-latency LLM and embedding inference
  • Created 9 months ago · updated 5 days ago

EAGLE by SafeAILab

  • Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.
  • 2k stars · top 0.9% on SourcePulse
  • Speculative decoding research paper for faster LLM inference
  • Created 2 years ago · updated 3 weeks ago