ryanssenn/torchless: Custom C++ LLM inference engine for local text completion
Torchless is an LLM inference engine written from scratch in C/C++. It targets engineers and power users who want a lightweight, fast, and transparent runtime for large language models, and it currently demonstrates local text completion with Mistral 7B on CPU. Its primary benefit is a foundational, hand-coded engine for understanding and optimizing LLM inference performance without external dependencies such as PyTorch.
How It Works
Torchless takes a ground-up approach. A Python script (export_mistral.py) converts Hugging Face model weights into a single, standardized binary file, which the C++ engine memory-maps for efficient loading. At inference time, input prompts are BPE-tokenized into integer IDs and fed through a transformer loop: an embedding lookup, then 32 layers that each combine RMSNorm, Grouped-Query Attention (GQA) with Rotary Positional Embeddings (RoPE) and a KV cache, and a SwiGLU feed-forward network. Finally, an LM head projects the output to predict the next token ID, which is decoded back to text. This architecture prioritizes speed and minimal overhead by avoiding high-level frameworks.
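To make that loop concrete, here is a minimal, self-contained C++ sketch of one decode step: RMSNorm, grouped-query attention with RoPE and a growing KV cache, then a SwiGLU block. All names, the toy dimensions, and the identity "weight matrices" are illustrative assumptions, not Torchless's actual internals.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy dimensions; Mistral 7B itself uses dim=4096, 32 query heads, 8 KV heads.
constexpr int kDim = 8;
constexpr int kQHeads = 2;
constexpr int kKVHeads = 1;  // fewer KV heads than query heads: that is GQA
constexpr int kHeadDim = kDim / kQHeads;

using Vec = std::vector<float>;

// RMSNorm: scale x by 1/sqrt(mean(x^2) + eps); the learned gain is omitted.
Vec rmsnorm(const Vec& x, float eps = 1e-5f) {
    float ss = 0.f;
    for (float v : x) ss += v * v;
    float s = 1.f / std::sqrt(ss / x.size() + eps);
    Vec out(x.size());
    for (size_t i = 0; i < x.size(); ++i) out[i] = x[i] * s;
    return out;
}

// RoPE: rotate (even, odd) pairs of one head by a position-dependent angle,
// encoding relative position directly into the attention dot products.
void rope(float* h, int pos) {
    for (int i = 0; i + 1 < kHeadDim; i += 2) {
        float theta = pos * std::pow(10000.f, -static_cast<float>(i) / kHeadDim);
        float c = std::cos(theta), s = std::sin(theta);
        float a = h[i], b = h[i + 1];
        h[i] = a * c - b * s;
        h[i + 1] = a * s + b * c;
    }
}

// KV cache: keys and values for every past position, kept across decode steps.
struct KVCache { std::vector<Vec> k, v; };

// One attention pass for the current token. Q/K/V projections are identity
// maps so the GQA head-sharing and cache bookkeeping stay front and center.
Vec attention(const Vec& x, KVCache& cache, int pos) {
    Vec q = x, k(x.begin(), x.begin() + kKVHeads * kHeadDim), v = k;
    for (int h = 0; h < kQHeads; ++h) rope(&q[h * kHeadDim], pos);
    for (int h = 0; h < kKVHeads; ++h) rope(&k[h * kHeadDim], pos);
    cache.k.push_back(k);
    cache.v.push_back(v);

    const int T = static_cast<int>(cache.k.size());
    Vec out(kDim, 0.f);
    for (int h = 0; h < kQHeads; ++h) {
        const int kvh = h * kKVHeads / kQHeads;  // query head -> shared KV head
        Vec score(T);
        float maxs = -1e30f;
        for (int t = 0; t < T; ++t) {
            float dot = 0.f;
            for (int i = 0; i < kHeadDim; ++i)
                dot += q[h * kHeadDim + i] * cache.k[t][kvh * kHeadDim + i];
            score[t] = dot / std::sqrt(static_cast<float>(kHeadDim));
            maxs = std::max(maxs, score[t]);
        }
        float sum = 0.f;  // softmax over all cached positions
        for (int t = 0; t < T; ++t) { score[t] = std::exp(score[t] - maxs); sum += score[t]; }
        for (int t = 0; t < T; ++t)
            for (int i = 0; i < kHeadDim; ++i)
                out[h * kHeadDim + i] += score[t] / sum * cache.v[t][kvh * kHeadDim + i];
    }
    return out;
}

// SwiGLU FFN: silu(W1 x) * (W3 x) projected back by W2; identities again.
Vec swiglu(const Vec& x) {
    Vec out(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        out[i] = (x[i] / (1.f + std::exp(-x[i]))) * x[i];  // SiLU gate * value
    return out;
}

int main() {
    KVCache cache;
    Vec h(kDim, 0.1f);  // stand-in for a token embedding
    for (int pos = 0; pos < 4; ++pos) {  // decode four tokens
        Vec a = attention(rmsnorm(h), cache, pos);
        for (int i = 0; i < kDim; ++i) h[i] += a[i];  // residual add
        Vec f = swiglu(rmsnorm(h));
        for (int i = 0; i < kDim; ++i) h[i] += f[i];  // residual add
        std::printf("pos %d: h[0]=%.4f (cache length %zu)\n", pos, h[0], cache.k.size());
    }
    // A real engine repeats this over 32 layers, applies a final RMSNorm,
    // then an LM head projection and sampling to pick the next token ID.
}
```

The value of the KV cache is visible in attention(): each step appends one key/value entry and attends over the whole history, so generating token N costs work proportional to N rather than re-running the full prompt from scratch.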
Quick Start & Requirements
Requirements: git, cmake, and curl.

1. Clone ryanssenn/torchless and the Mistral model repository (https://huggingface.co/mistralai/Mistral-7B-v0.1).
2. Download json.hpp (e.g., with curl).
3. Optionally set up a Python virtual environment and install dependencies (pip install -r requirements.txt).
4. Export the model weights with python3 export_mistral.py.
5. Compile the C++ code with cmake and cmake --build.
6. Run ./torchless <path_to_mistral.bin> "<your_prompt>".

Repository: https://github.com/ryanssenn/torchless
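The export step produces a single weights file that the engine memory-maps instead of reading into RAM up front. Below is a minimal POSIX sketch of that loading pattern; the header fields (dim, layer count) are hypothetical stand-ins, not the project's documented binary format.

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <model.bin>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

    // Map the whole file read-only. Pages fault in lazily, so a
    // multi-gigabyte model "loads" almost instantly and is shared
    // with the OS page cache across runs.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { std::perror("mmap"); return 1; }
    close(fd);  // the mapping stays valid after the descriptor is closed

    // Hypothetical header: a few int32 dims, then raw f32 tensors packed
    // back to back. Torchless's real layout may differ.
    const int32_t* hdr = static_cast<const int32_t*>(base);
    int32_t dim = hdr[0], n_layers = hdr[1];
    const float* weights = reinterpret_cast<const float*>(hdr + 2);

    std::printf("mapped %lld bytes: dim=%d, layers=%d, weights[0]=%f\n",
                static_cast<long long>(st.st_size), dim, n_layers, weights[0]);
    munmap(base, st.st_size);
    return 0;
}
```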
Highlighted Details

A custom export script (export_mistral.py) for creating a single binary, with support for f32 quantization.

Maintenance & Community
The project encourages users to open GitHub issues for environment-specific problems. No other community channels (like Discord/Slack) or explicit roadmap details beyond current development status are mentioned.
Licensing & Compatibility
The repository's README does not explicitly state a software license; commercial or derivative use therefore requires further investigation.
Limitations & Caveats
The project is currently CPU-focused, with CUDA kernel support listed as a future goal. SIMD optimizations and CPU multithreading are marked as "Todo." Advanced quantization like fp8 is planned but not yet implemented. A terminal chat interface is also a future development item. The project is actively under development, with ongoing work on performance optimizations.