dotLLM  by kkokosa

Natively built .NET LLM inference engine

Created 2 months ago
402 stars

Top 71.8% on SourcePulse

Project Summary

Summary

dotLLM provides a high-performance LLM inference engine natively implemented in C#/.NET, targeting developers seeking efficient LLM integration. It bypasses Python/C++ wrappers, offering significant speedups via SIMD-optimized CPU and CUDA GPU backends, making advanced LLM capabilities accessible within the .NET ecosystem.

How It Works

The engine is built from the ground up in pure C#, leveraging System.Runtime.Intrinsics for SIMD CPU operations and PTX kernels launched through the CUDA Driver API for GPU acceleration. It supports transformer models such as Llama, Mistral, and Phi. Key design choices include zero-GC inference using unmanaged memory, memory-mapped GGUF loading for millisecond model startup, and a modular NuGet package structure. Because everything runs in-process, this native approach avoids the overhead of inter-process communication and foreign function interfaces.
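As a rough illustration of the SIMD side (a sketch only, not dotLLM's actual code, which uses the lower-level System.Runtime.Intrinsics API), the portable System.Numerics.Vector&lt;T&gt; type shows the same idea: one loop iteration processes several floats per instruction, depending on the CPU's register width.

```csharp
using System;
using System.Numerics;

// Illustrative sketch of SIMD vectorization in .NET (not dotLLM's code):
// Vector<float> maps to the widest SIMD registers the CPU supports
// (SSE/AVX on x86, NEON on ARM), so each iteration handles 4-16 floats.
static float DotSimd(float[] a, float[] b)
{
    var acc = Vector<float>.Zero;
    int i = 0;
    for (; i <= a.Length - Vector<float>.Count; i += Vector<float>.Count)
        acc += new Vector<float>(a, i) * new Vector<float>(b, i);

    float sum = Vector.Sum(acc);
    for (; i < a.Length; i++)   // scalar tail for the leftover elements
        sum += a[i] * b[i];
    return sum;
}

var x = new float[1000];
var y = new float[1000];
Array.Fill(x, 1.5f);
Array.Fill(y, 2.0f);
Console.WriteLine(DotSimd(x, y)); // 1000 * 1.5 * 2.0 = 3000
```

A loop like this allocates nothing per call even on the managed heap; per the design notes above, dotLLM goes further by keeping the tensors themselves in unmanaged memory so the GC never scans them.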

Quick Start & Requirements

Installation options:

  • Global .NET tool: `dotnet tool install -g DotLLM.Cli --prerelease`
  • Self-contained binaries from releases.
  • NuGet packages for .NET applications.

Requires the .NET 10 runtime. GPU acceleration needs an NVIDIA GPU and the CUDA Toolkit. Python 3.10+ is used for scripts.

Website: https://dotllm.dev/
Documentation: docs/
Roadmap: docs/ROADMAP.md
Discussions: https://github.com/kkokosa/dotLLM/discussions

Highlighted Details

  • Zero-GC Inference: Uses unmanaged memory (NativeMemory.AlignedAlloc) for tensors, avoiding managed heap allocations.
  • Performance: Features SIMD vectorization, fused operators, and efficient attention mechanisms.
  • Rapid Loading: Memory-mapped GGUF loading enables multi-GB models to load in milliseconds.
  • Model Support: Handles Llama, Mistral, Phi, Qwen, DeepSeek with various quantizations (FP16, Q8_0, Q4_K_M).
  • OpenAI-Compatible API: Includes an ASP.NET Core server with a chat UI.
  • Advanced Features: Supports paged KV-cache, speculative decoding, and constrained decoding (JSON, schema, regex, grammar).
  • Hybrid Compute: Offers CPU/GPU layer offloading and NUMA/P-core aware threading.
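The millisecond-startup figure above rests on memory mapping: the OS maps the GGUF file into the process address space without copying its bytes, and pages are faulted in lazily on first access. A minimal sketch of that mechanism (illustrative only, not dotLLM's loader; the 16 MB temp file stands in for a multi-GB GGUF):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Stand-in for a multi-GB GGUF model file (sketch only, not dotLLM's loader).
string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[16 * 1024 * 1024]);

byte first;
using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
{
    // Creating the mapping completes without reading the file's contents;
    // this first read is what actually faults a page in from disk.
    first = view.ReadByte(0);
}
Console.WriteLine(first);
File.Delete(path);
```

Because only the pages a model actually touches are ever read, "loading" is near-instant regardless of file size, at the cost of first-token latency while the working set pages in.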

Maintenance & Community

Authored by .NET MVP Konrad Kokosa. Community engagement happens via GitHub Discussions, and a detailed roadmap is available in docs/ROADMAP.md.

Licensing & Compatibility

Licensed under the GNU General Public License v3.0 (GPL v3). This strong copyleft license requires derivative works to be distributed under the same terms, which may complicate commercial use or integration into proprietary software.

Limitations & Caveats

Native AOT builds are experimental, and speculative decoding currently supports greedy sampling only. The GPL v3 license carries copyleft obligations. Continuous batching and advanced scheduling are planned for future releases.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 29
  • Issues (30d): 35
  • Star history: 404 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

JittorLLMs by Jittor

0.0%
2k
Low-resource LLM inference library
Created 3 years ago
Updated 1 year ago
Starred by Jason Knight (Director of AI Compilers at NVIDIA; cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.6%
7k
LLM inference engine for blazing fast performance
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Anil Dash (former CEO of Glitch), and 23 more.

llamafile by mozilla-ai

0.3%
24k
Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc
Created 2 years ago
Updated 4 days ago