atlas by Avarok-Cybersecurity

High-performance LLM inference engine in pure Rust

Created 2 months ago

581 stars

Top 55.0% on SourcePulse

Project Summary

Atlas is an LLM inference engine written in pure Rust. It aims to provide fast, stable, and cost-effective local inference while avoiding the dependency conflicts and ecosystem instability common in Python-based engines. It targets engineers and researchers who want to run large models locally instead of paying for premium cloud APIs, and it gets its speedups from hardware-specific optimizations and techniques such as speculative decoding.

How It Works

Atlas is organized as a Rust monorepo, which the project says improves stability and makes community contributions easier. Its core idea is hardware- and model-specific kernels: each hardware/model combination is tuned individually, and the project reports 2-3x speedups as a result. The design is plug-and-play, with Rust traits serving as the abstraction boundaries for models, layers, GPU backends, communication, and storage, so each piece can be swapped or extended independently. At runtime, an HTTP server hands requests to a scheduler that orchestrates batched decoding, speculative execution, and sampling, and dispatches the computation to hardware-specific CUDA kernels.
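The trait-based boundary between the scheduler and a GPU backend can be illustrated with a minimal sketch. Every name here (GpuBackend, MockBackend, Scheduler, decode_step) is hypothetical and chosen for illustration; none is taken from the Atlas codebase, and the mock backend simply returns token + 1 so the example runs without a GPU.

```rust
use std::collections::BTreeMap;

/// Hypothetical abstraction boundary for a GPU backend (not the Atlas API).
pub trait GpuBackend {
    fn name(&self) -> &str;
    /// Runs one decode step for a batch of last tokens and returns
    /// one next token per sequence, in the same order.
    fn decode_step(&mut self, batch: &[u32]) -> Vec<u32>;
}

/// Stand-in backend so the sketch runs anywhere: "predicts" token + 1.
pub struct MockBackend;

impl GpuBackend for MockBackend {
    fn name(&self) -> &str {
        "mock"
    }

    fn decode_step(&mut self, batch: &[u32]) -> Vec<u32> {
        batch.iter().map(|t| t.wrapping_add(1)).collect()
    }
}

/// The scheduler only sees the trait, so backends can be swapped freely.
pub struct Scheduler {
    backend: Box<dyn GpuBackend>,
    /// Last token of each active sequence, keyed by request id.
    active: BTreeMap<u64, u32>,
}

impl Scheduler {
    pub fn new(backend: Box<dyn GpuBackend>) -> Self {
        Scheduler { backend, active: BTreeMap::new() }
    }

    /// Registers a request with the last token of its prompt.
    pub fn submit(&mut self, id: u64, last_token: u32) {
        self.active.insert(id, last_token);
    }

    /// One batched step: gather last tokens, dispatch once, scatter results.
    pub fn step(&mut self) -> Vec<(u64, u32)> {
        let ids: Vec<u64> = self.active.keys().copied().collect();
        let batch: Vec<u32> = ids.iter().map(|id| self.active[id]).collect();
        let next = self.backend.decode_step(&batch);

        let mut out = Vec::with_capacity(ids.len());
        for (id, tok) in ids.into_iter().zip(next) {
            self.active.insert(id, tok);
            out.push((id, tok));
        }
        out
    }
}
```

Under these assumptions, adding a new hardware target would mean writing another type that implements the trait and passing it to Scheduler::new; the scheduling logic itself would not change.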

Quick Start & Requirements

The project provides a Docker image (avarok/atlas-gb10:latest) pre-compiled for NVIDIA GB10 hardware (SM121).

Primary Install: docker pull avarok/atlas-gb10:latest
Prerequisites: NVIDIA GPU (GB10 target), Docker, HuggingFace cache mounted into the container.

Run Command Example:

sudo docker run -d --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --port 8888 \
    --max-seq-len 65536 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --speculative

Documentation: Detailed recipes and build instructions are available in QUICKSTART.md and CONTRIBUTING.md.

Highlighted Details

Supports multiple quantization formats for KV cache (BF16, FP8, Turbo8, NVFP4, Turbo4, Turbo3) to balance memory usage and precision.
Reported to outperform vLLM on specific benchmarks (e.g., Qwen3.5-35B-A3B with MTP speculative decoding on GB10).
Extensible architecture allows adding new hardware targets, models, communication backends, and storage backends via trait implementations.
Includes speculative decoding (MTP), a paged KV cache, and optional NVMe offload (--high-speed-swap) for long contexts.
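To see why the KV cache format matters for long contexts, the sketch below estimates cache size from element width. It is a back-of-the-envelope calculation, not Atlas code: the model shape (40 layers, 8 KV heads, head dimension 128) and the page size are hypothetical, per-block scale metadata is ignored, and only the formats with a well-known element width (BF16, FP8, 4-bit) are modeled. The Turbo formats are left out because their layout is not described here.

```rust
/// KV cache element formats with a known bit width (illustrative subset).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum KvDtype {
    Bf16,
    Fp8,
    Fp4,
}

impl KvDtype {
    /// Bits per stored element, ignoring any per-block scale metadata.
    pub fn bits(self) -> u64 {
        match self {
            KvDtype::Bf16 => 16,
            KvDtype::Fp8 => 8,
            KvDtype::Fp4 => 4,
        }
    }
}

/// Hypothetical attention shape; not taken from any specific model card.
pub struct ModelShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
}

/// Bytes of KV cache needed per token: K and V for every layer and KV head.
pub fn kv_bytes_per_token(shape: &ModelShape, dtype: KvDtype) -> u64 {
    let elements = 2 * shape.layers * shape.kv_heads * shape.head_dim;
    elements * dtype.bits() / 8
}

/// Total KV cache bytes for a sequence of `tokens` tokens.
pub fn kv_cache_bytes(shape: &ModelShape, dtype: KvDtype, tokens: u64) -> u64 {
    kv_bytes_per_token(shape, dtype) * tokens
}

/// Number of fixed-size pages a paged KV cache needs (ceiling division).
pub fn pages_needed(tokens: u64, page_size: u64) -> u64 {
    (tokens + page_size - 1) / page_size
}
```

With this hypothetical shape, a 65536-token context needs 10 GiB of KV cache in BF16, 5 GiB in FP8, and 2.5 GiB at 4 bits per element, which is the trade-off between memory and precision that the format choice controls.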

Maintenance & Community

Atlas emphasizes a "Community-First" philosophy, encouraging contributions via its Discord server. The monorepo design aims to facilitate meaningful PRs, including AI-generated ones. The project actively integrates research from papers and welcomes community efforts to expand hardware and model support.

Licensing & Compatibility

The project uses a dual-license model:

Community Edition: AGPLv3, presented as suitable for personal, research, and non-commercial hosted projects. Derivative works, including modified versions offered to users over a network, must make their source available.
Enterprise Edition: Commercial license available for closed-source products, SaaS backends, or support relationships.

Limitations & Caveats

The provided Docker image is pre-compiled and optimized specifically for NVIDIA GB10 hardware, which is the main limitation. Although the architecture is designed for extensibility, support for other hardware (e.g., AMD, Apple Silicon, Intel) or for new models requires significant community or commercial effort. Some KV cache quantization formats (e.g., Turbo3) are noted as experimental.

Health Check

Last Commit: 13 hours ago
Responsiveness: Inactive
Pull Requests (30d): 113

Issues (30d)

Star History

96 stars in the last 30 days