atlas  by Avarok-Cybersecurity

High-performance LLM inference engine in pure Rust

Created 2 weeks ago

399 stars

Project Summary

Atlas is a pure Rust LLM inference engine designed to provide high-performance, stable, and cost-effective local inference, addressing the dependency hell and ecosystem instability common in Python-based engines. It targets engineers and researchers who want to run powerful LLMs locally without paying premium cloud API costs, and it derives its speedups from hardware-specific kernels and techniques such as speculative decoding and KV cache quantization.

How It Works

Atlas is organized as a Rust monorepo, a choice intended to improve stability and ease community contribution. Its core innovation is a set of kernels tuned for each hardware and model combination, reportedly yielding 2-3x speedups. Well-defined abstraction boundaries (Rust traits) for models, layers, GPU backends, communication, and storage give the system a plug-and-play design that is modular and extensible. At runtime, an HTTP server feeds a scheduler that orchestrates batched decoding, speculative execution, and sampling, and dispatches computation to hardware-specific CUDA kernels.
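
The trait-based boundaries described above might look roughly like the following minimal sketch. Every name here (GpuBackend, Model, CpuReference, ToyModel, select_backend, the "scale2" kernel) is hypothetical and chosen for illustration; none of it is taken from the Atlas codebase, and a CPU stand-in replaces real CUDA kernels so the example runs anywhere.

```rust
/// Hypothetical backend boundary: executes a named kernel on some device.
trait GpuBackend {
    fn name(&self) -> &'static str;
    fn launch(&self, kernel: &str, input: &[f32]) -> Vec<f32>;
}

/// Hypothetical model boundary: one forward step written against the trait,
/// never against a concrete device.
trait Model {
    fn forward(&self, backend: &dyn GpuBackend, tokens: &[u32]) -> Vec<f32>;
}

/// CPU stand-in so the sketch runs without a GPU.
struct CpuReference;

impl GpuBackend for CpuReference {
    fn name(&self) -> &'static str {
        "cpu-reference"
    }

    fn launch(&self, kernel: &str, input: &[f32]) -> Vec<f32> {
        match kernel {
            // Toy kernel: doubles every element.
            "scale2" => input.iter().map(|x| x * 2.0).collect(),
            // Unknown kernels pass the input through unchanged.
            _ => input.to_vec(),
        }
    }
}

/// Toy model: turns token ids into floats and runs one kernel.
struct ToyModel;

impl Model for ToyModel {
    fn forward(&self, backend: &dyn GpuBackend, tokens: &[u32]) -> Vec<f32> {
        let embedded: Vec<f32> = tokens.iter().map(|&t| t as f32).collect();
        backend.launch("scale2", &embedded)
    }
}

/// Picks a backend by target name, as a server might do at startup.
fn select_backend(target: &str) -> Option<Box<dyn GpuBackend>> {
    match target {
        "cpu-reference" => Some(Box::new(CpuReference)),
        _ => None,
    }
}
```

Because ToyModel only sees `&dyn GpuBackend`, supporting another hardware target in this sketch would mean one more `impl GpuBackend` and one more match arm in `select_backend`, with no change to the model code.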

Quick Start & Requirements

The project provides a Docker image (avarok/atlas-gb10:latest) pre-compiled for NVIDIA GB10 hardware (SM121).

  • Primary Install: docker pull avarok/atlas-gb10:latest
  • Prerequisites: NVIDIA GPU (GB10 target), Docker, HuggingFace cache mounted into the container.
  • Run Command Example:
    sudo docker run -d --name atlas \
      --network host --gpus all --ipc=host \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      avarok/atlas-gb10:latest \
      serve Qwen/Qwen3.6-35B-A3B-FP8 \
        --port 8888 \
        --max-seq-len 65536 \
        --kv-cache-dtype fp8 \
        --gpu-memory-utilization 0.90 \
        --speculative
    
  • Documentation: Detailed recipes and build instructions are available in QUICKSTART.md and CONTRIBUTING.md.

Highlighted Details

  • Supports multiple quantization formats for KV cache (BF16, FP8, Turbo8, NVFP4, Turbo4, Turbo3) to balance memory usage and precision.
  • Achieves competitive performance, outperforming vLLM on specific benchmarks (e.g., Qwen3.5-35B-A3B with MTP speculative decoding on GB10).
  • Extensible architecture allows adding new hardware targets, models, communication backends, and storage backends via trait implementations.
  • Includes speculative decoding (MTP), a paged KV cache, and optional NVMe offload (--high-speed-swap) for long contexts.
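
To make the memory side of the KV cache formats concrete, here is a rough sizing sketch. The 16-bit and 8-bit widths for BF16 and FP8 are standard; the widths for Turbo8, NVFP4, Turbo4, and Turbo3 are assumptions read off the format names and ignore any per-block scale metadata. The KvGeometry struct and its values are hypothetical and do not describe any particular model.

```rust
/// KV cache storage formats named in the list above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum KvDtype {
    Bf16,
    Fp8,
    Turbo8,
    Nvfp4,
    Turbo4,
    Turbo3,
}

impl KvDtype {
    /// Nominal bits per stored element. Only BF16 (16) and FP8 (8) are
    /// standard widths; the rest are ASSUMED from the format names and
    /// exclude scale/metadata overhead.
    fn bits(self) -> u64 {
        match self {
            KvDtype::Bf16 => 16,
            KvDtype::Fp8 | KvDtype::Turbo8 => 8,
            KvDtype::Nvfp4 | KvDtype::Turbo4 => 4,
            KvDtype::Turbo3 => 3,
        }
    }
}

/// Hypothetical attention geometry; not the dimensions of a real model.
struct KvGeometry {
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
}

/// Bytes needed to hold keys and values for `seq_len` tokens of one sequence.
fn kv_cache_bytes(g: &KvGeometry, seq_len: u64, dtype: KvDtype) -> u64 {
    // Factor 2 covers the separate key and value tensors.
    let elements = 2 * g.layers * g.kv_heads * g.head_dim * seq_len;
    // Round up to whole bytes.
    (elements * dtype.bits() + 7) / 8
}
```

With the made-up geometry of 40 layers, 4 KV heads, and a head dimension of 128, a 65536-token sequence needs 5 GiB in BF16, 2.5 GiB at 8 bits, 1.25 GiB at 4 bits, and about 0.94 GiB at 3 bits under these assumptions, which is the trade-off against precision that the format list describes.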

Maintenance & Community

Atlas emphasizes a "Community-First" philosophy, encouraging contributions via its Discord server. The monorepo design aims to facilitate meaningful PRs, including AI-generated ones. The project actively integrates research from papers and welcomes community efforts to expand hardware and model support.

Licensing & Compatibility

The project uses a dual-license model:

  • Community Edition: AGPLv3, permitting use for personal, research, and non-commercial hosted projects, but requiring source disclosure for derivative works.
  • Enterprise Edition: Commercial license available for closed-source products, SaaS backends, or support relationships.

Limitations & Caveats

The primary limitation is that the provided Docker image is pre-compiled and optimized specifically for NVIDIA GB10 hardware. While the architecture is designed for extensibility, adding support for other hardware (e.g., AMD, Apple Silicon, Intel) or new models requires significant community or commercial effort. Some advanced KV cache quantization methods (e.g., turbo3) are noted as experimental.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 73
  • Issues (30d): 18
  • Star History: 401 stars in the last 19 days
