Agnitraai  by Agnitraai

LLM inference optimization and trust layer

Created 8 months ago
693 stars

Top 48.7% on SourcePulse

Project Summary

Agnitra AI offers a Python SDK that optimizes Large Language Model (LLM) inference for production, with a focus on speed, cost reduction, and verifiable trust. It targets engineers and researchers who want more efficient LLM deployment without retraining models. The stated benefits are up to 2x lower memory usage and 1.5-2x higher throughput, plus cryptographically signed inference manifests for provenance.
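
As a rough sanity check on the "up to 2x" memory figure, the sketch below computes weight storage at different precisions. It assumes an FP16 baseline and counts weights only (no activations, KV cache, or quantization scales); the summary does not state the baseline, and weight_memory_gb is a hypothetical helper, not part of the SDK.

```python
# Bytes needed to store one weight at each precision (weights only;
# activations, KV cache, and quantization scales are ignored here).
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


params = 3.8e9  # Phi-3-mini has roughly 3.8 billion parameters
fp16_gb = weight_memory_gb(params, "fp16")
int8_gb = weight_memory_gb(params, "int8")
print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB, ratio: {fp16_gb / int8_gb:.1f}x")
```

Under these assumptions, 8-bit weights halve the weight footprint relative to FP16, which lines up with the quoted upper bound.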

How It Works

Agnitra AI is an optimization layer enabled through a single Python keyword argument (agnitra_kwargs in the usage example under Quick Start), positioned as a lighter-weight alternative to adopting a complex serving runtime such as vLLM or TensorRT-LLM. Its core mechanism is automatic quantization (INT8, INT4, FP8, or an auto-selected mode suited to the GPU) plus other performance-tuning passes, built on torchao. Because it is an SDK, it slots into existing model.generate() workflows with minimal code changes, and it offers an honest passthrough for unsupported models, leaving them unoptimized.
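
To make the quantization step concrete, here is a minimal sketch of symmetric INT8 weight quantization in plain Python. It is an illustration of the general technique only, not Agnitra's or torchao's implementation: it uses a single per-tensor scale for brevity, whereas production libraries typically use per-channel or per-group scales. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns (int values, scale)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err <= scale / 2 + 1e-12)
```

Each weight is stored as one signed byte plus a shared scale, which is where the memory saving comes from; the price is a bounded rounding error per weight.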

Quick Start & Requirements

  • Installation: pip install "agnitra[quantize]" is recommended for core optimization features. Additional extras like [trust] are available.
  • Core Dependency: Requires torch for quantization and optimization passes.
  • Usage Example:
    import torch
    from agnitra.integrations.huggingface import AgnitraModel
    model = AgnitraModel.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct",
        torch_dtype=torch.float16,
        agnitra_kwargs={"input_shape": (1, 512), "quantize": "auto"},
    ).cuda()
    # Use 'model' like a standard HuggingFace model
    
  • Links: Quickstart, Integrations, Benchmarks.

Highlighted Details

  • Automatic Quantization: Supports INT8, INT4, and FP8 weight quantization, with an "auto" mode selecting the optimal configuration for the target GPU.
  • Trust & Provenance: Generates cryptographically signed inference manifests (Ed25519) detailing model hashes, applied optimizations, and runtime context, essential for regulated deployments.
  • Cross-Customer Cache: An opt-in shared cache (cache.agnitra.ai) stores and retrieves optimal inference configurations based on hardware and model architecture fingerprints, reducing calibration time.
  • CLI Tool: Provides a command-line interface for optimization (optimize, optimize-dir), model packaging, and trust verification (agnitra doctor, agnitra trust verify).
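
The cross-customer cache described above keys configurations on hardware and model-architecture fingerprints. The sketch below shows one plausible way such a key could be derived and looked up. The field names, the config_fingerprint helper, and the in-memory dict standing in for the remote cache are all assumptions for illustration; the summary does not document the actual fingerprint format or cache protocol.

```python
import hashlib
import json


def config_fingerprint(gpu_name: str, compute_capability: str,
                       architecture: str, num_params: int) -> str:
    """Derive a stable cache key from hardware and model-architecture fields."""
    payload = {
        "gpu": gpu_name,
        "compute_capability": compute_capability,
        "architecture": architecture,
        "num_params": num_params,
    }
    # sort_keys + fixed separators make the JSON byte-for-byte reproducible.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# An in-memory dict stands in for the remote shared cache (hypothetical).
cache = {}
key = config_fingerprint("NVIDIA A100", "8.0", "Phi3ForCausalLM", 3_800_000_000)
cache[key] = {"quantize": "int8"}  # illustrative config stored after calibration

# A second machine with the same GPU and architecture derives the same key.
same_key = config_fingerprint("NVIDIA A100", "8.0", "Phi3ForCausalLM", 3_800_000_000)
hit = cache.get(same_key)

# Different hardware yields a different key, so the lookup misses.
miss = cache.get(config_fingerprint("NVIDIA T4", "7.5", "Phi3ForCausalLM", 3_800_000_000))
print(hit, miss)
```

Hashing a canonical serialization means the key depends only on the field values, not on dictionary ordering or whitespace, so independent clients agree on it without coordination.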

Maintenance & Community

The project is actively maintained with clear channels for community engagement via GitHub Discussions and Issues. A detailed roadmap outlines future development, including expanded model support and advanced trust features.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Designed to integrate with popular frameworks including HuggingFace Transformers, LangChain, LlamaIndex, and Accelerate. The Apache 2.0 license permits commercial use.

Limitations & Caveats

The library focuses exclusively on decoder-only LLMs; encoder transformers, image generation, and speech models are explicitly out of scope. It is not a standalone serving runtime and lacks features such as paged KV cache and continuous batching, so deployments that need them must pair it with a runtime such as vLLM. Optimization is single-GPU only; multi-GPU sharding requires separate tools. LoRA fine-tunes are supported only after the adapter weights are merged into the base model with PEFT's merge_and_unload() method.
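
For readers unfamiliar with LoRA merging, the following sketch shows what the merge does mathematically: the low-rank update (alpha / r) * B @ A is folded into the base weight, leaving one ordinary matrix. It uses nested lists instead of tensors so it runs anywhere; matmul and merge_lora are hypothetical helpers, not PEFT functions, and the numbers are made up.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight, shape (out=2, in=2)
a = [[0.5, -0.5]]              # LoRA A, shape (r=1, in=2)
b = [[2.0], [4.0]]             # LoRA B, shape (out=2, r=1)
alpha, r = 2, 1
merged = merge_lora(w, a, b, alpha, r)

x = [[1.0], [0.0]]             # column-vector input

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(w, x)
lora_out = matmul(b, matmul(a, x))
unmerged_out = [[bo[0] + (alpha / r) * lo[0]] for bo, lo in zip(base_out, lora_out)]

# Merged path: a single matrix multiply gives the same result.
merged_out = matmul(merged, x)
print(merged_out == unmerged_out)
```

After merging, the model contains only plain weight matrices, with no separate adapter branch left for a downstream tool to account for.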

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 40
  • Issues (30d): 0
  • Star History: 702 stars in the last 30 days
