Agnitraai by Agnitraai

LLM inference optimization and trust layer

Created 10 months ago

693 stars

Top 48.3% on SourcePulse

Project Summary

Agnitra AI is a Python SDK for optimizing Large Language Model (LLM) inference in production, with a focus on speed, cost reduction, and verifiable trust. It targets engineers and researchers who want more efficient LLM deployments without retraining the model. The stated benefits are up to 2x lower memory usage and 1.5-2x higher throughput, plus cryptographically signed inference manifests for provenance.

How It Works

Agnitra AI is an optimization layer enabled through a single keyword argument, positioned as a lighter-weight alternative to adopting a full serving runtime such as vLLM or TensorRT-LLM. Its core mechanism is automatic quantization (INT8, INT4, FP8, or an "auto" mode that selects the best option for the GPU) combined with other performance tuning passes, built on torchao. Because it is an SDK rather than a server, it fits into existing model.generate() workflows with minimal code changes. Unsupported models get an honest passthrough: they are left unoptimized rather than reported as accelerated.
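To make the memory claim concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in plain Python. It is an illustration of the general technique only, not Agnitra's or torchao's implementation (production schemes typically use per-channel or per-group scales); the function names quantize_int8 and dequantize_int8 are hypothetical.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.05, 0.9, -0.33, 1.1]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

fp16_bytes = 2 * len(weights)              # FP16 stores 2 bytes per weight
int8_bytes = codes.itemsize * len(codes)   # INT8 stores 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing one byte per weight instead of two is where a 2x reduction relative to FP16 comes from; the price is a rounding error of at most half the scale per weight.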

Quick Start & Requirements

Installation: pip install "agnitra[quantize]" is recommended for core optimization features. Additional extras like [trust] are available.
Core Dependency: Requires torch for quantization and optimization passes.

Usage Example:

import torch
from agnitra.integrations.huggingface import AgnitraModel
model = AgnitraModel.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    agnitra_kwargs={"input_shape": (1, 512), "quantize": "auto"},
).cuda()
# Use 'model' like a standard HuggingFace model
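
Throughput figures depend on hardware and workload, so it is worth measuring them locally. The following is a rough timing sketch, not part of the Agnitra SDK: tokens_per_second and fake_generate are hypothetical names, and fake_generate is a stand-in that would be replaced by a wrapper around the real model call.

```python
import time


def tokens_per_second(generate, prompt, max_new_tokens, runs=3):
    """Time `generate` over several runs and return the best tokens/second."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_new_tokens)  # number of new tokens produced
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, produced / elapsed)
    return best


def fake_generate(prompt, max_new_tokens):
    """Stand-in for a real model call; sleeps to simulate decoding work."""
    time.sleep(0.001 * max_new_tokens)
    return max_new_tokens


rate = tokens_per_second(fake_generate, "Hello", max_new_tokens=20)
```

Running the same harness against the baseline and the optimized model gives a like-for-like ratio. When timing GPU work, the wrapper should wait for the device to finish (for example with torch.cuda.synchronize()) before returning, otherwise the timer stops too early.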

Links: Quickstart, Integrations, Benchmarks.

Highlighted Details

Automatic Quantization: Supports INT8, INT4, and FP8 weight quantization, with an "auto" mode selecting the optimal configuration for the target GPU.
Trust & Provenance: Generates cryptographically signed inference manifests (Ed25519) detailing model hashes, applied optimizations, and runtime context, essential for regulated deployments.
Cross-Customer Cache: An opt-in shared cache (cache.agnitra.ai) stores and retrieves optimal inference configurations based on hardware and model architecture fingerprints, reducing calibration time.
CLI Tool: Provides a command-line interface for optimization (optimize, optimize-dir), model packaging, and trust verification (agnitra doctor, agnitra trust verify).
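
As an illustration of the Trust & Provenance item, the sketch below builds a manifest and serializes it deterministically, which is the step that has to happen before any signature is computed. The field names and helper functions are assumptions made for this example, not Agnitra's manifest schema, and the Ed25519 signing step itself is left out.

```python
import hashlib
import json


def build_manifest(weights: bytes, optimizations: list, runtime: dict) -> dict:
    """Collect the facts a provenance manifest would record (hypothetical fields)."""
    return {
        "model_sha256": hashlib.sha256(weights).hexdigest(),
        "optimizations": list(optimizations),
        "runtime": runtime,
    }


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize deterministically so equal manifests always yield equal bytes."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


runtime = {"gpu": "example-gpu", "torch": "2.x"}   # placeholder runtime context
weights = b"\x00\x01\x02\x03"                      # placeholder for real weight bytes

manifest = build_manifest(weights, ["int8_weight_only"], runtime)
payload = canonical_bytes(manifest)                # the bytes a signer would sign
digest = hashlib.sha256(payload).hexdigest()       # short identifier for the manifest

# Changing a single weight byte changes the hash, and therefore the payload.
tampered = canonical_bytes(build_manifest(b"\x00\x01\x02\x04", ["int8_weight_only"], runtime))
```

A signature over payload then covers the model hash, the applied optimizations, and the runtime context at once, so a verifier can detect a change to any of them.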

Maintenance & Community

The project is actively maintained with clear channels for community engagement via GitHub Discussions and Issues. A detailed roadmap outlines future development, including expanded model support and advanced trust features.

Licensing & Compatibility

License: Apache 2.0.
Compatibility: Integrates with HuggingFace Transformers, LangChain, LlamaIndex, and Accelerate. The Apache 2.0 license permits commercial use.

Limitations & Caveats

The library focuses exclusively on decoder-only LLMs and explicitly excludes encoder transformers, image generation, and speech models. It is not a standalone serving runtime: it has no paged KV cache or continuous batching, so workloads that need those features must pair it with a runtime such as vLLM. Agnitra AI performs single-GPU optimization only; multi-GPU sharding requires separate tools. LoRA fine-tunes are supported only after the adapter weights have been merged into the base model, for example with PEFT's merge_and_unload() method on the loaded PeftModel.
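
The LoRA restriction is easier to see with the merge written out. Merging folds the low-rank update into the base weight, W' = W + (alpha / r) * B A, leaving a single dense matrix for a quantizer to work on. The toy sketch below uses plain Python lists in place of tensors and follows the standard LoRA scaling; merge_lora and matmul are illustrative helpers, not PEFT or Agnitra functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]   # shape (out_features, in_features)
lora_a = [[1.0, 2.0]]               # shape (rank, in_features)
lora_b = [[0.5], [0.25]]            # shape (out_features, rank)

merged = merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1)
# merged == [[2.0, 2.0], [0.5, 2.0]]
```

After merging, the adapter matrices are no longer needed at inference time, and the model has the same layer structure as the original base model.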
