Agnitraai
LLM inference optimization and trust layer
Top 48.7% on SourcePulse
Agnitra AI offers a Python SDK designed to optimize Large Language Model (LLM) inference for production environments, focusing on speed, cost reduction, and verifiable trust. It targets engineers and researchers seeking to enhance LLM deployment efficiency without the need for model retraining. The primary benefits include significant reductions in memory usage (up to 2x) and increases in throughput (1.5-2x), alongside cryptographically signed inference manifests for enhanced provenance.
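To make the idea of a signed inference manifest concrete, here is a minimal sketch using only the Python standard library. It is not Agnitra's manifest format or signing scheme: the field names, the helper functions, and the demo key are all hypothetical, and HMAC-SHA256 stands in for a real public-key signature (with HMAC, the verifier must hold the same secret key as the signer).

```python
import hashlib
import hmac
import json


def build_manifest(model_id: str, quantize_mode: str, weights: bytes) -> dict:
    """Collect facts a verifier might want to check (hypothetical fields)."""
    return {
        "model_id": model_id,
        "quantize": quantize_mode,
        "weights_sha256": hashlib.sha256(weights).hexdigest(),
    }


def sign_manifest(manifest: dict, key: bytes) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal manifests
    # always serialize to the same bytes before signing.
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, signature: str, key: bytes) -> bool:
    # Constant-time comparison of the recomputed and supplied signatures.
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


key = b"demo-key-not-for-production"  # assumption: illustrative key only
manifest = build_manifest("microsoft/Phi-3-mini-4k-instruct", "int8", b"fake weight bytes")
signature = sign_manifest(manifest, key)
```

Changing any field after signing (for example, the quantization mode) makes verification fail, which is the property that gives a manifest its provenance value.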
How It Works
Agnitra AI functions as an optimization layer integrated via a single Python keyword, acting as an alternative to complex serving runtimes like vLLM or TensorRT-LLM. Its core mechanism involves automatic quantization (INT8, INT4, FP8, or an auto-selected best mode for the GPU) and other performance tuning passes, leveraging torchao. This SDK approach allows seamless integration into existing model.generate() workflows, optimizing inference with minimal code changes and providing honest passthrough for unsupported models.
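The quantization step can be illustrated with a generic symmetric INT8 scheme. This is a textbook sketch in plain Python, not the torchao or Agnitra implementation; the function names and sample weights are made up for illustration, and real libraries typically quantize per channel or per group on tensors rather than per list.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.03, 0.88, -0.51]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of two (FP16), at the cost of a rounding error no larger than half the scale.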
Quick Start & Requirements
pip install "agnitra[quantize]" is recommended for core optimization features. Additional extras like [trust] are available.
Requires torch for quantization and optimization passes.
import torch
from agnitra.integrations.huggingface import AgnitraModel
model = AgnitraModel.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
torch_dtype=torch.float16,
agnitra_kwargs={"input_shape": (1, 512), "quantize": "auto"},
).cuda()
# Use 'model' like a standard HuggingFace model
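As a rough sanity check on the memory figures, the weight footprint of the quick-start model can be estimated per precision. This is back-of-the-envelope arithmetic, not a measurement: it assumes roughly 3.8 billion parameters for Phi-3-mini, counts weights only, and ignores quantization metadata (scales), activations, and the KV cache.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

PHI3_MINI_PARAMS = 3.8e9  # assumption: approximate published parameter count


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype}: {weight_memory_gb(PHI3_MINI_PARAMS, dtype):.1f} GB")
# fp16: 7.6 GB
# fp8: 3.8 GB
# int8: 3.8 GB
# int4: 1.9 GB
```

Under these assumptions, moving from FP16 to INT8 or FP8 halves weight memory, which is consistent with the "up to 2x" figure quoted above.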
Highlighted Details
… (cache.agnitra.ai) stores and retrieves optimal inference configurations based on hardware and model architecture fingerprints, reducing calibration time.
… (optimize, optimize-dir), model packaging, and trust verification (agnitra doctor, agnitra trust verify).
Maintenance & Community
The project is actively maintained with clear channels for community engagement via GitHub Discussions and Issues. A detailed roadmap outlines future development, including expanded model support and advanced trust features.
Licensing & Compatibility
Limitations & Caveats
The library focuses exclusively on decoder-only LLMs and explicitly excludes encoder transformers, image generation, and speech models. It is not a standalone serving runtime and lacks features like paged KV cache or continuous batching, so it must be paired with a runtime such as vLLM when those are needed. Agnitra AI performs single-GPU optimization only; multi-GPU sharding requires separate tools. LoRA fine-tunes are supported only after merging the adapter weights into the base model with PEFT's merge_and_unload().
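The LoRA requirement is easier to see with the merge written out. In standard LoRA, merging folds the low-rank update into the base weight, W' = W + (alpha / r) · B · A, leaving an ordinary dense matrix that downstream passes can treat like any other weight. The sketch below shows that arithmetic on tiny hand-picked matrices in plain Python; it is not PEFT's implementation, and the helper names and values are illustrative.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_b, lora_a, alpha: float, rank: int):
    """Return W + (alpha / rank) * B @ A, the merged dense weight."""
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scaling * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # base weight, 2x2
lora_b = [[1.0], [2.0]]        # B, 2x1 (rank 1)
lora_a = [[0.5, -0.5]]         # A, 1x2
merged = merge_lora(w, lora_b, lora_a, alpha=2, rank=1)
# merged == [[2.0, -1.0], [2.0, -1.0]]
```

After merging, the adapter matrices are no longer needed at inference time, so the model has the same shape and layer structure as the original base model.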