AI-Infra-Auto-Driven-SKILLS by BBuf

Agent-ready playbooks for AI infrastructure operations

Created 1 month ago
413 stars

Top 70.4% on SourcePulse

Project Summary

AI-Infra-Auto-Driven-SKILLS provides agent-ready playbooks for AI infrastructure engineers to automate LLM serving benchmarks, profiler triage, SGLang optimization, production incident debugging, and model PR intelligence. It equips agents with operational memory to perform complex tasks, aiming to reduce manual effort in performance tuning and incident resolution.

How It Works

The repository is a collection of focused "skills", or playbooks, designed for AI agents to automate critical AI infrastructure tasks. Three design choices set it apart: a stage-separated profiler workflow that isolates prefill and decode evidence, a framework-neutral benchmark schema for consistent comparisons across serving frameworks (SGLang, vLLM, TensorRT-LLM), and a replay-first incident triage methodology that preserves evidence and reproduces the issue before any code is changed.
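To make the idea of a framework-neutral benchmark schema concrete, here is a minimal sketch. The class name, field names, and numbers are all assumptions made for illustration; the repository's actual schema is not shown in this summary. The point is only that every framework's run is reduced to the same fields, so records can be compared directly.

```python
from dataclasses import dataclass


# Hypothetical schema: these field names are illustrative assumptions,
# not the repository's real benchmark format.
@dataclass(frozen=True)
class BenchmarkRecord:
    framework: str        # serving framework label
    model: str            # model identifier
    concurrency: int      # number of concurrent requests
    output_tokens: int    # total tokens generated during the run
    duration_s: float     # wall-clock duration of the run in seconds

    @property
    def output_tokens_per_s(self) -> float:
        """Output throughput derived from the shared fields."""
        return self.output_tokens / self.duration_s


def best_by_throughput(records):
    """Return the record with the highest output throughput."""
    return max(records, key=lambda r: r.output_tokens_per_s)


# Made-up numbers for two placeholder frameworks, not measurements.
records = [
    BenchmarkRecord("framework-a", "example-model", 32, 120_000, 60.0),
    BenchmarkRecord("framework-b", "example-model", 32, 90_000, 60.0),
]
winner = best_by_throughput(records)
print(winner.framework, winner.output_tokens_per_s)
```

Because throughput is computed from shared fields rather than reported by each framework's own tooling, the comparison does not depend on how any one framework defines its metrics.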

Quick Start & Requirements

Installation consists of copying the desired skills directly into an agent's skill directory (e.g., cp -r skills/llm-serving-auto-benchmark <agent-skill-dir>/llm-serving-auto-benchmark). The README lists no software prerequisites beyond an agent environment capable of executing these Python-based skills. The H100 operator runbooks additionally require remote environment configuration: SSH aliases, container names, and workspace paths.
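For setups that script the installation, the copy step can be sketched in Python. The function name install_skill and the demo directory layout are assumptions for illustration; only the skills/<name> source path and the skill name come from the command quoted in this section, and the agent skill directory remains whatever your agent expects.

```python
import shutil
import tempfile
from pathlib import Path


def install_skill(repo_root: Path, skill_name: str, agent_skill_dir: Path) -> Path:
    """Copy skills/<skill_name> from the repo into the agent's skill directory."""
    src = repo_root / "skills" / skill_name
    if not src.is_dir():
        raise FileNotFoundError(f"skill not found: {src}")
    dest = agent_skill_dir / skill_name
    # dirs_exist_ok=True (Python 3.8+) lets a re-run refresh an existing copy.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


# Demo against a throwaway directory tree standing in for a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    skill = repo / "skills" / "llm-serving-auto-benchmark"
    skill.mkdir(parents=True)
    (skill / "placeholder.txt").write_text("demo")  # placeholder file, not a real skill file
    installed = install_skill(repo, "llm-serving-auto-benchmark", Path(tmp) / "agent-skills")
    print(installed.name)
```

Unlike a bare cp -r, this version fails early when the skill name is misspelled and overwrites an existing copy in place instead of nesting a second directory inside it.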

Highlighted Details

  • Features 8 core operational skills for benchmark search, profiler analysis, SOTA performance loops, incident triage, architecture diagrams, GPU kernels, and H100 runs.
  • Includes 58 model optimization runbooks for SGLang and vLLM, covering a wide array of model families like DeepSeek, Qwen, Llama, and Mistral.
  • Provides 58 PR history dossiers that track model evolution, detailing changes, risks, and upstream ideas.
  • Employs a stage-separated profiler workflow to distinguish prefill and decode evidence, preventing misattribution.
  • Utilizes a framework-neutral benchmark schema for fair comparisons across different serving frameworks.
  • Offers a profiler-to-action fusion catalog that links torch-profiler rows to known optimization patterns.
  • Implements replay-first incident triage to preserve evidence and reproduce issues before patching.
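The stage-separated profiler workflow and the profiler-to-action catalog listed above can be pictured with a small sketch. Everything here is hypothetical: the row format, the stage labels on each row, the catalog entries, and the function names are assumptions for illustration, not the repository's data or code. The sketch only shows the two ideas: keep prefill and decode rows in separate buckets, then match row names against a catalog of known patterns.

```python
from collections import defaultdict

# Hypothetical catalog: kernel-name substring -> placeholder action label.
# These are not entries from the repository's catalog.
FUSION_CATALOG = {
    "rmsnorm": "placeholder: norm-fusion pattern",
    "rotary": "placeholder: rotary-fusion pattern",
}


def split_by_stage(rows):
    """Group profiler rows by stage so prefill and decode time are never mixed."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["stage"]].append(row)
    return dict(buckets)


def suggest_actions(rows):
    """Return (row name, catalog label) pairs for rows matching a known pattern."""
    hints = []
    for row in rows:
        for pattern, label in FUSION_CATALOG.items():
            if pattern in row["name"].lower():
                hints.append((row["name"], label))
    return hints


# Made-up rows; a real workflow would derive them from profiler output.
rows = [
    {"stage": "prefill", "name": "RMSNorm_kernel", "self_us": 120.0},
    {"stage": "decode", "name": "rotary_embedding", "self_us": 15.0},
    {"stage": "decode", "name": "gemm", "self_us": 40.0},
]
by_stage = split_by_stage(rows)
decode_total_us = sum(r["self_us"] for r in by_stage["decode"])
print(decode_total_us, suggest_actions(by_stage["decode"]))
```

Summing time per bucket rather than over the whole trace is what prevents a slow prefill kernel from being blamed for decode latency, or the reverse.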

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord or Slack), sponsorships, or roadmaps are present in the provided README.

Licensing & Compatibility

The README does not explicitly state a software license. This omission presents a significant caveat for potential adoption, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

The H100-specific skills necessitate careful configuration of remote environments and adherence to security practices for handling secrets. The absence of a declared license is a primary limitation for widespread or commercial adoption.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 23
Issues (30d): 2
Star History: 276 stars in the last 30 days
