AI-Infra-Auto-Driven-SKILLS by BBuf

Agent-ready playbooks for AI infrastructure operations

Created 3 months ago

661 stars

Top 50.1% on SourcePulse

Project Summary

AI-Infra-Auto-Driven-SKILLS provides agent-ready playbooks for AI infrastructure engineers to automate LLM serving benchmarks, profiler triage, SGLang optimization, production incident debugging, and model PR intelligence. It equips agents with operational memory to perform complex tasks, aiming to reduce manual effort in performance tuning and incident resolution.

How It Works

This repository is a collection of focused "skills" (playbooks) designed for AI agents, with an emphasis on automating critical AI infrastructure tasks. Three design choices distinguish it: a stage-separated profiler workflow that keeps prefill and decode evidence apart, a framework-neutral benchmark schema for consistent comparisons across serving frameworks (SGLang, vLLM, TensorRT-LLM), and a replay-first incident triage methodology that preserves evidence and reproduces the issue before any code is changed.
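To make the idea of a framework-neutral benchmark schema concrete, here is a minimal sketch in Python. It is not the repository's actual schema: the class name, every field name, and both helper functions are assumptions chosen for illustration. The point it shows is that results from different serving frameworks share one record shape and are only compared when the workload is identical.

```python
from dataclasses import dataclass


# Hypothetical record shape; field names are illustrative assumptions,
# not taken from the repository.
@dataclass(frozen=True)
class BenchmarkRecord:
    framework: str              # e.g. "sglang", "vllm", "tensorrt-llm"
    model: str
    concurrency: int
    input_tokens: int
    output_tokens: int
    ttft_ms: float              # time to first token
    tpot_ms: float              # time per output token
    output_tokens_per_s: float  # aggregate decode throughput


def same_workload(a: BenchmarkRecord, b: BenchmarkRecord) -> bool:
    """Records are comparable only when model and request shape match."""
    return (a.model, a.concurrency, a.input_tokens, a.output_tokens) == (
        b.model, b.concurrency, b.input_tokens, b.output_tokens)


def best_by_throughput(records):
    """Return the highest-throughput record, refusing mixed workloads."""
    records = list(records)
    if not records:
        raise ValueError("no records to compare")
    first = records[0]
    if not all(same_workload(first, r) for r in records[1:]):
        raise ValueError("records describe different workloads")
    return max(records, key=lambda r: r.output_tokens_per_s)
```

Rejecting mixed workloads up front is one way to keep a cross-framework comparison fair; a real schema would likely also record hardware, precision, and framework version.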

Quick Start & Requirements

To install a skill, copy its directory into the agent's skill directory (e.g., cp -r skills/llm-serving-auto-benchmark <agent-skill-dir>/llm-serving-auto-benchmark). The README lists no software prerequisites beyond an agent environment that can execute these Python-based skills. The H100 operator runbooks additionally require a configured remote environment, including SSH aliases, container names, and workspace paths.
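The copy step can also be scripted. The sketch below is a hypothetical helper, not something the repository ships: the function name install_skill and its parameters are assumptions, and it assumes only that skills live under a skills/ directory in the checkout, as the cp example suggests.

```python
import shutil
from pathlib import Path


def install_skill(repo_root, skill_name, agent_skill_dir):
    """Copy skills/<skill_name> from a checkout into the agent's skill directory.

    Hypothetical helper: it performs the same copy as the `cp -r` example
    and returns the destination path.
    """
    src = Path(repo_root) / "skills" / skill_name
    if not src.is_dir():
        raise FileNotFoundError(f"skill not found: {src}")
    dest = Path(agent_skill_dir) / skill_name
    # dirs_exist_ok=True (Python 3.8+) lets a re-install overwrite files in place.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

A call such as install_skill(".", "llm-serving-auto-benchmark", agent_dir) would then mirror the shell command, with agent_dir standing in for whatever path your agent uses.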

Highlighted Details

Features 8 core operational skills for benchmark search, profiler analysis, SOTA performance loops, incident triage, architecture diagrams, GPU kernels, and H100 runs.
Includes 58 model optimization runbooks for SGLang and vLLM, covering a wide array of model families like DeepSeek, Qwen, Llama, and Mistral.
Provides 58 PR history dossiers that track model evolution, detailing changes, risks, and upstream ideas.
Employs a stage-separated profiler workflow to distinguish prefill and decode evidence, preventing misattribution.
Utilizes a framework-neutral benchmark schema for fair comparisons across different serving frameworks.
Offers a profiler-to-action fusion catalog that links torch-profiler rows to known optimization patterns.
Implements replay-first incident triage to preserve evidence and reproduce issues before patching.
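
As an illustration of why stage separation matters, the following sketch aggregates profiler rows per stage so that prefill time is never mixed into decode totals. The row format (dicts with "stage", "name", and "duration_us" keys) and both function names are assumptions made for this example; the repository's own workflow and data format may differ.

```python
from collections import defaultdict

STAGES = ("prefill", "decode")


def split_by_stage(rows):
    """Sum kernel time per stage.

    `rows` is an iterable of dicts with hypothetical keys:
    "stage" ("prefill" or "decode"), "name", and "duration_us".
    """
    totals = {stage: defaultdict(float) for stage in STAGES}
    for row in rows:
        stage = row["stage"]
        if stage not in totals:
            # Refuse untagged rows rather than guess which stage they belong to.
            raise ValueError(f"unknown stage: {stage!r}")
        totals[stage][row["name"]] += row["duration_us"]
    return {stage: dict(per_kernel) for stage, per_kernel in totals.items()}


def top_kernel(totals, stage):
    """Return (kernel_name, total_us) for the most expensive kernel in a stage."""
    return max(totals[stage].items(), key=lambda item: item[1])
```

Keeping one table per stage means a kernel that dominates prefill cannot be mistaken for the decode bottleneck, which is the misattribution the workflow is meant to prevent.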

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord or Slack), sponsorships, or roadmaps are present in the provided README.

Licensing & Compatibility

The README does not explicitly state a software license. This omission presents a significant caveat for potential adoption, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

The H100-specific skills necessitate careful configuration of remote environments and adherence to security practices for handling secrets. The absence of a declared license is a primary limitation for widespread or commercial adoption.
