evals-skills by hamelsmu

AI evaluation skills for robust LLM testing

Created 1 week ago


722 stars

Top 47.6% on SourcePulse

Project Summary

AI Evals Skills provides a set of pre-built functionalities, termed "skills," designed to enhance the process of evaluating AI systems, particularly Large Language Models (LLMs). Targeted at engineers and product managers involved in AI development, these skills aim to automate the identification and remediation of common pitfalls encountered during LLM evaluation, thereby improving the reliability and quality of AI outputs and saving development time.

How It Works

The project offers a plugin system that integrates with AI coding agents such as Claude Code, and is also installable via the Skills CLI. Users install individual skills, which are modular sets of instructions that guide the agent through complex evaluation tasks. The core approach encodes mistakes commonly observed across LLM evaluation projects into reusable skills, letting agents systematically audit pipelines, analyze errors, generate synthetic data, design judge prompts, validate evaluators, assess RAG systems, and build review interfaces. The result is a structured, efficient path to robust evaluations.
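
To make "modular instructions" concrete: in the Claude Code skills ecosystem, a skill is typically a directory containing a SKILL.md file whose YAML frontmatter tells the agent when to apply it, with the body holding the step-by-step guidance. A minimal illustrative sketch follows; the frontmatter fields reflect the general skills format, but the contents are hypothetical and not taken from this repository:

```markdown
---
name: eval-audit
description: Audit an LLM evaluation pipeline and report problems by severity.
---

# Eval Audit

1. Enumerate every evaluator in the pipeline and what it claims to measure.
2. Check each evaluator for common failure modes (leaked labels, trivial
   pass criteria, judges grading their own outputs).
3. Report findings ranked by severity, with a suggested fix for each.
```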

Quick Start & Requirements

Installation varies by environment:

  • Claude Code: Register the plugin repository with /plugin marketplace add hamelsmu/evals-skills, then install with /plugin install evals-skills@hamelsmu-evals-skills. Restart Claude Code after installation. Skills are invoked via /evals-skills:<skill-name>.
  • Skills CLI (npx): Install the entire repository with npx skills add https://github.com/hamelsmu/evals-skills, or a single skill such as eval-audit by appending --skill eval-audit. Check for updates with npx skills check and apply them with npx skills update.

Beyond the respective CLI tools, no hardware or software prerequisites are documented.
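
The Skills CLI steps above, collected for copy-paste (commands as given in the README; the Claude Code slash commands run inside the Claude Code REPL, not a shell):

```shell
# Install every skill in the repository
npx skills add https://github.com/hamelsmu/evals-skills

# ...or just one skill, e.g. eval-audit
npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit

# Check for and apply updates
npx skills check
npx skills update
```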

Highlighted Details

  • eval-audit: Audits eval pipelines, surfacing problems with prioritized severity.
  • error-analysis: Guides users through reading traces and categorizing failures.
  • generate-synthetic-data: Creates diverse synthetic test inputs using dimension-based tuple generation.
  • write-judge-prompt: Designs LLM-as-Judge evaluators for subjective quality criteria.
  • validate-evaluator: Calibrates LLM judges against human labels using data splits, TPR/TNR, and bias correction.
  • evaluate-rag: Evaluates retrieval and generation quality in RAG pipelines.
  • build-review-interface: Builds custom annotation interfaces for human trace review.
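
To illustrate the idea behind validate-evaluator, here is a minimal sketch of calibrating a binary LLM judge against human labels: measure the judge's true-positive and true-negative rates on a human-labeled split, then apply a Rogan-Gladen-style correction to the judge's observed pass rate on unlabeled data. The function names are illustrative; the actual skill guides an agent through this workflow rather than shipping this code.

```python
def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True-positive and true-negative rates of the judge vs. human labels.

    Assumes both classes appear in the human labels (pos > 0 and neg > 0).
    """
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

def corrected_pass_rate(observed_rate: float, tpr: float, tnr: float) -> float:
    """Estimate the true pass rate from the judge's observed pass rate.

    Solves observed = theta * TPR + (1 - theta) * (1 - TNR) for theta.
    """
    return (observed_rate + tnr - 1) / (tpr + tnr - 1)
```

A judge with TPR 0.9 and TNR 0.8 reporting a 70% pass rate implies a true pass rate of roughly 71.4%, not 70%; with weaker judges the gap grows, which is why calibration against human labels matters.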

Maintenance & Community

The README does not name maintainers, community channels (e.g., Discord or Slack), or a public roadmap; the associated "AI Evals For Engineers & PMs" course is the implied avenue for support and further learning.

Licensing & Compatibility

The specific open-source license is not stated in the provided README. Consequently, notes regarding commercial use or compatibility with closed-source projects are absent.

Limitations & Caveats

These skills represent a starting point, addressing only common, generalizable mistakes in LLM evaluations. For optimal performance, users are encouraged to develop custom skills tailored to their specific technology stack, domain, and data. Furthermore, the project does not encompass broader aspects of evaluation workflows such as production monitoring, CI/CD integration, or in-depth data analysis.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 0
  • Star History: 724 stars in the last 12 days

Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (SVP at GitHub; founder of Turborepo; author of Formik, TSDX), and 3 more.

Explore Similar Projects

  • human-eval by openai: Evaluation harness for LLMs trained on code. 3k stars; created 4 years ago, last updated 1 year ago.