hamelsmu/evals-skills: AI evaluation skills for robust LLM testing
AI Evals Skills provides a set of pre-built functionalities, termed "skills," designed to enhance the process of evaluating AI systems, particularly Large Language Models (LLMs). Targeted at engineers and product managers involved in AI development, these skills aim to automate the identification and remediation of common pitfalls encountered during LLM evaluation, thereby improving the reliability and quality of AI outputs and saving development time.
How It Works
The project offers a plugin system that integrates with AI coding agents, such as those within Claude Code or accessible via the Skills CLI. Users install specific skills, which are essentially modular instructions that guide the AI agent through complex evaluation tasks. The core approach leverages common mistakes observed across numerous LLM evaluation projects, encoding them into reusable skills. This allows agents to systematically audit pipelines, analyze errors, generate synthetic data, design judge prompts, validate evaluators, assess RAG systems, and build review interfaces, offering a structured and efficient path to robust evaluations.
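To make the judge-validation step above concrete, here is a minimal Python sketch of the kind of calibration it describes: measuring an LLM judge's TPR/TNR against human labels and bias-correcting the judge's observed pass rate. The function names and toy data are invented for illustration and are not the project's actual code; the correction formula is the standard Rogan-Gladen estimator.

```python
def calibrate(judge_labels, human_labels):
    """Return (TPR, TNR) of the judge measured against human ground truth.

    Labels are 1 (pass) or 0 (fail); both lists are aligned per trace.
    """
    tp = sum(1 for j, h in zip(judge_labels, human_labels) if j and h)
    tn = sum(1 for j, h in zip(judge_labels, human_labels) if not j and not h)
    pos = sum(human_labels)
    neg = len(human_labels) - pos
    return tp / pos, tn / neg

def corrected_pass_rate(observed_rate, tpr, tnr):
    """Rogan-Gladen correction: estimate the true pass rate from the
    judge's observed pass rate, given its measured TPR and TNR."""
    return (observed_rate + tnr - 1) / (tpr + tnr - 1)

# Toy example: the judge disagrees with humans on 2 of 10 traces.
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
judge = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
tpr, tnr = calibrate(judge, human)          # (0.8, 0.8)
true_rate = corrected_pass_rate(sum(judge) / len(judge), tpr, tnr)
```

A judge with TPR and TNR near 1.0 needs little correction; as TPR + TNR approaches 1, the denominator shrinks and the judge carries essentially no signal.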
Quick Start & Requirements
Installation varies by environment:
- Claude Code: run /plugin marketplace add hamelsmu/evals-skills, then install with /plugin install evals-skills@hamelsmu-evals-skills. Restart Claude Code after installation. Skills are invoked via /evals-skills:<skill-name>.
- Skills CLI: run npx skills add https://github.com/hamelsmu/evals-skills, or install a single skill such as eval-audit with --skill eval-audit. Check for updates with npx skills check and apply them with npx skills update.
No specific hardware or software prerequisites beyond the respective CLI tools are detailed.
Highlighted Details
- eval-audit: Audits eval pipelines, surfacing problems with prioritized severity.
- error-analysis: Guides users through reading traces and categorizing failures.
- generate-synthetic-data: Creates diverse synthetic test inputs using dimension-based tuple generation.
- write-judge-prompt: Designs LLM-as-Judge evaluators for subjective quality criteria.
- validate-evaluator: Calibrates LLM judges against human labels using data splits, TPR/TNR, and bias correction.
- evaluate-rag: Evaluates retrieval and generation quality in RAG pipelines.
- build-review-interface: Builds custom annotation interfaces for human trace review.
Maintenance & Community
The README does not specify maintainers, community channels (like Discord/Slack), or a public roadmap. Support and further learning are implied through the associated "AI Evals For Engineers & PMs" course.
Licensing & Compatibility
The specific open-source license is not stated in the provided README. Consequently, notes regarding commercial use or compatibility with closed-source projects are absent.
Limitations & Caveats
These skills represent a starting point, addressing only common, generalizable mistakes in LLM evaluations. For optimal performance, users are encouraged to develop custom skills tailored to their specific technology stack, domain, and data. Furthermore, the project does not encompass broader aspects of evaluation workflows such as production monitoring, CI/CD integration, or in-depth data analysis.
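As a sketch of what such a custom skill might look like, the following assumes the SKILL.md format with YAML frontmatter used by Claude Code skills; the skill name, description, and checks below are invented for illustration and are not part of this project.

```markdown
---
name: eval-sql-checker
description: Audit SQL generated by our text-to-SQL pipeline for schema drift and unsafe queries.
---

# SQL Eval Checker

When invoked, review recent traces from the text-to-SQL pipeline and:

1. Flag queries that reference tables or columns missing from the current schema.
2. Categorize failures (hallucinated table, wrong join, unsafe mutation).
3. Summarize counts per category with example trace IDs.
```

A skill like this encodes domain-specific failure modes the generic skills cannot know about, which is the kind of tailoring the project recommends.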