OpenJudge by agentscope-ai

AI application evaluation and quality rewards framework

Created 9 months ago
537 stars

Top 59.0% on SourcePulse

Project Summary

OpenJudge is an open-source evaluation framework for AI applications such as agents and chatbots. It provides a unified workflow for data collection, grading, evaluation, and analysis, streamlining quality assessment and continuous optimization. The framework aims to make application evaluation simpler, more rigorous, and better integrated into development workflows.

How It Works

The framework supports a systematic evaluation workflow: collect test data, define graders, run evaluations at scale, analyze weaknesses, and iterate. It offers a comprehensive library of production-ready graders and several flexible ways to build custom ones: zero-shot rubric generation from a task description, data-driven rubric generation from examples, and training dedicated judge models. Grading results can also be converted into reward signals for fine-tuning the application.
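
To make the loop concrete, below is a minimal sketch of the LLM-as-judge pattern this workflow describes. It uses the OpenAI Python client directly rather than OpenJudge's own API; the model name, rubric text, and to_reward mapping are illustrative assumptions, not library specifics.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Illustrative rubric; OpenJudge's built-in graders ship their own prompts.
    RUBRIC = "Rate the response 1-5 for helpfulness. Reply with the number only."

    def judge(prompt: str, response: str) -> int:
        """LLM-as-judge: grade one (prompt, response) pair against the rubric."""
        result = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model works
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Prompt: {prompt}\nResponse: {response}"},
            ],
        )
        # Assumes the model follows the reply-format instruction above.
        return int(result.choices[0].message.content.strip())

    def to_reward(score: int) -> float:
        """Map a 1-5 grade onto a [0, 1] reward signal for fine-tuning."""
        return (score - 1) / 4

Analysis then reduces to aggregating these scores over the test set and inspecting the low scorers.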

Quick Start & Requirements

  • Primary install: pip install py-openjudge
  • Prerequisites: Python 3.10+. Examples use asyncio (see the sketch after this list) and require LLM API keys (e.g., OpenAI).
  • Links:
    • Documentation: https://agentscope-ai.github.io/OpenJudge/
    • Quickstart Guide: Available via documentation.
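
Because the examples are async, evaluation runs typically fan grading calls out with asyncio. The sketch below substitutes a stand-in grade coroutine so it runs without an API key; in a real run it would wrap an LLM-backed grader (hence the API-key prerequisite):

    import asyncio

    async def grade(sample: dict) -> float:
        # Stand-in for an LLM-backed grader call.
        await asyncio.sleep(0)  # placeholder for network I/O
        return 1.0 if sample["output"] else 0.0

    async def main() -> None:
        samples = [{"output": "hello"}, {"output": ""}]
        # Run all grading calls concurrently, then aggregate.
        scores = await asyncio.gather(*(grade(s) for s in samples))
        print(f"mean score: {sum(scores) / len(scores):.2f}")

    asyncio.run(main())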

Highlighted Details

  • Grader Library: Over 50 production-ready graders across General, Agent, and Multimodal domains, validated with benchmark datasets and pytest.
  • Multi-Scenario Coverage: Supports diverse domains including text, code, math, agents, and multimodal tasks.
  • Holistic Agent Evaluation: Assesses the entire agent lifecycle, including trajectories, memory, reflection, and tool use.
  • Flexible Grader Building: Supports customization via Python/prompts, zero-shot or data-driven rubric generation (zero-shot illustrated in the sketch after this list), and training custom judge models.
  • Integrations: Seamlessly connects with observability platforms (LangSmith, Langfuse) and training frameworks (VERL).
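
As a rough illustration of the zero-shot route, the sketch below asks a general-purpose LLM to draft a rubric from a plain task description. It again uses the OpenAI client directly; OpenJudge's own rubric-generation helpers are documented in the Quickstart Guide, and the model name is an assumption.

    from openai import OpenAI

    client = OpenAI()

    def generate_rubric(task_description: str) -> str:
        """Zero-shot: draft a 1-5 grading rubric from a task description."""
        result = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model
            messages=[{
                "role": "user",
                "content": "Write a concise 1-5 grading rubric for this task:\n"
                           + task_description,
            }],
        )
        return result.choices[0].message.content

    rubric = generate_rubric("Summarize a support ticket in two sentences.")
    print(rubric)  # review and edit before wiring it into a judge prompt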

Maintenance & Community

  • Developed by "The OpenJudge Team".
  • Community engagement via a DingTalk group.
  • Recent news includes releases (v0.2.0) and research publications on reward modeling and AI feedback alignment.

Licensing & Compatibility

  • License: Not explicitly stated in the provided README.
  • Compatibility: Integrations suggest broad compatibility with common LLM observability and training platforms.

Limitations & Caveats

Version 0.2.0 is not backward compatible with the legacy v0.1.x package (rm-gallery); imports and usage must be migrated. The project appears to be under active development, with "Planned" integrations indicating ongoing feature expansion.
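
Before migrating, it can help to confirm which package is importable in the target environment. A stdlib-only check follows; the module names rm_gallery and openjudge are assumptions for illustration, so consult the project's migration notes for the real v0.1.x to v0.2.x mapping:

    import importlib.util

    # Assumed module names for illustration only.
    for name in ("rm_gallery", "openjudge"):
        spec = importlib.util.find_spec(name)
        status = "installed" if spec else "not installed"
        print(f"{name}: {status}")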

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 19
  • Issues (30d): 0
  • Star History: 84 stars in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (coauthor of SGLang), Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research), and 3 more.

Trace by microsoft

AutoDiff-like tool for end-to-end AI agent training with general feedback

Created 1 year ago · Updated 4 months ago
729 stars
Top 2.0% on SourcePulse