OpenJudge by agentscope-ai

AI application evaluation and quality rewards framework

Created 6 months ago
307 stars

Top 87.6% on SourcePulse

View on GitHub
Project Summary

OpenJudge is an open-source evaluation framework for AI applications such as agents and chatbots. It provides a unified workflow for data collection, grading, evaluation, and analysis, streamlining quality assessment and continuous optimization. The framework aims to make application evaluation simpler, more rigorous, and better integrated, ultimately improving application quality.

How It Works

The framework supports a systematic evaluation workflow: collect test data, define graders, run evaluations at scale, analyze weaknesses, and iterate. It offers a comprehensive library of production-ready graders and multiple flexible methods for building custom graders, including zero-shot rubric generation from task descriptions, data-driven rubric generation from examples, and training dedicated judge models. Grading results can also be converted into reward signals for fine-tuning the application under evaluation.
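
As a concrete illustration of the LLM-as-judge pattern this workflow builds on (a generic sketch, not OpenJudge's actual API), the example below grades a response against a zero-shot rubric with the OpenAI client and rescales the score into a reward signal; the rubric text, judge model, and scoring scale are all assumptions made for illustration.

```python
# Generic LLM-as-judge sketch (not OpenJudge's API): grade a response
# against a rubric and map the 1-5 score to a reward signal in [0, 1].
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

RUBRIC = (
    "Score the ASSISTANT response from 1 (poor) to 5 (excellent) for factual "
    "accuracy and helpfulness. Reply with the integer score only."
)

async def grade(question: str, answer: str) -> float:
    """Ask a judge model for a 1-5 score, then rescale it to a 0-1 reward."""
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # judge model chosen arbitrarily for this sketch
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nASSISTANT: {answer}"},
        ],
    )
    score = int(resp.choices[0].message.content.strip())  # assumes the judge obeys the format
    return (score - 1) / 4  # reward signal usable for fine-tuning

async def main() -> None:
    reward = await grade("What is 2 + 2?", "2 + 2 equals 4.")
    print(f"reward = {reward:.2f}")

asyncio.run(main())
```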

Quick Start & Requirements

  • Primary install: pip install py-openjudge
  • Prerequisites: Python 3.10+. Examples use asyncio and require LLM API keys (e.g., OpenAI); a minimal environment check follows this list.
  • Links:
    • Documentation: https://agentscope-ai.github.io/OpenJudge/
    • Quickstart Guide: Available via documentation.
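
A minimal environment check for the prerequisites above, assuming the py-openjudge distribution exposes an `openjudge` import name (an assumption; consult the documentation) and that an OpenAI API key is configured:

```python
# Sanity-check the prerequisites listed above before running any evaluation.
import importlib
import os
import sys

assert sys.version_info >= (3, 10), "OpenJudge examples target Python 3.10+"
assert os.environ.get("OPENAI_API_KEY"), "examples need an LLM API key (e.g., OpenAI)"

# The import name is assumed to be `openjudge`; adjust if the package differs.
openjudge = importlib.import_module("openjudge")
print("OpenJudge version:", getattr(openjudge, "__version__", "unknown"))
```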

Highlighted Details

  • Grader Library: Over 50 production-ready graders across General, Agent, and Multimodal domains, validated with benchmark datasets and pytest.
  • Multi-Scenario Coverage: Supports diverse domains including text, code, math, agents, and multimodal tasks.
  • Holistic Agent Evaluation: Assesses the entire agent lifecycle, including trajectories, memory, reflection, and tool use.
  • Flexible Grader Building: Supports customization via Python/prompts, zero-shot/data-driven rubric generation, and training custom judge models (a generic sketch follows this list).
  • Integrations: Seamlessly connects with observability platforms (LangSmith, Langfuse) and training frameworks (VERL).
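
To make the "customization via Python" path concrete, here is a generic custom-grader pattern (the interface is hypothetical, not OpenJudge's actual base class): a deterministic grader that scores an agent's tool call against reference arguments and returns a score plus a rationale, with the score usable directly as a reward signal.

```python
# Generic custom-grader pattern (hypothetical interface, not OpenJudge's API):
# score an agent's tool call by comparing its arguments to a reference.
from dataclasses import dataclass

@dataclass
class GradeResult:
    score: float      # 0.0-1.0, usable directly as a reward signal
    rationale: str

class ToolArgsGrader:
    """Scores a tool call by how many of its arguments match the expected values."""

    def __init__(self, expected_args: dict):
        self.expected_args = expected_args

    def __call__(self, tool_call: dict) -> GradeResult:
        actual = tool_call.get("arguments", {})
        matched = sum(1 for k, v in self.expected_args.items() if actual.get(k) == v)
        score = matched / max(len(self.expected_args), 1)
        return GradeResult(score, f"{matched}/{len(self.expected_args)} arguments matched")

grader = ToolArgsGrader({"city": "Paris", "unit": "celsius"})
print(grader({"name": "get_weather", "arguments": {"city": "Paris", "unit": "kelvin"}}))
# -> GradeResult(score=0.5, rationale='1/2 arguments matched')
```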

Maintenance & Community

  • Developed by "The OpenJudge Team".
  • Community engagement via a DingTalk group.
  • Recent news includes releases (v0.2.0) and research publications on reward modeling and AI feedback alignment.

Licensing & Compatibility

  • License: Not explicitly stated in the provided README.
  • Compatibility: Integrations suggest broad compatibility with common LLM observability and training platforms.

Limitations & Caveats

Version 0.2.0 is not backward compatible with the legacy v0.1.x package (rm-gallery); migrating requires updating imports and usage. The project is under active development, with "Planned" integrations indicating ongoing feature expansion.

Health Check

  • Last Commit: 17 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 59
  • Issues (30d): 5
  • Star History: 198 stars in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (coauthor of SGLang), Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research), and 3 more.

  • Trace by microsoft: AutoDiff-like tool for end-to-end AI agent training with general feedback. Top 0.3% on SourcePulse; 707 stars. Created 1 year ago; updated 1 month ago.