OpenJudge by agentscope-ai

AI application evaluation and quality rewards framework

Created 6 months ago
307 stars

Top 87.6% on SourcePulse

View on GitHub
Project Summary

OpenJudge is an open-source evaluation framework for AI applications such as agents and chatbots. It provides a unified workflow for data collection, grading, evaluation, and analysis, streamlining quality assessment and continuous optimization. The framework aims to make application evaluation simpler, more rigorous, and better integrated, ultimately improving application quality.

How It Works

The framework supports a systematic evaluation workflow: collect test data, define graders, run evaluations at scale, analyze weaknesses, and iterate. It offers a comprehensive library of production-ready graders and multiple flexible methods for building custom graders, including zero-shot rubric generation from task descriptions, data-driven rubric generation from examples, and training dedicated judge models. Grading results can also be converted into reward signals for fine-tuning the application under evaluation.
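
As a concrete illustration of the LLM-as-judge pattern this workflow builds on (a generic sketch, not OpenJudge's actual API), the example below grades a response against a zero-shot rubric with the OpenAI client and rescales the score into a reward signal; the rubric text, judge model, and scoring scale are all assumptions made for illustration.

```python
# Generic LLM-as-judge sketch (not OpenJudge's API): grade a response
# against a rubric and map the 1-5 score to a reward signal in [0, 1].
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

RUBRIC = (
    "Score the ASSISTANT response from 1 (poor) to 5 (excellent) for factual "
    "accuracy and helpfulness. Reply with the integer score only."
)

async def grade(question: str, answer: str) -> float:
    """Ask a judge model for a 1-5 score, then rescale it to a 0-1 reward."""
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # judge model chosen arbitrarily for this sketch
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nASSISTANT: {answer}"},
        ],
    )
    score = int(resp.choices[0].message.content.strip())  # assumes the judge obeys the format
    return (score - 1) / 4  # reward signal usable for fine-tuning

async def main() -> None:
    reward = await grade("What is 2 + 2?", "2 + 2 equals 4.")
    print(f"reward = {reward:.2f}")

asyncio.run(main())
```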

Quick Start & Requirements

  • Primary install: pip install py-openjudge
  • Prerequisites: Python 3.10+. Examples use asyncio and require LLM API keys (e.g., OpenAI); a minimal environment check follows this list.
  • Links:
    • Documentation: https://agentscope-ai.github.io/OpenJudge/
    • Quickstart Guide: Available via documentation.
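
A minimal environment check for the prerequisites above, assuming the py-openjudge distribution exposes an `openjudge` import name (an assumption; consult the documentation) and that an OpenAI API key is configured:

```python
# Sanity-check the prerequisites listed above before running any evaluation.
import importlib
import os
import sys

assert sys.version_info >= (3, 10), "OpenJudge examples target Python 3.10+"
assert os.environ.get("OPENAI_API_KEY"), "examples need an LLM API key (e.g., OpenAI)"

# The import name is assumed to be `openjudge`; adjust if the package differs.
openjudge = importlib.import_module("openjudge")
print("OpenJudge version:", getattr(openjudge, "__version__", "unknown"))
```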

Highlighted Details

  • Grader Library: Over 50 production-ready graders across General, Agent, and Multimodal domains, validated with benchmark datasets and pytest.
  • Multi-Scenario Coverage: Supports diverse domains including text, code, math, agents, and multimodal tasks.
  • Holistic Agent Evaluation: Assesses the entire agent lifecycle, including trajectories, memory, reflection, and tool use.
  • Flexible Grader Building: Supports customization via Python/prompts, zero-shot/data-driven rubric generation, and training custom judge models (a generic sketch follows this list).
  • Integrations: Seamlessly connects with observability platforms (LangSmith, Langfuse) and training frameworks (VERL).
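
To make the "customization via Python" path concrete, here is a generic custom-grader pattern (the interface is hypothetical, not OpenJudge's actual base class): a deterministic grader that scores an agent's tool call against reference arguments and returns a score plus a rationale, with the score usable directly as a reward signal.

```python
# Generic custom-grader pattern (hypothetical interface, not OpenJudge's API):
# score an agent's tool call by comparing its arguments to a reference.
from dataclasses import dataclass

@dataclass
class GradeResult:
    score: float      # 0.0-1.0, usable directly as a reward signal
    rationale: str

class ToolArgsGrader:
    """Scores a tool call by how many of its arguments match the expected values."""

    def __init__(self, expected_args: dict):
        self.expected_args = expected_args

    def __call__(self, tool_call: dict) -> GradeResult:
        actual = tool_call.get("arguments", {})
        matched = sum(1 for k, v in self.expected_args.items() if actual.get(k) == v)
        score = matched / max(len(self.expected_args), 1)
        return GradeResult(score, f"{matched}/{len(self.expected_args)} arguments matched")

grader = ToolArgsGrader({"city": "Paris", "unit": "celsius"})
print(grader({"name": "get_weather", "arguments": {"city": "Paris", "unit": "kelvin"}}))
# -> GradeResult(score=0.5, rationale='1/2 arguments matched')
```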

Maintenance & Community

  • Developed by "The OpenJudge Team".
  • Community engagement via a DingTalk group.
  • Recent news includes releases (v0.2.0) and research publications on reward modeling and AI feedback alignment.

Licensing & Compatibility

  • License: Not explicitly stated in the provided README.
  • Compatibility: Integrations suggest broad compatibility with common LLM observability and training platforms.

Limitations & Caveats

Version 0.2.0 is not backward compatible with the legacy v0.1.x package (rm-gallery); migrating requires updating imports and usage. The project is under active development, with "Planned" integrations indicating ongoing feature expansion.

Health Check

  • Last Commit: 17 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 59
  • Issues (30d): 5
  • Star History: 198 stars in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (coauthor of SGLang), Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research), and 3 more.

  • Trace by microsoft: AutoDiff-like tool for end-to-end AI agent training with general feedback. Top 0.3% on SourcePulse; 707 stars. Created 1 year ago; updated 1 month ago.