AgentHarness by ApodexAI

Evaluation harness for deep research AI

Created 2 weeks ago

301 stars

Top 88.2% on SourcePulse

Project Summary

AgentHarness is an open-source evaluation harness for reproducing public benchmark results of Apodex-1.0, a verification-centric deep research model. It targets researchers and engineers, providing a standard ReAct setup for evaluating AI agent performance on deep research tasks.

How It Works

The harness evaluates models with a standard ReAct (Reasoning and Acting) agent setup. Models are served through a framework such as SGLang, and the agent's tools (web search, code execution) require API keys. Running every model through the same setup keeps results on deep research benchmarks standardized and reproducible.
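
The ReAct pattern named above can be sketched as a loop that alternates model output with tool observations. This is a minimal illustration, not the harness's own code: the prompt format, the regular expressions, the tool registry, and the scripted stand-in model are all hypothetical, and a real run would call a served model and real search or code-execution tools.

```python
import re
from typing import Callable, Dict

# Hypothetical tool: a real harness would call a web search API here.
def fake_search(query: str) -> str:
    return f"stub result for: {query}"

TOOLS: Dict[str, Callable[[str], str]] = {"search": fake_search}

# Assumed text conventions for actions and final answers.
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)")

def react_loop(model: Callable[[str], str], question: str, max_steps: int = 5) -> str:
    """Alternate reasoning and acting until the model emits a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action:
            name, arg = action.group(1), action.group(2)
            tool = TOOLS.get(name)
            observation = tool(arg) if tool else f"unknown tool: {name}"
            # Feed the tool result back so the next step can reason over it.
            transcript += f"Observation: {observation}\n"
    return ""  # step budget exhausted without a final answer

# Scripted stand-in for a served model, used only to make the sketch runnable.
def scripted_model(transcript: str) -> str:
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: search[capital of France]"
    return "Thought: I have what I need.\nFinal Answer: Paris"

answer = react_loop(scripted_model, "What is the capital of France?")
print(answer)
```

The step budget (max_steps) is one common way to bound a run; whether the harness uses a step limit, a token limit, or both is not stated in the summary.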

Quick Start & Requirements

Install dependencies with uv on Python 3.12 (uv sync --python 3.12). Serve the model with SGLang (python3 -m sglang.launch_server ...), supplying a model path (e.g., apodex/Apodex-1.0-35B-A3B), a tensor parallelism setting (tp), and parser configurations. API keys for OpenAI, Serper, Jina, and E2B are mandatory and are supplied as environment variables in a .env file. Download the benchmark datasets with wget and unzip them with the password apodex*()_2026. Run benchmarks with uv run python -m benchmarks.runner.run_subprocess. Links: Apodex-1.0-35B-A3B model, HLE dataset license.
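
Because the API keys are mandatory, it can help to check for them before starting a long benchmark run. The sketch below reports which keys are unset. The variable names are assumptions based on each service's usual naming; the names the harness actually reads from .env may differ.

```python
import os
from typing import Iterable, List, Mapping

# Assumed variable names; check the project's .env template for the real ones.
REQUIRED_KEYS = ("OPENAI_API_KEY", "SERPER_API_KEY", "JINA_API_KEY", "E2B_API_KEY")

def missing_keys(env: Mapping[str, str],
                 required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or blank in env."""
    return [key for key in required if not env.get(key, "").strip()]

missing = missing_keys(os.environ)
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All required API keys are set.")
```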

Highlighted Details

Performance metrics show Apodex-1.0 variants achieving up to 71.5% on BrowseComp, 80.6% on BrowseComp-ZH, 46.8% on HLE-Text, and 82.2% on DeepSearchQA. The harness supports numerous benchmarks including BrowseComp, DeepSearchQA, and xbench-DeepResearch. Each benchmark question runs in an isolated subprocess for improved reproducibility and debugging.
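
Running each question in its own subprocess means a crash or hang on one question cannot take down the rest of the run. The sketch below illustrates that idea with Python's standard subprocess module. It is not the harness's benchmarks.runner.run_subprocess implementation: the inline worker, the JSON-over-stdin protocol, and the field names are hypothetical.

```python
import json
import subprocess
import sys
from typing import Dict

# Stand-in worker: reads one question as JSON on stdin, prints a JSON result.
WORKER_CODE = """
import json, sys
item = json.load(sys.stdin)
if item.get("crash"):
    raise RuntimeError("simulated failure")
print(json.dumps({"id": item["id"], "answer": item["question"].upper()}))
"""

def run_question(item: Dict, timeout_s: float = 30.0) -> Dict:
    """Run one question in a fresh interpreter and return its result or error."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_CODE],
            input=json.dumps(item),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"id": item["id"], "error": "timeout"}
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        error = lines[-1] if lines else f"exit code {proc.returncode}"
        return {"id": item["id"], "error": error}
    return json.loads(proc.stdout)

questions = [
    {"id": 1, "question": "hello"},
    {"id": 2, "question": "boom", "crash": True},  # fails in isolation
    {"id": 3, "question": "world"},
]
results = [run_question(q) for q in questions]
for result in results:
    print(result)
```

The failure of the second question is captured as an error record, and the third question still runs. Keeping stderr per question also gives a natural place to look when debugging a single item.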

Maintenance & Community

No specific details regarding maintainers, community channels, or roadmap were found in the provided README.

Licensing & Compatibility

The project is released under the Apache 2.0 license, which permits commercial use and integration into closed-source projects.

Limitations & Caveats

The Humanity's Last Exam (HLE) text benchmark is not included in the main dataset download due to its license prohibiting answer redistribution. Users must manually accept the HLE license and place the dataset file at benchmarks/datasets/HLE-text/standardized_data.jsonl to run this benchmark.
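
A quick check that the manually placed file is present and parses as JSON Lines can catch a misplaced download before a run starts. The sketch below uses the path given above and assumes only that each non-blank line is a JSON value; it makes no assumption about the record fields, which the summary does not describe.

```python
import json
from pathlib import Path

# Path taken from the caveat above, relative to the repository root.
HLE_PATH = Path("benchmarks/datasets/HLE-text/standardized_data.jsonl")

def count_jsonl_records(path: Path) -> int:
    """Count JSON records in a .jsonl file; raise if missing or malformed."""
    if not path.is_file():
        raise FileNotFoundError(f"dataset file not found: {path}")
    count = 0
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
            count += 1
    return count

try:
    print(f"{HLE_PATH}: {count_jsonl_records(HLE_PATH)} records")
except FileNotFoundError as exc:
    print(exc)
```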

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1
Star History: 301 stars in the last 20 days
