ApodexAI
Evaluation harness for deep research AI
Summary
AgentHarness is an open-source evaluation harness for reproducing public benchmark results of Apodex-1.0, a verification-centric deep research model. It targets researchers and engineers, providing a standard ReAct setup for evaluating AI agent performance on deep research tasks.
How It Works
The harness evaluates models with a standard ReAct (Reasoning and Acting) agent setup. It integrates with serving frameworks such as SGLang and needs API keys for the agent's tools (web search, code execution). Together these give standardized, reproducible tests of agent capabilities on deep research benchmarks.
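As a rough illustration of the ReAct pattern described above, the following minimal sketch alternates between a model decision and a tool call until the model returns a final answer. It is not AgentHarness code: `react_loop`, `scripted_model`, and the `search` tool are hypothetical names, and the scripted policy stands in for a real served model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A model step returns (kind, tool_name, payload); kind is "action" or "final".
# This interface is an assumption made for illustration only.
Step = Tuple[str, str, str]


def react_loop(
    question: str,
    model: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate reasoning (model) and acting (tools) until a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        kind, name, payload = model(transcript)
        if kind == "final":
            return payload
        # Run the requested tool and feed its output back as an observation.
        tool = tools.get(name)
        observation = tool(payload) if tool else f"unknown tool: {name}"
        transcript.append(f"Action: {name}[{payload}]")
        transcript.append(f"Observation: {observation}")
    return None  # step budget exhausted without a final answer


def scripted_model(transcript: List[str]) -> Step:
    """Hypothetical stand-in policy: search once, then answer with the result."""
    if not any(line.startswith("Observation:") for line in transcript):
        return ("action", "search", "capital of France")
    return ("final", "", transcript[-1].removeprefix("Observation: "))


answer = react_loop(
    "What is the capital of France?",
    scripted_model,
    {"search": lambda query: "Paris"},
)
print(answer)
```

In a real run the scripted policy would be replaced by calls to the served model, and the tool table by actual web search and code execution clients.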
Quick Start & Requirements
Install dependencies with uv on Python 3.12 (uv sync --python 3.12). Serve the model with SGLang (python3 -m sglang.launch_server ...), supplying the model path (e.g., apodex/Apodex-1.0-35B-A3B), the tensor-parallelism degree (tp), and the parser configuration. API keys for OpenAI, Serper, Jina, and E2B are mandatory and are set as environment variables in a .env file. Download the benchmark datasets with wget and unzip them with the password apodex*()_2026. Run benchmarks with uv run python -m benchmarks.runner.run_subprocess. Links: Apodex-1.0-35B-A3B model, HLE dataset license.
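Because the API keys are mandatory, it can help to check the environment before starting a long benchmark run. The sketch below is one way to do that. The variable names in `REQUIRED_KEYS` are assumptions: the summary names only the providers, so check the project's .env template for the exact names.

```python
import os
from typing import Iterable, List, Mapping

# Assumed variable names, one per provider named in the summary.
# The real names may differ; treat these as placeholders.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "SERPER_API_KEY",
    "JINA_API_KEY",
    "E2B_API_KEY",
)


def missing_keys(
    env: Mapping[str, str], required: Iterable[str] = REQUIRED_KEYS
) -> List[str]:
    """Return the required variables that are unset or blank in env."""
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the check easy to test against a plain dictionary.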
Highlighted Details
Reported results show Apodex-1.0 variants scoring up to 71.5% on BrowseComp, 80.6% on BrowseComp-ZH, 46.8% on HLE-Text, and 82.2% on DeepSearchQA. The harness supports numerous benchmarks, including BrowseComp, DeepSearchQA, and xbench-DeepResearch. Each benchmark question runs in an isolated subprocess, which improves reproducibility and makes failures easier to debug.
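To show why per-question isolation helps, here is a small sketch in which each question runs in its own Python subprocess, so a crash or hang in one question is recorded without taking down the rest of the run. This is not the harness's actual runner: `run_isolated` and the inline `WORKER` script are hypothetical, and the worker only uppercases the question id in place of real agent work.

```python
import json
import subprocess
import sys

# Hypothetical worker script: a stand-in for the real per-question agent run.
WORKER = """
import json, sys
qid = sys.argv[1]
if qid == "crash":
    raise SystemExit(3)  # simulate a question that kills its process
print(json.dumps({"id": qid, "answer": qid.upper()}))
"""


def run_isolated(qid: str, timeout: float = 30.0) -> dict:
    """Run one question in a fresh interpreter and report its outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER, qid],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"id": qid, "status": "timeout"}
    if proc.returncode != 0:
        return {"id": qid, "status": "error", "returncode": proc.returncode}
    return {"id": qid, "status": "ok", **json.loads(proc.stdout)}


results = [run_isolated(qid) for qid in ("q1", "crash", "q2")]
for result in results:
    print(result)
```

The failing question yields an error record while its neighbours still complete, and any single question can be rerun on its own when debugging.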
Maintenance & Community
No specific details regarding maintainers, community channels, or roadmap were found in the provided README.
Licensing & Compatibility
The project is licensed under the Apache 2.0 license, which is permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
The Humanity's Last Exam (HLE) text benchmark is not included in the main dataset download due to its license prohibiting answer redistribution. Users must manually accept the HLE license and place the dataset file at benchmarks/datasets/HLE-text/standardized_data.jsonl to run this benchmark.
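Since the HLE file must be placed by hand, a quick pre-flight check can confirm it is where the harness expects it and that it parses as JSON Lines. The sketch below uses the path given above. The helper name `count_records` is hypothetical, and the check assumes only that each non-blank line is a JSON value; it makes no assumption about field names.

```python
import json
from pathlib import Path

# Path stated in the summary, relative to the repository root.
HLE_PATH = Path("benchmarks/datasets/HLE-text/standardized_data.jsonl")


def count_records(path: Path) -> int:
    """Count JSON Lines records, raising if the file is missing or malformed."""
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found: accept the HLE license and place the file there"
        )
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            json.loads(line)  # raises json.JSONDecodeError on a bad record
            count += 1
    return count


if __name__ == "__main__":
    try:
        print(f"HLE-text records found: {count_records(HLE_PATH)}")
    except FileNotFoundError as err:
        print(err)
```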