agentic-harness-engineering  by china-qijizhifeng

Observability-driven evolution for coding agents

Created 1 month ago
441 stars

Top 67.2% on SourcePulse


Summary

Agentic Harness Engineering (AHE) is an open observability system that automatically evolves the harness around a coding agent while keeping the base model fixed. It targets researchers and engineers who want to improve agent performance by optimizing system prompts, tool descriptions and implementations, and middleware. In the reported results, AHE raises GPT-5.4 pass@1 on Terminal-Bench 2.0 from 69.7% to 77.0% over 10 iterations, and the evolved harnesses transfer to other base models.

How It Works

AHE runs an iterative evaluate-analyze-improve loop driven by three observability layers: component tracking (git), experience distillation (an Agent Debugger that processes traces), and decision support (an Evolve Agent that proposes evidence-backed edits). Harness components such as prompts, tools, and skills are refined based on trace analysis. Each iteration's evaluation then tests the predictions behind the preceding edits; falsified predictions guide further refinement, and the accumulated results encode general engineering experience.
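The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not AHE's actual code: Harness, Edit, evolve, and the toy_* functions are hypothetical names, the harness is reduced to a single prompt string, and discarding an edit whose prediction fails is an assumed policy.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical stand-in for the editable harness (here, just a prompt)."""
    system_prompt: str
    versions: list = field(default_factory=list)  # stand-in for git history


@dataclass
class Edit:
    """A proposed change plus the outcome it predicts."""
    new_prompt: str
    predicted_min_passed: int


def evolve(harness, evaluate, analyze, propose, iterations):
    """Run evaluate -> analyze -> improve, keeping only edits whose prediction holds."""
    passed, traces = evaluate(harness)
    for _ in range(iterations):
        findings = analyze(traces)                    # experience distillation
        edit = propose(harness, findings, passed)     # decision support
        candidate = Harness(edit.new_prompt,
                            harness.versions + [harness.system_prompt])
        new_passed, new_traces = evaluate(candidate)  # evaluation tests the prediction
        if new_passed >= edit.predicted_min_passed:
            harness, passed, traces = candidate, new_passed, new_traces
        # Otherwise the prediction was falsified; this sketch keeps the old harness.
    return harness, passed


# Toy stand-ins so the sketch runs: each "verify" hint fixes one of 10 tasks.
def toy_evaluate(h):
    passed = min(10, 6 + h.system_prompt.count("verify"))
    return passed, [f"task-{i}: failed" for i in range(passed, 10)]


def toy_analyze(traces):
    return [f"{len(traces)} tasks still failing"]


def toy_propose(h, findings, passed):
    return Edit(h.system_prompt + " verify", predicted_min_passed=passed + 1)


final, passed = evolve(Harness("You are a coding agent."),
                       toy_evaluate, toy_analyze, toy_propose, 3)
print(passed, len(final.versions))  # 9 3
```

Attaching a prediction to every edit is what makes the evaluation step informative: an edit that does not deliver its predicted gain is visible as a refuted hypothesis rather than as noise.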

Quick Start & Requirements

Requires Python ≥ 3.13, uv, and tmux. Experiments run in E2B sandboxes (SaaS or self-hosted).

  • Install: git clone the repository, then run uv sync.
  • Configure: set environment variables for the LLM and sandbox API keys (e.g., LLM_API_KEY, E2B_API_KEY).
  • Pre-build E2B templates: uv run python scripts/build_templates.py --dataset-dir /path/to/dataset -j 16
  • Launch an experiment: ./scripts/evolve.sh configs/experiments/exp-003-simple-code-gpt54.yaml
  • Datasets: use a local path, or reference one via dataset: "<name>@<ver>".
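Before launching, it can help to confirm the prerequisites listed above. The following is a minimal preflight sketch, not part of the AHE repository: the preflight function is hypothetical, and the two environment variable names are only the examples given above, so a real configuration may need more.

```python
import os
import shutil
import sys

MIN_PYTHON = (3, 13)
REQUIRED_TOOLS = ("uv", "tmux")
# Only the example variables named above; the full set is an assumption.
REQUIRED_ENV = ("LLM_API_KEY", "E2B_API_KEY")


def preflight(env=None, which=shutil.which, version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.13")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:          # tool not found on PATH
            problems.append(f"missing tool: {tool}")
    for name in REQUIRED_ENV:
        if not env.get(name):            # unset or empty
            problems.append(f"unset environment variable: {name}")
    return problems
```

Calling preflight() with no arguments checks the current interpreter, PATH, and environment; the optional parameters exist only so the check can be exercised without touching the real system.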

Highlighted Details

  • AHE (GPT-5.5) ranked #3 on Terminal-Bench 2.0 (84.7% pass@1).
  • Lifts GPT-5.4 pass@1 on Terminal-Bench 2.0 from 69.7% to 77.0% over 10 iterations.
  • Surpasses the hand-written Codex harness (71.9%) and the self-evolving ACE and TF-GRPO baselines.
  • Frozen, evolved harnesses transfer to SWE-bench-Verified and to four alternate base models, suggesting the improvements are not specific to one benchmark or model.
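To put the GPT-5.4 figures above in perspective, the short calculation below derives the absolute gain, the relative gain, and the share of failures removed. It uses only the percentages quoted in the list and assumes the Codex figure was measured on the same benchmark and base model.

```python
baseline = 69.7  # GPT-5.4 pass@1 before evolution (%)
evolved = 77.0   # GPT-5.4 pass@1 after 10 iterations (%)
codex = 71.9     # hand-written Codex figure (%)

absolute_gain = evolved - baseline                # percentage points
relative_gain = absolute_gain / baseline * 100    # % improvement over baseline
failures_removed = absolute_gain / (100 - baseline) * 100  # % of failures fixed
margin_over_codex = evolved - codex               # percentage points

print(f"absolute gain:     {absolute_gain:.1f} points")      # 7.3
print(f"relative gain:     {relative_gain:.1f}%")            # 10.5
print(f"failures removed:  {failures_removed:.1f}%")         # 24.1
print(f"margin over Codex: {margin_over_codex:.1f} points")  # 5.1
```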

Maintenance & Community

The README does not identify maintainers, community channels, sponsors, or a development roadmap.

Licensing & Compatibility

Released under the MIT license, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The Agent Debugger is only partially open-sourced. SaaS E2B sandbox users must manage concurrent sandbox limits to avoid stalling experiments.
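One common way to stay under a concurrency cap is to gate task launches behind a semaphore. The sketch below shows that pattern with asyncio; it does not use the E2B SDK or any AHE code, and run_all and FakeSandboxPool are hypothetical names, with the fake pool standing in for real sandbox calls.

```python
import asyncio


async def run_all(task_ids, max_concurrent, run_one):
    """Run every task, never allowing more than max_concurrent at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(task_id):
        async with sem:                  # waits here when the cap is reached
            return await run_one(task_id)

    # gather returns results in the same order as task_ids
    return await asyncio.gather(*(guarded(t) for t in task_ids))


class FakeSandboxPool:
    """Stand-in for a sandbox provider; records peak concurrent usage."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def run(self, task_id):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)        # pretend to do work in the sandbox
        self.active -= 1
        return f"{task_id}: done"


pool = FakeSandboxPool()
results = asyncio.run(run_all(range(10), 3, pool.run))
print(len(results), pool.peak)  # 10 3
```

Setting max_concurrent at or below the account's sandbox quota means extra tasks queue locally instead of failing or blocking on sandbox creation.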

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 440 stars in the last 30 days
