meta-harness-tbench2-artifact by stanford-iris-lab

Agent scaffold for terminal LLM evaluation

Created 3 months ago

1,125 stars

Top 33.5% on SourcePulse

Project Summary

Summary

This project provides Meta-Harness, an agent scaffold for improving LLM agent performance in terminal environments, targeting the Terminal-Bench 2.0 benchmark. Its main benefit is cutting the turns an agent spends on initial environment exploration, which helps developers who evaluate or deploy agents in interactive command-line scenarios.

How It Works

Meta-Harness extends the Terminus-KIRA agent with "environment bootstrapping." Before the agent runs, the harness captures a snapshot of the sandbox environment, including the working directory, available tools, and system configuration. The snapshot is injected into the agent's initial prompt, supplying context that would otherwise take several exploration turns to gather (e.g., ls, which python3), so the agent can start on the task sooner. The scaffold itself was discovered through automated harness evolution.
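The bootstrapping idea described above can be sketched in a few lines. This is only an illustrative approximation, not the project's code: the function names (capture_snapshot, render_snapshot, build_initial_prompt), the list of probed tools, and the prompt layout are all assumptions, and the sketch inspects the local machine rather than a remote sandbox.

```python
import os
import platform
import shutil


def capture_snapshot(tools=("python3", "git", "gcc", "node"), max_entries=20):
    """Collect the facts an agent would otherwise discover with ls / which.

    Hypothetical helper: the tool list and entry cap are illustrative choices.
    """
    cwd = os.getcwd()
    entries = sorted(os.listdir(cwd))[:max_entries]
    # shutil.which returns the executable path, or None if the tool is absent.
    found = {tool: shutil.which(tool) for tool in tools}
    system = f"{platform.system()} {platform.release()} ({platform.machine()})"
    return {"cwd": cwd, "entries": entries, "tools": found, "system": system}


def render_snapshot(snap):
    """Format the snapshot as plain text suitable for a prompt preamble."""
    lines = ["[Environment snapshot]", f"Working directory: {snap['cwd']}"]
    lines.append("Files: " + (", ".join(snap["entries"]) or "(empty)"))
    for name, path in snap["tools"].items():
        lines.append(f"{name}: {path or 'not found'}")
    lines.append(f"System: {snap['system']}")
    return "\n".join(lines)


def build_initial_prompt(task, snap):
    """Prepend the rendered snapshot to the task description."""
    return render_snapshot(snap) + "\n\nTask:\n" + task
```

Calling build_initial_prompt("Fix the failing test.", capture_snapshot()) yields a prompt whose first lines already answer the questions an agent would typically spend its opening turns on.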

Quick Start & Requirements

Primary install: pip install harbor
Prerequisites: ANTHROPIC_API_KEY environment variable.
Run command: harbor run --agent-import-path agent:AgentHarness -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 -e runloop -n 20 --n-attempts 5
Dependencies: Built upon KRAFTON AI's Terminus-KIRA and Harbor's Terminus-2 framework.
Links: No official quick-start or documentation links are provided in the README.
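For scripted runs, the quick-start steps above could be wrapped in a small launcher. The sketch below is an assumption-laden convenience, not part of the project: check_prerequisites and launch are hypothetical names, and the argument list simply repeats the run command quoted above without interpreting what each flag means.

```python
import os
import shlex
import subprocess

# Arguments copied verbatim from the run command listed above.
HARBOR_ARGS = [
    "harbor", "run",
    "--agent-import-path", "agent:AgentHarness",
    "-d", "terminal-bench@2.0",
    "-m", "anthropic/claude-opus-4-6",
    "-e", "runloop",
    "-n", "20",
    "--n-attempts", "5",
]


def check_prerequisites(env=None):
    """Fail early if the API key named in the prerequisites is missing."""
    env = os.environ if env is None else env
    if not env.get("ANTHROPIC_API_KEY"):
        raise RuntimeError("ANTHROPIC_API_KEY is not set")


def launch(dry_run=True):
    """Return the command string; execute it only when dry_run is False."""
    command = shlex.join(HARBOR_ARGS)
    if not dry_run:
        check_prerequisites()
        subprocess.run(HARBOR_ARGS, check=True)
    return command
```

With the default dry_run=True, launch() only returns the command string, which makes it easy to inspect before committing to a full benchmark run.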

Highlighted Details

Achieves a 76.4% score on the Terminal-Bench 2.0 benchmark, evaluated across 89 tasks and 5 trials using Anthropic's Claude Opus 4.6 model.
Performance breakdown by difficulty: 100.0% on Easy, 81.1% on Medium, and 64.7% on Hard tasks.
Reduces the agent's initial environment exploration by an estimated 2-5 commands.
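To put the headline score in terms of run counts, a quick back-of-the-envelope calculation may help. It assumes, without confirmation from the README, that each of the 89 tasks was attempted 5 times and that the 76.4% figure is an unweighted pass rate over all task-trial pairs; the implied pass count is therefore an estimate, not a reported result.

```python
TASKS = 89      # tasks in the evaluation, per the summary above
TRIALS = 5      # trials, assumed here to be per task
SCORE = 0.764   # reported Terminal-Bench 2.0 score

# Total task-trial pairs under the per-task-trials assumption.
total_runs = TASKS * TRIALS

# Pass count implied by the score if it is a plain pass rate (assumption).
implied_passes = round(SCORE * total_runs)

print(f"total runs: {total_runs}")                 # total runs: 445
print(f"implied passing runs: {implied_passes}")   # implied passing runs: 340
```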

Maintenance & Community

Acknowledgements: KRAFTON AI provided compute support.
No other details regarding contributors, community channels (e.g., Discord/Slack), or roadmap are present in the provided README.

Licensing & Compatibility

No license information is specified in the README. Without a stated license, commercial or closed-source integration is effectively blocked.

Limitations & Caveats

The project states "More details coming soon," indicating that the current documentation is incomplete. Specific limitations regarding platform compatibility, unsupported features, or known bugs are not detailed.

Health Check

Last Commit

3 months ago

Responsiveness

Inactive

Star History

42 stars in the last 30 days