meta-harness-tbench2-artifact  by stanford-iris-lab

Agent scaffold for terminal LLM evaluation

Created 1 week ago

596 stars

Top 54.7% on SourcePulse

Summary

This project provides Meta-Harness, an agent scaffold designed to improve LLM agent performance in terminal environments, specifically targeting the Terminal-Bench 2.0 benchmark. Its main benefit is cutting the time an agent spends exploring its environment at the start of a task, which is useful for developers evaluating or deploying agents in interactive command-line scenarios.

How It Works

Meta-Harness extends the Terminus-KIRA agent with "environment bootstrapping." Before the agent runs, it captures a snapshot of the sandbox environment, including the working directory, available tools, and system configuration. The snapshot is injected into the agent's initial prompt, supplying context that would otherwise take several exploration turns to gather (e.g., ls, which python3), so the agent can start on the task sooner. According to the project, this agent was discovered through automated harness evolution.
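
The bootstrapping idea can be illustrated with a minimal sketch. This is not the project's code: the function names, the probed tool list, and the snapshot layout are all assumptions made for illustration, and the real harness may collect different facts or format them differently.

```python
import platform
import shutil
from pathlib import Path

# Hypothetical tool list; the real harness may probe a different set.
TOOLS = ("python3", "pip", "git", "gcc", "node", "curl")


def capture_snapshot(workdir=".", tools=TOOLS, max_entries=50):
    """Collect facts an agent would otherwise spend its first turns discovering."""
    root = Path(workdir).resolve()
    # Mark directories with a trailing slash, as `ls -p` would.
    entries = sorted(p.name + ("/" if p.is_dir() else "") for p in root.iterdir())
    return {
        "cwd": str(root),
        "entries": entries[:max_entries],
        # shutil.which returns the executable path, or None if it is not on PATH.
        "tools": {name: shutil.which(name) for name in tools},
        "os": f"{platform.system()} {platform.release()}",
    }


def render_snapshot(snapshot):
    """Format the snapshot as a plain-text block for the prompt."""
    found = [f"{n} -> {p}" for n, p in snapshot["tools"].items() if p]
    missing = [n for n, p in snapshot["tools"].items() if not p]
    lines = [
        "[environment snapshot]",
        f"cwd: {snapshot['cwd']}",
        f"os: {snapshot['os']}",
        "files: " + (", ".join(snapshot["entries"]) or "(empty)"),
        "tools found: " + (", ".join(found) or "(none)"),
        "tools missing: " + (", ".join(missing) or "(none)"),
    ]
    return "\n".join(lines)


def build_initial_prompt(task, snapshot):
    """Prepend the snapshot so the agent sees it before the task text."""
    return render_snapshot(snapshot) + "\n\n[task]\n" + task
```

Calling build_initial_prompt("fix the failing build", capture_snapshot(".")) would yield a prompt that already answers what ls and which python3 would have reported, which is the saving the project describes.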

Quick Start & Requirements

  • Primary install: pip install harbor
  • Prerequisites: ANTHROPIC_API_KEY environment variable.
  • Run command: harbor run --agent-import-path agent:AgentHarness -d terminal-bench@2.0 -m anthropic/claude-opus-4-6 -e runloop -n 20 --n-attempts 5
  • Dependencies: Built upon KRAFTON AI's Terminus-KIRA and Harbor's Terminus-2 framework.
  • Links: No official quick-start or documentation links are provided in the README.
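
As a convenience, the prerequisites above could be checked before launching a run. The sketch below is an assumption-laden wrapper, not part of the project: preflight and main are hypothetical helper names, and the command is copied verbatim from the run command listed above without interpreting its flags.

```python
import os
import shutil
import subprocess

# Copied verbatim from the run command in the Quick Start list.
HARBOR_CMD = [
    "harbor", "run",
    "--agent-import-path", "agent:AgentHarness",
    "-d", "terminal-bench@2.0",
    "-m", "anthropic/claude-opus-4-6",
    "-e", "runloop",
    "-n", "20",
    "--n-attempts", "5",
]


def preflight(env=None, which=shutil.which):
    """Return a list of unmet prerequisites (empty if everything is in place)."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    if which("harbor") is None:
        problems.append("harbor not found on PATH (pip install harbor)")
    return problems


def main():
    """Run the benchmark only if the preflight check passes."""
    problems = preflight()
    if problems:
        raise SystemExit("\n".join(problems))
    raise SystemExit(subprocess.call(HARBOR_CMD))
```

Calling main() from the repository root would either report what is missing or hand off to harbor. Other credentials may be needed depending on the execution environment; the summary lists only ANTHROPIC_API_KEY.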

Highlighted Details

  • Achieves a 76.4% score on the Terminal-Bench 2.0 benchmark, evaluated over 89 tasks with 5 trials each using Anthropic's Claude Opus 4.6 model.
  • Performance breakdown by difficulty: 100.0% on Easy, 81.1% on Medium, and 64.7% on Hard tasks.
  • Saves the agent an estimated 2-5 exploration commands at the start of a task.
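
The headline figure can be sanity-checked with a little arithmetic. The sketch below assumes each of the 89 tasks was attempted 5 times and that the score is a plain pass rate pooled over all trials; the summary does not confirm either point, so the implied pass count is an inference, not a reported number.

```python
TASKS = 89
ATTEMPTS = 5            # assumed to apply to every task
REPORTED_SCORE = 0.764  # headline score from the summary

total_trials = TASKS * ATTEMPTS
# Nearest whole number of passing trials consistent with the reported score.
implied_passes = round(REPORTED_SCORE * total_trials)
recovered = implied_passes / total_trials

print(total_trials, implied_passes, f"{recovered:.1%}")
# prints: 445 340 76.4%
```

Under these assumptions, 340 passing trials out of 445 reproduces the reported 76.4%. The per-difficulty figures cannot be checked the same way because the number of tasks in each difficulty tier is not given.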

Maintenance & Community

  • Acknowledgements: KRAFTON AI provided compute support.
  • No other details regarding contributors, community channels (e.g., Discord/Slack), or roadmap are present in the provided README.

Licensing & Compatibility

  • No license information is specified in the README. This absence poses a significant adoption blocker for commercial or closed-source integration.

Limitations & Caveats

The project states "More details coming soon," indicating that the current documentation is incomplete. Specific limitations regarding platform compatibility, unsupported features, or known bugs are not detailed.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 607 stars in the last 10 days