GitTaskBench by QuantaAlpha

Code agent benchmark for real-world repository tasks

Created 6 months ago
251 stars

Top 99.9% on SourcePulse

Project Summary

A benchmark and tooling suite for evaluating code agents on real-world, repository-level tasks. GitTaskBench addresses the gap in existing benchmarks by focusing on tasks requiring comprehensive understanding and utilization of full-scale GitHub repositories, offering a more authentic assessment of agent capabilities for developers and researchers.

How It Works

GitTaskBench evaluates LLM agents on 54 representative tasks with real-world economic value, each mapped to a fixed GitHub repository. This approach mirrors how developers solve complex problems using existing open-source projects. The benchmark systematically assesses an agent's ability to leverage repository code, focusing on "Execution Completion Rate" and "Task Pass Rate" with task-specific, predefined metrics.
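As a rough illustration of how two rates like these could be tallied, the sketch below aggregates a list of per-task outcomes. The `TaskResult` record, its field names, and the definitions used here (completion means the agent run produced an evaluable output; pass means that output met the task's predefined metric) are assumptions for illustration, not GitTaskBench's actual schema or scoring code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task outcome; not GitTaskBench's real schema."""
    task_id: str
    completed: bool  # assumed: the agent run finished with an evaluable output
    passed: bool     # assumed: the output met the task's predefined metric


def summarize(results):
    """Return (execution completion rate, task pass rate) as percentages."""
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    completed = sum(r.completed for r in results)
    # A task only counts as passed if it also completed.
    passed = sum(r.completed and r.passed for r in results)
    return 100.0 * completed / total, 100.0 * passed / total


# Made-up outcomes for four hypothetical tasks.
results = [
    TaskResult("task_a", completed=True, passed=True),
    TaskResult("task_b", completed=True, passed=False),
    TaskResult("task_c", completed=False, passed=False),
    TaskResult("task_d", completed=True, passed=True),
]
ecr, tpr = summarize(results)
print(f"Execution Completion Rate: {ecr:.1f}%")
print(f"Task Pass Rate: {tpr:.1f}%")
```

Both rates are computed over all tasks here, so a task that never completes lowers both numbers; the benchmark itself may use a different denominator.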

Quick Start & Requirements

  • Primary install:
      1. Clone the repository.
      2. Create a conda environment: conda create -n gittaskbench python=3.10 -y
      3. Activate it: conda activate gittaskbench
      4. Install the pinned PyTorch versions: pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
      5. Install GitTaskBench: cd GitTaskBench && pip install -e . (or pip install -r requirements.txt)
  • Prerequisites: Python 3.10, PyTorch with CUDA 11.3 support, torchvision, torchaudio.
  • Running: Single task evaluation: gittaskbench grade --taskid <taskid>. All tasks evaluation: gittaskbench grade --all. Results analysis: gittaskbench eval.
  • Links: Repo, OpenHands Configuration Guide, SWE-Agent Configuration Guide, Aider Configuration Guide.
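
For scripted runs, the three commands listed under Running could be wrapped in a small helper. This is a minimal sketch: the helper functions and the example task id are hypothetical, and only the `gittaskbench grade --taskid`, `gittaskbench grade --all`, and `gittaskbench eval` invocations come from the text above. It defaults to a dry run that only prints each command.

```python
import shlex
import subprocess


def grade_command(task_id=None):
    """Build the argv for grading one task, or all tasks when task_id is None."""
    if task_id is None:
        return ["gittaskbench", "grade", "--all"]
    return ["gittaskbench", "grade", "--taskid", task_id]


def eval_command():
    """Build the argv for the results-analysis step."""
    return ["gittaskbench", "eval"]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(argv))
    if not dry_run:
        # Requires the gittaskbench CLI to be installed and on PATH.
        subprocess.run(argv, check=True)


# "Example_Task_01" is a placeholder id, not a real GitTaskBench task name.
run(grade_command("Example_Task_01"))
run(grade_command())
run(eval_command())
```

Passing `dry_run=False` executes the command and raises `subprocess.CalledProcessError` if it exits with a non-zero status.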

Highlighted Details

  • Multi-Modal Support: Encompasses vision, language, audio, time-series, and web-based data.
  • Diverse Task Types: Features generation, recognition, enhancement, analysis, and simulation tasks across 7 domains including Image, Video, Speech, Physiological Signals, Security, Web Scraping, and Office Document Processing.
  • Real-World Relevance: Tasks are derived from practical applications and possess real-world economic value.
  • Agent Framework Integration: Provides integration guidelines for state-of-the-art agent frameworks like OpenHands, SWE-Agent, and Aider.
  • Cost-Aware Metric: Includes a cost-aware α metric for evaluation.

Maintenance & Community

Founded by academics from Tsinghua University, Peking University, CAS, CMU, and HKUST, the project welcomes community contributions for bug fixes, new features, documentation, and test cases. No explicit community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

The repository's README does not specify a license. Until the maintainers clarify the terms, the project's suitability for commercial use or closed-source integration is unknown.

Limitations & Caveats

The README does not detail specific limitations, known bugs, or unsupported platforms. The installation instructions use a placeholder URL (your-org/GitTaskBench.git) for cloning, which must be replaced with the actual repository URL. The pinned PyTorch build (1.11.0+cu113) targets CUDA 11.3, so an older CUDA toolkit or a compatible GPU driver may be needed.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days
