GitTaskBench by QuantaAlpha

Code agent benchmark for real-world repository tasks

Created 9 months ago

255 stars

Top 98.8% on SourcePulse

Project Summary

A benchmark and tooling suite for evaluating code agents on real-world, repository-level tasks. GitTaskBench addresses the gap in existing benchmarks by focusing on tasks requiring comprehensive understanding and utilization of full-scale GitHub repositories, offering a more authentic assessment of agent capabilities for developers and researchers.

How It Works

GitTaskBench evaluates LLM agents on 54 representative tasks with real-world economic value, each mapped to a fixed GitHub repository. This approach mirrors how developers solve complex problems using existing open-source projects. The benchmark systematically assesses an agent's ability to leverage repository code, focusing on "Execution Completion Rate" and "Task Pass Rate" with task-specific, predefined metrics.
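
As a rough illustration of how the two headline rates could be tallied, the sketch below assumes a hypothetical per-task record with two booleans, executed and passed. The record layout, the function name, the task IDs, and the choice to divide both rates by the total number of tasks are assumptions made for illustration. They are not taken from GitTaskBench, whose own definitions take precedence.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record: field names are assumptions, not GitTaskBench's schema.
    task_id: str
    executed: bool  # the agent produced output that could be evaluated
    passed: bool    # the output met the task-specific, predefined metric


def summarize(results):
    """Return (execution completion rate, task pass rate) as fractions of all tasks."""
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    ecr = sum(r.executed for r in results) / total
    # Assumption: a task only counts as passed if it also executed.
    tpr = sum(r.executed and r.passed for r in results) / total
    return ecr, tpr


# Made-up results for four hypothetical tasks.
demo = [
    TaskResult("task_a", True, True),
    TaskResult("task_b", True, False),
    TaskResult("task_c", False, False),
    TaskResult("task_d", True, True),
]
ecr, tpr = summarize(demo)
print(f"ECR={ecr:.2f} TPR={tpr:.2f}")
```

With this made-up data, three of four tasks execute and two pass, so the sketch reports an execution completion rate of 0.75 and a task pass rate of 0.50.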

Quick Start & Requirements

Primary install:
1. Clone the repository.
2. Create a conda environment: conda create -n gittaskbench python=3.10 -y
3. Activate it: conda activate gittaskbench
4. Install the pinned PyTorch builds: pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
5. Install GitTaskBench from the repository root (cd GitTaskBench), using either pip install -e . or pip install -r requirements.txt
Prerequisites: Python 3.10, PyTorch with CUDA 11.3 support, torchvision, torchaudio.
Running: To evaluate a single task, run gittaskbench grade --taskid <taskid>. To evaluate all tasks, run gittaskbench grade --all. To analyze the results, run gittaskbench eval.
Links: Repo, OpenHands Configuration Guide, SWE-Agent Configuration Guide, Aider Configuration Guide.
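
The grading commands listed above can be scripted. The following is a minimal sketch that only builds the command lines. It uses the three invocations named in this summary (grade --taskid, grade --all, eval) and nothing else; the helper names grade_command and plan and the task ID example_task are hypothetical and chosen for illustration.

```python
import shlex


def grade_command(task_id=None):
    """Build the argv for `gittaskbench grade`, for one task or for all tasks."""
    if task_id is None:
        return ["gittaskbench", "grade", "--all"]
    return ["gittaskbench", "grade", "--taskid", task_id]


def plan(task_ids):
    """Grade each listed task, then finish with the results analysis step."""
    commands = [grade_command(t) for t in task_ids]
    commands.append(["gittaskbench", "eval"])
    return commands


# "example_task" is a placeholder, not a real GitTaskBench task ID.
for argv in plan(["example_task"]):
    print(shlex.join(argv))
```

To actually run the plan, each argv could be passed to subprocess.run(argv, check=True) inside the activated gittaskbench environment.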

Highlighted Details

Multi-Modal Support: Encompasses vision, language, audio, time-series, and web-based data.
Diverse Task Types: Features generation, recognition, enhancement, analysis, and simulation tasks across 7 domains: Image, Video, Speech, Physiological Signals, Security, Web Scraping, and Office Document Processing.
Real-World Relevance: Tasks are derived from practical applications and possess real-world economic value.
Agent Framework Integration: Provides integration guidelines for state-of-the-art agent frameworks like OpenHands, SWE-Agent, and Aider.
Cost-Aware Metric: Includes a cost-aware α metric for evaluation.

Maintenance & Community

Founded by academics from Tsinghua University, Peking University, CAS, CMU, and HKUST, the project welcomes community contributions for bug fixes, new features, documentation, and test cases. No explicit community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

The repository's README does not specify a license. Until the maintainers clarify the terms, suitability for commercial use or closed-source integration is uncertain.

Limitations & Caveats

The README does not detail specific limitations, known bugs, or unsupported platforms. The installation instructions use a placeholder URL (your-org/GitTaskBench.git) for cloning, which must be replaced with the actual repository location. The pinned PyTorch 1.11.0 builds target CUDA 11.3, so newer hardware or software environments may need adjustments to run them.
