android-bench by android-bench

Benchmarking LLMs for Android development tasks

Created 4 months ago

290 stars

Top 90.6% on SourcePulse

Project Summary

This framework benchmarks Large Language Models (LLMs) on Android development tasks, evaluating their ability to comprehend mobile codebases, generate accurate patches, and solve engineering problems. It provides tooling for researchers and engineers to assess AI models acting as Android developers, offering a standardized environment for code generation and verification against test suites using a curated dataset.

How It Works

The project employs a two-stage benchmarking process: an "Inference (Agent)" stage where the LLM reads an issue description and generates code modifications, followed by an "Evaluation (Verifier)" stage that applies the patch and runs tests to score the solution. It utilizes Docker containers for isolated task execution and a curated dataset to ensure a standardized and reproducible testing environment.
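
The two stages described above can be sketched as a small pipeline. This is only an illustration of the control flow, not the project's actual code: `TaskResult`, `run_benchmark_task`, and `pass_rate` are hypothetical names, the agent and verifier are passed in as plain callables, and reporting the score as a pass rate is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    task_id: str
    patch: str
    passed: bool


def run_benchmark_task(
    task_id: str,
    issue_text: str,
    agent: Callable[[str], str],
    verifier: Callable[[str, str], bool],
) -> TaskResult:
    """Run one task: inference first, then evaluation."""
    # Stage 1, Inference (Agent): the model reads the issue and proposes a patch.
    patch = agent(issue_text)
    # Stage 2, Evaluation (Verifier): apply the patch and run the task's tests.
    passed = verifier(task_id, patch)
    return TaskResult(task_id=task_id, patch=patch, passed=passed)


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose tests passed (0.0 for an empty run)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)
```

In the real framework the verifier step runs inside a Docker container; here it is reduced to a callable so the sketch stays self-contained.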

Quick Start & Requirements

Install and run: Clone the repository, create and activate a Python virtual environment with uv venv, then run uv run setup_env to install dependencies, configure the oracle agent, and build the Docker images. Run a task with run_task --model <model_name> --task <task_id>.
Prerequisites: x86_64 architecture with KVM capabilities, Python 3.14+, uv, Docker, and API keys for the models being benchmarked.
Resource footprint: Local Docker images are disk and memory intensive, potentially requiring 40GB+ of free space per image (base, repo, task). Initial Docker image builds for a task can take 5-10 minutes or more.
Links: User Guide, Troubleshooting Guide.
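
Before running the setup, it may help to check the prerequisites listed above. The following is a minimal sketch, assuming the thresholds stated in this summary (Python 3.14+, roughly 40GB free, uv and docker on PATH); `preflight` is a hypothetical helper and not part of the project.

```python
import shutil
import sys

# Values taken from the prerequisites above; adjust if the project changes them.
MIN_PYTHON = (3, 14)
MIN_FREE_GB = 40
REQUIRED_TOOLS = ("uv", "docker")


def preflight(path=".", version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    problems = []

    # Interpreter version of the Python running this check.
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )

    # Command-line tools that must be reachable on PATH.
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")

    # Free disk space on the filesystem that contains `path`.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < MIN_FREE_GB:
        problems.append(
            f"only {free_gb:.0f} GB free at {path!r}, "
            f"about {MIN_FREE_GB} GB may be needed per image"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

The disk check looks only at the filesystem containing `path`; Docker may store images elsewhere, so point `path` at Docker's data directory if it lives on a different volume.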

Highlighted Details

Specialized LLM evaluation for Android development tasks, focusing on code understanding and patch generation.
Automated workflow involving an agent for code modification and a verifier for test-based evaluation.
Interactive dataset explorer (dataset browse) and HTML summary generation (results) for visualizing benchmark outcomes.
Support for models via LiteLLM, requiring environment variable configuration for API keys (e.g., GEMINI_API_KEY, OPENAI_API_KEY).
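
Since API keys are supplied through environment variables, a run can fail late if one is missing. A minimal sketch of an early check follows; `missing_api_keys` is a hypothetical helper, and which variables are actually needed depends on the models being benchmarked (GEMINI_API_KEY and OPENAI_API_KEY are the examples named above).

```python
import os


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Example key names from this summary; extend for other providers as needed.
    missing = missing_api_keys(["GEMINI_API_KEY", "OPENAI_API_KEY"])
    if missing:
        print("Missing API keys:", ", ".join(missing))
```

Treating an empty string as missing is a deliberate choice here, because an exported but blank variable is a common shell mistake.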

Maintenance & Community

The project is actively exploring community engagement and encourages feedback via the issue tracker. Specific contributing guidelines are available.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. This license is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Local execution on ARM64 macOS is severely limited because Docker Desktop does not provide nested virtualization (KVM), so workarounds are required. Reliance on local Docker images is noted as a v1 limitation and is both disk and memory intensive.
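
Whether a host is affected by the KVM limitation can be probed before building any images. The sketch below is an assumption-laden illustration: `kvm_status` is a hypothetical helper, and it relies on the standard Linux convention that KVM is exposed as the device node /dev/kvm.

```python
import os
import platform


def kvm_status(machine=None, kvm_path="/dev/kvm"):
    """Return "ok" or a short description of why KVM looks unusable."""
    machine = platform.machine() if machine is None else machine

    # The benchmark expects an x86_64 host; Windows reports the same CPU as AMD64.
    if machine.lower() not in ("x86_64", "amd64"):
        return f"unsupported architecture: {machine}"

    # On Linux, KVM is exposed as a device node when the module is loaded.
    if not os.path.exists(kvm_path):
        return "KVM device not found"

    # The current user needs read/write access to use hardware virtualization.
    if not os.access(kvm_path, os.R_OK | os.W_OK):
        return "KVM device present but not accessible to this user"
    return "ok"


if __name__ == "__main__":
    print(kvm_status())
```

On an Apple Silicon Mac this reports an unsupported architecture (arm64), which matches the limitation described above. A passing result on the host does not guarantee that the device is also visible inside a container.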
