android-bench  by android-bench

Benchmarking LLMs for Android development tasks

Created 2 months ago
276 stars

Top 93.8% on SourcePulse

Project Summary

This framework benchmarks Large Language Models (LLMs) on Android development tasks, evaluating their ability to comprehend mobile codebases, generate accurate patches, and solve engineering problems. It provides tooling for researchers and engineers to assess AI models acting as Android developers, offering a standardized environment for code generation and verification against test suites using a curated dataset.

How It Works

The project benchmarks in two stages. In the "Inference (Agent)" stage, the LLM reads an issue description and generates code modifications. In the "Evaluation (Verifier)" stage, the patch is applied and tests are run to score the solution. Each task executes in an isolated Docker container against a curated dataset, which keeps the testing environment standardized and reproducible.
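The two-stage flow can be pictured with a minimal Python sketch. It is not the project's code: Task, Result, run_benchmark, and pass_rate are hypothetical names, the agent and verifier are plain callables standing in for the LLM call and the Docker-based test run, and reporting a pass rate is an assumption about how scores might be aggregated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    task_id: str
    issue: str  # issue description shown to the agent


@dataclass
class Result:
    task_id: str
    patch: str
    passed: bool


def run_benchmark(
    tasks: List[Task],
    agent: Callable[[str], str],           # hypothetical: issue text -> patch
    verifier: Callable[[str, str], bool],  # hypothetical: (task_id, patch) -> tests passed?
) -> List[Result]:
    results = []
    for task in tasks:
        patch = agent(task.issue)                # Stage 1: Inference (Agent)
        passed = verifier(task.task_id, patch)   # Stage 2: Evaluation (Verifier)
        results.append(Result(task.task_id, patch, passed))
    return results


def pass_rate(results: List[Result]) -> float:
    # Assumed aggregate: fraction of tasks whose tests passed.
    return sum(r.passed for r in results) / len(results) if results else 0.0
```

Keeping the agent and verifier as separate callables mirrors the separation described above: the model under test only ever produces a patch, and the scoring step never depends on which model produced it.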

Quick Start & Requirements

  • Install and run: Clone the repository, create and activate a Python virtual environment with uv venv, then run uv run setup_env, which installs dependencies, configures the oracle agent, and builds the Docker images. Run a task with run_task --model <model_name> --task <task_id>.
  • Prerequisites: x86_64 architecture with KVM capabilities, Python 3.14+, uv, Docker, and API keys for the models being benchmarked.
  • Resource footprint: Local Docker images are disk and memory intensive, potentially requiring 40+ GB of free space per image (base, repo, task). Initial Docker image builds for tasks can take 5-10 minutes or more.
  • Links: User Guide, Troubleshooting Guide.
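
Before running setup, the prerequisites listed above can be checked with a short script. The following is a sketch, not part of the project: preflight is a hypothetical helper, looking for /dev/kvm is an assumed way to detect KVM on Linux, and the 40 GB threshold is only the rough per-image figure quoted above.

```python
import os
import platform
import shutil
import sys


def preflight(min_python=(3, 14), min_free_gb=40, path="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # x86_64 is reported as "AMD64" on some platforms.
    if platform.machine().lower() not in ("x86_64", "amd64"):
        problems.append(f"architecture is {platform.machine()}, expected x86_64")

    # Assumption: KVM availability is detected via the Linux device node.
    if not os.path.exists("/dev/kvm"):
        problems.append("KVM device /dev/kvm not found")

    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {found} found, {wanted}+ required")

    for tool in ("uv", "docker"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.0f} GB free disk space at {path!r}, "
            f"want at least {min_free_gb} GB"
        )
    return problems
```

Calling preflight() and printing each returned string gives a quick readout of what is missing; API keys are not covered here because they depend on the model being benchmarked.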

Highlighted Details

  • Specialized LLM evaluation for Android development tasks, focusing on code understanding and patch generation.
  • Automated workflow involving an agent for code modification and a verifier for test-based evaluation.
  • Interactive dataset explorer (dataset browse) and HTML summary generation (results) for visualizing benchmark outcomes.
  • Support for models via LiteLLM, requiring environment variable configuration for API keys (e.g., GEMINI_API_KEY, OPENAI_API_KEY).
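
A missing API key is easy to catch before a run starts. The helper below is a hypothetical sketch (missing_api_keys is not a project or LiteLLM function); which variables a given model actually needs depends on the provider, so the list passed in is the caller's assumption.

```python
import os


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # The two variable names come from the examples above.
    missing = missing_api_keys(["GEMINI_API_KEY", "OPENAI_API_KEY"])
    if missing:
        print("Missing API keys:", ", ".join(missing))
```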

Maintenance & Community

The maintainers are still exploring how to engage the community and encourage feedback via the issue tracker. Contributing guidelines are available.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. The license permits commercial use and integration into closed-source projects, subject to its attribution and notice requirements.

Limitations & Caveats

Local execution on ARM64 macOS is severely limited because Docker Desktop lacks nested virtualization (KVM), so workarounds are required. Reliance on local Docker images is noted as a v1 limitation and is both disk and memory intensive.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 4
  • Star history: 30 stars in the last 30 days
