android-bench
Benchmarking LLMs for Android development tasks
Top 93.8% on SourcePulse
This framework benchmarks Large Language Models (LLMs) on Android development tasks, evaluating their ability to comprehend mobile codebases, generate accurate patches, and solve engineering problems. It provides tooling for researchers and engineers to assess AI models acting as Android developers, offering a standardized environment for code generation and verification against test suites using a curated dataset.
How It Works
The project employs a two-stage benchmarking process: an "Inference (Agent)" stage where the LLM reads an issue description and generates code modifications, followed by an "Evaluation (Verifier)" stage that applies the patch and runs tests to score the solution. It utilizes Docker containers for isolated task execution and a curated dataset to ensure a standardized and reproducible testing environment.
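The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `Task`, `Result`, `run_benchmark`, `pass_rate`, `toy_agent`, and `toy_verifier` are hypothetical names, and the agent and verifier are plain functions standing in for a real LLM call and a containerized test run.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    task_id: str
    issue: str  # issue description shown to the model


@dataclass
class Result:
    task_id: str
    patch: str
    passed: bool


def run_benchmark(
    tasks: List[Task],
    agent: Callable[[str], str],
    verifier: Callable[[Task, str], bool],
) -> List[Result]:
    """Run both stages for every task and collect the outcomes."""
    results = []
    for task in tasks:
        patch = agent(task.issue)       # Stage 1: Inference (Agent)
        passed = verifier(task, patch)  # Stage 2: Evaluation (Verifier)
        results.append(Result(task.task_id, patch, passed))
    return results


def pass_rate(results: List[Result]) -> float:
    """Fraction of tasks whose patch passed verification."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def toy_agent(issue: str) -> str:
    # Stand-in for an LLM call (assumption): only "solves" crash reports.
    return "--- a/App.kt\n+++ b/App.kt\n" if "crash" in issue else ""


def toy_verifier(task: Task, patch: str) -> bool:
    # Stand-in for applying the patch in a container and running tests
    # (assumption): any non-empty patch counts as passing.
    return bool(patch)


if __name__ == "__main__":
    demo = [Task("t1", "App crash on rotate"), Task("t2", "Wrong label text")]
    outcome = run_benchmark(demo, toy_agent, toy_verifier)
    for r in outcome:
        print(r.task_id, "PASS" if r.passed else "FAIL")
    print("pass rate:", pass_rate(outcome))
```

Keeping the agent and the verifier as separate callables mirrors the separation the summary describes: the model never scores its own patch, and either stage can be swapped out independently.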
Quick Start & Requirements
Create a virtual environment with uv venv, then run uv run setup_env to install dependencies, configure the oracle agent, and build Docker images. Tasks are run using run_task --model <model_name> --task <task_id>.
Highlighted Details
dataset browse) and HTML summary generation (results) for visualizing benchmark outcomes.
GEMINI_API_KEY, OPENAI_API_KEY).
Maintenance & Community
The project is actively exploring community engagement and encourages feedback via the issue tracker. Specific contributing guidelines are available.
Licensing & Compatibility
Licensed under the Apache License, Version 2.0. This license is generally permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
Local execution on macOS with ARM64 architecture is severely limited due to Docker Desktop's lack of nested virtualization (KVM), requiring workarounds. Using local Docker images is noted as a v1 limitation and is both disk and memory intensive.
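A quick pre-flight check can flag the macOS ARM64 case before any images are built. The sketch below is an assumption about how such a check might look, not something the project ships: `kvm_status` is a hypothetical helper, and testing for `/dev/kvm` is simply the conventional way to detect KVM on a Linux host.

```python
import os
import platform


def kvm_status(system=None, machine=None, kvm_path="/dev/kvm"):
    """Return a short status string describing likely KVM availability."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        # Matches the limitation described above: no nested virtualization.
        return "limited: macOS on ARM64, no KVM through Docker Desktop"
    if system == "Linux" and os.access(kvm_path, os.R_OK | os.W_OK):
        return "ok: KVM device is accessible"
    return "unknown: no accessible KVM device found"


if __name__ == "__main__":
    print(kvm_status())
```

The `system`, `machine`, and `kvm_path` parameters exist only so the check can be exercised on any host; called with no arguments it inspects the current machine.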