CLI tool for benchmarking LLMs on ARC-AGI tasks
This repository provides a framework for benchmarking Large Language Models (LLMs) on the ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) dataset. It lets researchers and developers systematically evaluate and compare the performance of various LLMs across different configurations and tasks within the benchmark.
How It Works
The framework uses a modular adapter system to interface with different LLM providers. Users define model configurations, including provider details, model names, and API parameters, in a models.yml file. The core execution script, main.py, takes a data directory, a model configuration, and optionally a specific task ID, and runs predictions. It supports single-task testing, batch processing with concurrency (via GNU parallel), and submission management for uploading results to Hugging Face.
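To make the workflow concrete, here is a sketch of single-task and batch invocations. The flag names (--data_dir, --config, --task_id), the config key claude_sonnet, the task ID, and the data path are illustrative assumptions rather than the project's documented interface; check python main.py --help for the real one.

```sh
# Single-task test run (all flags and values are hypothetical):
python main.py --data_dir data/arc-agi/evaluation \
               --config claude_sonnet \
               --task_id 0a1d4ef5

# Batch run over every task with GNU parallel, 8 jobs at a time,
# assuming one <task_id>.json file per task in the data directory:
ls data/arc-agi/evaluation/*.json | xargs -n1 basename -s .json \
  | parallel -j 8 python main.py --data_dir data/arc-agi/evaluation \
      --config claude_sonnet --task_id {}
```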
Quick Start & Requirements
Clone the repository with git clone https://github.com/arcprizeorg/model_baseline.git, then install dependencies with pip install -r requirements.txt.
Requires git, plus parallel (optional, for concurrency).
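The same steps as a copy-pasteable block; the cd step and the apt-get line are additions (install GNU parallel however your platform prefers):

```sh
git clone https://github.com/arcprizeorg/model_baseline.git
cd model_baseline
pip install -r requirements.txt

# Optional, enables concurrent batch runs (Debian/Ubuntu example):
sudo apt-get install parallel
```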
Highlighted Details
Model configuration lives in a single file (models.yml), backed by an adapter pattern for provider integrations.
A test_providers.sh script validates new provider implementations (sketched below).
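If you add a new provider adapter, an invocation along these lines might exercise it; the argument is a guess, so read test_providers.sh for its actual interface before running it.

```sh
# Hypothetical usage -- the script's real arguments (if any) are not
# documented here; inspect test_providers.sh first.
bash test_providers.sh my_new_provider
```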
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Uploading submissions to Hugging Face requires authenticating first with huggingface-cli login.
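Authentication is a one-time step; huggingface-cli prompts for an access token:

```sh
# Paste an access token from https://huggingface.co/settings/tokens
# when prompted; required before uploading any submission.
huggingface-cli login
```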