CLI tool for benchmarking LLMs on ARC-AGI tasks
This repository provides a framework for benchmarking Large Language Models (LLMs) on the ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) dataset. It lets researchers and developers systematically evaluate and compare the performance of various LLMs across different configurations and tasks within the benchmark.
How It Works
The framework uses a modular adapter system to interface with different LLM providers. Users define model configurations, including provider details, model names, and API parameters, in a models.yml file. The core execution script, main.py, takes a data directory, a model configuration, and optionally a specific task ID, and runs predictions. It supports single-task testing, batch processing with concurrency (via GNU parallel), and submission management for uploading results to Hugging Face.
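To make the workflow concrete, here is a sketch of single-task and batch invocations. The flag names (--data_dir, --config, --task_id), the config key claude_sonnet, the task ID, and the data path are illustrative assumptions rather than the project's documented interface; check python main.py --help for the real one.

```sh
# Single-task test run (all flags and values are hypothetical):
python main.py --data_dir data/arc-agi/evaluation \
               --config claude_sonnet \
               --task_id 0a1d4ef5

# Batch run over every task with GNU parallel, 8 jobs at a time,
# assuming one <task_id>.json file per task in the data directory:
ls data/arc-agi/evaluation/*.json | xargs -n1 basename -s .json \
  | parallel -j 8 python main.py --data_dir data/arc-agi/evaluation \
      --config claude_sonnet --task_id {}
```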
Quick Start & Requirements
Clone the repository with git clone https://github.com/arcprizeorg/model_baseline.git, then install dependencies with pip install -r requirements.txt.
Requires git, plus parallel (optional, for concurrency).
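The same steps as a copy-pasteable block; the cd step and the apt-get line are additions (install GNU parallel however your platform prefers):

```sh
git clone https://github.com/arcprizeorg/model_baseline.git
cd model_baseline
pip install -r requirements.txt

# Optional, enables concurrent batch runs (Debian/Ubuntu example):
sudo apt-get install parallel
```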
Highlighted Details
Model configuration lives in a single file (models.yml), backed by an adapter pattern for provider integrations.
A test_providers.sh script validates new provider implementations (sketched below).
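If you add a new provider adapter, an invocation along these lines might exercise it; the argument is a guess, so read test_providers.sh for its actual interface before running it.

```sh
# Hypothetical usage -- the script's real arguments (if any) are not
# documented here; inspect test_providers.sh first.
bash test_providers.sh my_new_provider
```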
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Uploading submissions to Hugging Face requires authenticating first with huggingface-cli login.
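Authentication is a one-time step; huggingface-cli prompts for an access token:

```sh
# Paste an access token from https://huggingface.co/settings/tokens
# when prompted; required before uploading any submission.
huggingface-cli login
```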