arc-agi-benchmarking  by arcprize

CLI tool for benchmarking LLMs on ARC-AGI tasks

created 9 months ago
291 stars

Top 91.6% on sourcepulse

Project Summary

This repository provides a framework for benchmarking Large Language Models (LLMs) on the ARC-AGI (Abstraction and Reasoning Corpus - Artificial General Intelligence) dataset. It enables researchers and developers to systematically evaluate and compare the performance of various LLMs across different configurations and tasks within the ARC-AGI benchmark.

How It Works

The framework uses a modular adapter system to interface with different LLM providers. Users define model configurations (provider details, model names, and API parameters) in a models.yml file. The core execution script, main.py, takes a data directory, a model configuration, and optionally a specific task ID, then runs predictions. It supports single-task testing, concurrent batch processing (via the parallel command-line utility), and submission management for uploading results to Hugging Face.
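
The adapter-and-configuration flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the names ModelConfig, ProviderAdapter, EchoAdapter, ADAPTERS, and build_adapter are hypothetical, the config keys are assumed, and a plain dict stands in for one parsed models.yml entry.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    """One model entry as it might look after parsing models.yml (assumed keys)."""
    name: str
    provider: str
    model_name: str
    params: dict = field(default_factory=dict)


class ProviderAdapter(ABC):
    """Common interface every provider adapter implements (hypothetical)."""

    def __init__(self, config: ModelConfig):
        self.config = config

    @abstractmethod
    def predict(self, prompt: str) -> str:
        """Return the model's raw text response for a prompt."""


class EchoAdapter(ProviderAdapter):
    """Stand-in provider that makes no network call, for illustration only."""

    def predict(self, prompt: str) -> str:
        return f"[{self.config.model_name}] {prompt}"


# Registry mapping a provider name in the config to its adapter class.
ADAPTERS = {"echo": EchoAdapter}


def build_adapter(raw: dict) -> ProviderAdapter:
    """Turn one raw config entry into a ready-to-use adapter."""
    config = ModelConfig(**raw)
    if config.provider not in ADAPTERS:
        raise ValueError(f"no adapter registered for provider {config.provider!r}")
    return ADAPTERS[config.provider](config)


# A dict standing in for one models.yml entry (all values are made up).
raw_entry = {
    "name": "echo-low-temp",
    "provider": "echo",
    "model_name": "echo-1",
    "params": {"temperature": 0.0, "max_tokens": 256},
}
adapter = build_adapter(raw_entry)
print(adapter.predict("example prompt"))
```

Under this design, supporting a new provider means writing one adapter class and registering it; the calling code only ever sees the shared predict interface.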

Quick Start & Requirements

  • Install: run git clone https://github.com/arcprizeorg/model_baseline.git, then run pip install -r requirements.txt from the cloned directory.
  • Prerequisites: Python 3.x, git, parallel (optional, for concurrency).
  • Documentation: ARC Prize

Highlighted Details

  • Supports testing ARC-AGI-1 and ARC-AGI-2 tasks.
  • Includes CLI tools for validating model outputs and uploading submissions to Hugging Face.
  • Facilitates adding new LLM providers and models via a configuration file (models.yml) and adapter pattern.
  • Allows fine-grained testing of models with different configurations (e.g., temperature, max tokens).
  • Provides a test_providers.sh script for validating new provider implementations.
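
To illustrate what validating a model output can involve, here is a minimal sketch. It assumes the public ARC task format, in which a grid is a non-empty rectangular list of rows holding integers from 0 to 9, and it scores by exact match. The function names is_valid_grid and is_correct are hypothetical and are not taken from the repository's CLI tools.

```python
def is_valid_grid(grid) -> bool:
    """Check that grid is a non-empty rectangular list of rows of ints 0-9."""
    if not isinstance(grid, list) or not grid:
        return False
    if not all(isinstance(row, list) and row for row in grid):
        return False
    width = len(grid[0])
    for row in grid:
        if len(row) != width:
            return False
        for cell in row:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(cell, bool) or not isinstance(cell, int):
                return False
            if not 0 <= cell <= 9:
                return False
    return True


def is_correct(predicted, expected) -> bool:
    """A prediction counts only if it is well formed and matches exactly."""
    return is_valid_grid(predicted) and predicted == expected


expected = [[0, 1], [1, 0]]
print(is_correct([[0, 1], [1, 0]], expected))  # well formed and identical
print(is_correct([[0, 1], [1]], expected))     # ragged rows are rejected
```

Checking shape and cell values before comparing keeps malformed model output (ragged rows, strings, out-of-range values) from raising errors later in a batch run.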

Maintenance & Community

  • Contributions are welcome, particularly for adding new model adapters.
  • Further information is available via the ARC Prize.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking requires clarification.

Limitations & Caveats

  • The specific license is not detailed, which may impact commercial adoption.
  • Authentication for Hugging Face uploads requires manual setup via environment variables or huggingface-cli login.
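
As a rough sketch of the environment-variable route, the helper below looks for a token before an upload is attempted. HF_TOKEN and HUGGING_FACE_HUB_TOKEN are variables recognized by the huggingface_hub library; whether this repository's upload tool reads them is an assumption, and find_hf_token is a hypothetical helper. A token stored on disk by huggingface-cli login is not checked here.

```python
import os


def find_hf_token(env=None):
    """Return the first non-empty Hugging Face token found in env, else None.

    env defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    for var in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = env.get(var, "").strip()
        if token:
            return token
    return None


if find_hf_token() is None:
    print("No token in the environment; run huggingface-cli login or set HF_TOKEN.")
else:
    print("Token found in the environment.")
```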

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0

Star History

  • 34 stars in the last 90 days
