BALROG by balrog-ai

Agentic LLM and VLM reasoning benchmark

Created 1 year ago

260 stars

Top 97.5% on SourcePulse

1 Expert Loves This Project

Chuan Li

Chief Scientific Officer at Lambda

Project Summary

BALROG is a benchmark for evaluating the agentic capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) on complex, long-horizon interactive tasks. It uses reinforcement learning environments to assess reasoning and decision-making, giving researchers and engineers a common evaluation framework. The project aims to standardize evaluation so that model performance can be compared directly, supporting the development of more capable AI agents.

How It Works

BALROG is a benchmark framework that integrates with a range of reinforcement learning environments. Agents backed by LLMs or VLMs are run inside these environments and evaluated on tasks that require sequential decision-making and reasoning. The approach covers diverse interactive scenarios and supports both text-based and multimodal agents.
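The agent-in-environment loop described above can be illustrated with a small sketch. Everything in it is hypothetical and is not BALROG's API: `LineWorld` is a toy text environment, `ScriptedAgent` stands in for a model-backed agent, and `run_episode` is an illustrative helper. It only shows the general pattern of observing, acting, and accumulating reward over multiple steps.

```python
from dataclasses import dataclass


@dataclass
class LineWorld:
    """Hypothetical toy environment (not part of BALROG): walk right to a goal."""
    goal: int = 3
    max_steps: int = 20
    position: int = 0
    steps: int = 0

    def reset(self) -> str:
        self.position = 0
        self.steps = 0
        return self._observe()

    def step(self, action: str):
        # Apply the action, then report the new observation, reward, and done flag.
        if action == "right":
            self.position += 1
        elif action == "left":
            self.position = max(0, self.position - 1)
        self.steps += 1
        reached = self.position >= self.goal
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else 0.0
        return self._observe(), reward, done

    def _observe(self) -> str:
        # Text observation, as a language-model agent would receive it.
        return f"You are at position {self.position}. The goal is at {self.goal}."


class ScriptedAgent:
    """Stand-in for an LLM-backed agent; a real agent would query a model here."""

    def __init__(self):
        self.history = []  # past observations, analogous to prompt context

    def act(self, observation: str) -> str:
        self.history.append(observation)
        return "right"


def run_episode(env, agent):
    """Run one episode and return (total reward, number of steps taken)."""
    obs = env.reset()
    total = 0.0
    done = False
    while not done:
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total, env.steps
```

Swapping `ScriptedAgent.act` for a call to a language model, with the history included in the prompt, would give the basic shape of a text-based agent; a multimodal agent would receive an image alongside the text observation.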

Quick Start & Requirements

Installation is recommended via Conda:

conda create -n balrog python=3.10 -y
conda activate balrog
git clone https://github.com/balrog-ai/BALROG.git
cd BALROG
pip install -e .
balrog-post-install

Prerequisites are Python 3.10 and Conda; macOS users also need wget. Local evaluation with vLLM requires vllm and numpy==1.23. API-based evaluation requires an API key for OpenAI, Anthropic, or Google Gemini, set either as an environment variable or in a SECRETS file. Documentation and guides cover evaluation, agent development, and few-shot learning.
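Before starting an API-based run, it can help to check which provider keys are visible to the process. The sketch below is a hypothetical helper, not part of BALROG, and the variable names in `ASSUMED_KEY_VARS` are assumptions based on common provider conventions; the names BALROG actually reads should be confirmed in its documentation.

```python
import os

# Assumed environment variable names (not confirmed by the BALROG README).
ASSUMED_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def available_providers(environ=None):
    """Return the providers whose assumed key variable is set and non-empty."""
    environ = os.environ if environ is None else environ
    return sorted(
        name for name, var in ASSUMED_KEY_VARS.items() if environ.get(var)
    )


if __name__ == "__main__":
    found = available_providers()
    print("Providers with keys set:", ", ".join(found) if found else "none")
```

Passing a plain dictionary as `environ` makes the check easy to test without touching the real environment.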

Highlighted Details

Comprehensive evaluation of agentic abilities for LLMs and VLMs.
Direct integration with popular AI APIs (OpenAI, Anthropic, Google Gemini).
Support for local model deployment and evaluation via vLLM.
Facilitates easy integration of custom agents, new environments, and new models.

Maintenance & Community

Contribution guidelines are provided, and community involvement is encouraged. The README does not list community channels such as Discord or Slack, notable contributors, or sponsorships.

Licensing & Compatibility

This project is licensed under the MIT License, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

macOS users need extra setup: install wget and set an environment variable (export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES) to mitigate potential fork() errors. The README does not state an alpha/beta status or list known bugs.
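When an evaluation is launched from a Python script rather than an interactive shell, the same macOS setting can be applied to the child process's environment. This is a minimal sketch under that assumption; `subprocess_env` is a hypothetical helper, and the README's documented route remains the shell `export`.

```python
import os
import platform


def subprocess_env(base=None, system=None):
    """Copy the environment and add the macOS fork-safety override if needed."""
    env = dict(os.environ if base is None else base)
    system = platform.system() if system is None else system
    if system == "Darwin":  # platform.system() reports "Darwin" on macOS
        # Keep any value the user has already chosen.
        env.setdefault("OBJC_DISABLE_INITIALIZE_FORK_SAFETY", "YES")
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so that the launched process inherits the setting on macOS and is unchanged on other platforms.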

Health Check

Last Commit

3 months ago

Responsiveness

1 day

Pull Requests (30d)

Issues (30d)

Star History

5 stars in the last 30 days