InfiniteBench by OpenBMB

Benchmark for evaluating language models on super-long contexts (100k+ tokens)

created 1 year ago
343 stars

Top 81.8% on sourcepulse

Project Summary

InfiniteBench provides a benchmark for evaluating Large Language Models (LLMs) on their ability to process and reason over contexts exceeding 100,000 tokens. It targets researchers and developers working on long-context LLMs, offering a standardized way, across diverse tasks, to assess performance degradation attributable solely to context length.

How It Works

InfiniteBench comprises 12 distinct tasks, a mix of real-world and synthetic scenarios, designed to test different aspects of language processing over extended contexts. Each task is deliberately solvable at shorter lengths, so any drop in score can be attributed to the longer context rather than to the task itself. The datasets have average input token counts ranging from 43.9k to over 2 million, pushing the limits of current LLM capabilities.
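
To make the scale concrete, the length of a single example can be measured with a proxy tokenizer. The sketch below is illustrative only: it assumes the tiktoken library with its cl100k_base encoding (not necessarily the tokenizer any evaluated model uses), a locally downloaded data/longbook_qa_eng.jsonl file, and a "context" field in each record; adjust the path and field name to the actual task schema.

  # Rough token count for one example, under the assumptions stated above.
  import json
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer, not a model-specific one

  with open("data/longbook_qa_eng.jsonl", encoding="utf-8") as f:
      record = json.loads(f.readline())       # first example of the task

  print(len(enc.encode(record.get("context", ""))), "tokens of context")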

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Download dataset: bash scripts/download_dataset.sh or via Hugging Face (xinrongzhang2022/InfiniteBench); a minimal loading sketch follows this list.
  • Run evaluation: python eval_<model_name>.py --task <task_name> (e.g., python eval_yarn_mistral.py --task kv_retrieval).
  • Requires a Python environment and a locally downloaded copy of the dataset.
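
As a quick sanity check before running an evaluation script, a downloaded task file can be inspected directly. The sketch below is not part of the repo's tooling; it assumes the Hugging Face datasets library is installed, that the download script placed data/kv_retrieval.jsonl locally, and that each record has a "context" column.

  # Minimal sketch: load one downloaded task file as generic JSON and peek at it.
  # The file path and the "context" column are assumptions about the task schema.
  from datasets import load_dataset

  ds = load_dataset("json", data_files="data/kv_retrieval.jsonl", split="train")
  print(ds)                      # number of examples and column names
  print(len(ds[0]["context"]))   # size of one long context, in characters

If the file loads cleanly here, the dataset download is in place for the evaluation scripts.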

Highlighted Details

  • Evaluates LLMs on context lengths significantly beyond 100K tokens.
  • Includes 12 diverse tasks covering summarization, QA, code debugging, and mathematical reasoning.
  • Benchmarks performance of models like GPT-4, Kimi-Chat, Claude 2, and YaRN-Mistral-7B.
  • Provides detailed performance metrics for each task and model.

Maintenance & Community

The project is associated with OpenBMB and acknowledges several contributors. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository's code and data are released under a permissive license, allowing for research and commercial use. The paper is published by the Association for Computational Linguistics.

Limitations & Caveats

The benchmark highlights significant performance degradation in current state-of-the-art LLMs when processing contexts beyond 100K tokens, indicating a need for further advancements in long-context handling. Some models show near-zero performance on specific tasks with extended contexts.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Didier Lopes (Founder of OpenBB), and 1 more.

RULER by NVIDIA

1.1% · 1k stars
Evaluation suite for long-context language models (research paper)
created 1 year ago · updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 1 more.

yarn by jquesnelle

1.0% · 2k stars
Context window extension method for LLMs (research paper, models)
created 2 years ago · updated 1 year ago