InfiniteBench by OpenBMB

Benchmark for evaluating language models on super-long contexts (100k+ tokens)

created 1 year ago
343 stars

Top 81.8% on sourcepulse

Project Summary

InfiniteBench provides a benchmark for evaluating Large Language Models (LLMs) on their ability to process and reason over contexts exceeding 100,000 tokens. It targets researchers and developers working on long-context LLMs, offering a standardized way, across diverse tasks, to assess performance degradation attributable solely to context length.

How It Works

InfiniteBench comprises 12 distinct tasks, a mix of real-world and synthetic scenarios, designed to test different aspects of language processing over extended contexts. Each task is deliberately solvable at shorter lengths, so any drop in score can be attributed to the longer context rather than to the task itself. The datasets have average input token counts ranging from 43.9k to over 2 million, pushing the limits of current LLM capabilities.
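
To make the scale concrete, the length of a single example can be measured with a proxy tokenizer. The sketch below is illustrative only: it assumes the tiktoken library with its cl100k_base encoding (not necessarily the tokenizer any evaluated model uses), a locally downloaded data/longbook_qa_eng.jsonl file, and a "context" field in each record; adjust the path and field name to the actual task schema.

  # Rough token count for one example, under the assumptions stated above.
  import json
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer, not a model-specific one

  with open("data/longbook_qa_eng.jsonl", encoding="utf-8") as f:
      record = json.loads(f.readline())       # first example of the task

  print(len(enc.encode(record.get("context", ""))), "tokens of context")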

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Download dataset: bash scripts/download_dataset.sh or via Hugging Face (xinrongzhang2022/InfiniteBench); a minimal loading sketch follows this list.
  • Run evaluation: python eval_<model_name>.py --task <task_name> (e.g., python eval_yarn_mistral.py --task kv_retrieval).
  • Requires a Python environment and a locally downloaded copy of the dataset.
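
As a quick sanity check before running an evaluation script, a downloaded task file can be inspected directly. The sketch below is not part of the repo's tooling; it assumes the Hugging Face datasets library is installed, that the download script placed data/kv_retrieval.jsonl locally, and that each record has a "context" column.

  # Minimal sketch: load one downloaded task file as generic JSON and peek at it.
  # The file path and the "context" column are assumptions about the task schema.
  from datasets import load_dataset

  ds = load_dataset("json", data_files="data/kv_retrieval.jsonl", split="train")
  print(ds)                      # number of examples and column names
  print(len(ds[0]["context"]))   # size of one long context, in characters

If the file loads cleanly here, the dataset download is in place for the evaluation scripts.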

Highlighted Details

  • Evaluates LLMs on context lengths significantly beyond 100K tokens.
  • Includes 12 diverse tasks covering summarization, QA, code debugging, and mathematical reasoning.
  • Benchmarks performance of models like GPT-4, Kimi-Chat, Claude 2, and YaRN-Mistral-7B.
  • Provides detailed performance metrics for each task and model.

Maintenance & Community

The project is associated with OpenBMB and acknowledges several contributors. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository's code and data are released under a permissive license, allowing for research and commercial use. The paper is published by the Association for Computational Linguistics.

Limitations & Caveats

The benchmark highlights significant performance degradation in current state-of-the-art LLMs when processing contexts beyond 100K tokens, indicating a need for further advancements in long-context handling. Some models show near-zero performance on specific tasks with extended contexts.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Didier Lopes (Founder of OpenBB), and 1 more.

RULER by NVIDIA

1.1% · 1k stars
Evaluation suite for long-context language models (research paper)
created 1 year ago · updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 1 more.

yarn by jquesnelle

1.0% · 2k stars
Context window extension method for LLMs (research paper, models)
created 2 years ago · updated 1 year ago