Benchmark for evaluating language models on super-long contexts (100k+ tokens)
InfiniteBench provides a benchmark for evaluating Large Language Models (LLMs) on their ability to process and reason over contexts exceeding 100,000 tokens. It targets researchers and developers working on long-context LLMs, offering a standardized method to assess performance degradation solely due to context length across diverse tasks.
How It Works
InfiniteBench comprises 12 distinct tasks, a mix of real-world and synthetic scenarios, designed to test different aspects of language processing in extended contexts. Tasks are deliberately chosen to be solvable at shorter lengths, so that any performance drop can be attributed to the increased context rather than task difficulty. The underlying datasets have average input lengths ranging from 43.9k to over 2 million tokens, pushing the limits of current LLM capabilities.
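To make the design concrete, the sketch below shows the flavor of a synthetic retrieval probe in the spirit of the kv_retrieval task: the question stays trivially answerable while only the amount of distractor context grows. This is an illustration of the idea, not the benchmark's actual data-generation code, and the field names are assumptions.

    import json
    import random
    import uuid

    def make_kv_probe(num_pairs: int) -> dict:
        """Build a kv_retrieval-style sample: a JSON dict of random key-value
        pairs plus one query key. The question is identical regardless of size;
        only num_pairs (and thus context length) changes."""
        kv = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(num_pairs)}
        query_key = random.choice(list(kv))
        return {
            "context": json.dumps(kv),                        # grows with num_pairs
            "input": f"What is the value of key {query_key}?",
            "answer": kv[query_key],
        }

    # Same question difficulty, very different context sizes.
    short = make_kv_probe(50)       # a few thousand tokens
    long = make_kv_probe(50_000)    # well past 100k tokens
    print(len(short["context"]), len(long["context"]))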
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
Download the data: bash scripts/download_dataset.sh, or fetch it from Hugging Face (xinrongzhang2022/InfiniteBench).
Run an evaluation: python eval_<model_name>.py --task <task_name>, e.g. python eval_yarn_mistral.py --task kv_retrieval.
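If you prefer pulling the data programmatically, a minimal sketch using the Hugging Face datasets library is below. The Features schema and the assumption that each task appears as its own split are based on the per-task JSONL layout and may need adjusting; check the dataset card for the exact structure.

    from datasets import Features, Sequence, Value, load_dataset

    # Assumed field schema for the per-task JSONL files.
    features = Features({
        "id": Value("int64"),
        "context": Value("string"),           # long document (often 100k+ tokens)
        "input": Value("string"),             # question / retrieval probe
        "answer": Sequence(Value("string")),  # reference answer(s)
        "options": Sequence(Value("string")), # multiple-choice options, if any
    })

    dataset = load_dataset("xinrongzhang2022/InfiniteBench", features=features)

    # Inspect one example from the key-value retrieval task (assumed split name).
    sample = dataset["kv_retrieval"][0]
    print(len(sample["context"]), "characters of context")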
Highlighted Details
Maintenance & Community
The project is associated with OpenBMB and acknowledges several contributors. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository's code and data are released under a permissive license, allowing for research and commercial use. The paper is published by the Association for Computational Linguistics.
Limitations & Caveats
The benchmark highlights significant performance degradation in current state-of-the-art LLMs when processing contexts beyond 100K tokens, indicating a need for further advancements in long-context handling. Some models show near-zero performance on specific tasks with extended contexts.