AudioBench by AudioLLMs

A universal benchmark for evaluating audio large language models

Created 1 year ago
258 stars

Top 98.2% on SourcePulse

Project Summary

AudioBench is a comprehensive benchmark suite designed to evaluate the performance of Audio Large Language Models (AudioLLMs) across a wide array of tasks. It serves researchers and developers working on models that process and understand audio data, providing a standardized framework for comparison and a live leaderboard for tracking progress. The project aims to accelerate the development of more capable and versatile audio-centric AI systems.

How It Works

AudioBench standardizes the evaluation of AudioLLMs by providing a unified interface to more than 50 datasets covering Automatic Speech Recognition (ASR), Speech Translation, Speech Question Answering, Speech Instruction Following, and audio understanding tasks such as emotion and accent recognition. It supports multiple evaluation metrics, from traditional ones such as WER and BLEU to model-as-judge metrics that use LLMs like GPT-4o and Llama 3 as evaluators. This combination allows a more holistic assessment of model capabilities than accuracy alone.
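
To make the traditional metric family concrete, here is a minimal sketch that computes WER and BLEU on a toy string pair with the jiwer and sacrebleu packages; it illustrates the metric types only and is not AudioBench's own scoring code.

    # Toy illustration of the string-matching metrics AudioBench reports
    # (WER for ASR, BLEU for speech translation). Not the repository's scoring code.
    from jiwer import wer
    import sacrebleu

    reference  = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"

    print("WER :", wer(reference, hypothesis))                              # fraction of word-level edits
    print("BLEU:", sacrebleu.sentence_bleu(hypothesis, [reference]).score)  # n-gram overlap, 0-100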

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: Model-as-judge evaluations require a vLLM server on one 80 GB GPU to host the judge model (e.g., Llama-3-70B-Instruct); a second GPU is needed to run inference for the models under evaluation. A sketch of calling such a judge endpoint follows this list.
  • Resources: Model-as-judge evaluation therefore requires significant GPU resources.
  • Links: Hugging Face Space Leaderboard, Hugging Face Datasets, AudioLLM Paper Collection
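
As referenced above, the following sketch shows one way to query a locally hosted judge model through vLLM's OpenAI-compatible endpoint; the port, judge model name, prompt wording, and 0-5 scale are assumptions about a typical deployment, not AudioBench's exact configuration.

    # Hedged sketch: querying a vLLM-hosted judge model via its OpenAI-compatible API.
    # Port, model name, prompt, and the 0-5 scale are assumptions, not AudioBench's exact setup.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not validate the key

    def judge(question: str, reference: str, model_answer: str) -> str:
        """Ask the judge LLM to rate a model answer against the reference."""
        prompt = (
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Model answer: {model_answer}\n"
            "Rate the model answer from 0 to 5 and justify the score in one sentence."
        )
        resp = client.chat.completions.create(
            model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed judge model
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return resp.choices[0].message.content

    print(judge("What emotion does the speaker convey?", "anger", "The speaker sounds angry."))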

Highlighted Details

  • Supports over 50 datasets, including recent additions like MMAU and SEAME for code-switching evaluation.
  • Includes support for multiple languages and accents, with recent additions for Thai, Vietnamese, and Indonesian ASR.
  • Features a live leaderboard on Huggingface Spaces for tracking model performance.
  • Accommodates custom dataset loaders and new model integrations; a hypothetical loader sketch follows this list.
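
For the custom dataset loaders mentioned above, the sketch below shows the rough shape such a loader could take using the Hugging Face datasets library; the dataset name and the "audio"/"text" column names are placeholders, and AudioBench's actual loader interface may differ.

    # Hypothetical custom ASR dataset loader built on Hugging Face datasets.
    # The dataset name and the "audio"/"text" column names are placeholders;
    # AudioBench's real loader interface may differ.
    from datasets import Audio, load_dataset

    def load_custom_asr_dataset(name: str = "your-org/your-asr-dataset", split: str = "test"):
        """Yield dicts with a decoded waveform, its sampling rate, and the reference transcript."""
        ds = load_dataset(name, split=split)
        ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # decode and resample to 16 kHz
        for row in ds:
            yield {
                "array": row["audio"]["array"],
                "sampling_rate": row["audio"]["sampling_rate"],
                "reference": row["text"],
            }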

Maintenance & Community

The project is actively maintained, with frequent updates to supported datasets and models. The paper has been accepted to NAACL 2025. Model submissions can be made via email to bwang28c@gmail.com.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.

Limitations & Caveats

Some models, like WavLLM, are noted as no longer supported due to inference complexity. Several advanced models and benchmarks (e.g., ultravox, GLM4-Voice, AIR-Bench) are listed as "To-Do" or not yet supported, indicating ongoing development. The README does not specify the license, which could be a barrier for some users.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 30 days
