Benchmark for LLM-based voice assistants
VoiceBench provides a comprehensive framework for evaluating LLM-based voice assistants across various capabilities like open-ended QA, reasoning, and instruction following. It targets researchers and developers in the speech and AI fields, offering a standardized benchmark and a curated leaderboard to track progress in the rapidly evolving domain of multimodal AI.
How It Works
VoiceBench takes a multi-faceted evaluation approach, combining human-recorded and synthetic (TTS-generated) speech datasets that cover diverse tasks, including AlpacaEval, CommonEval, WildVoice, SD-QA, MMSU, and BBH. For the open-ended conversational benchmarks (AlpacaEval, CommonEval, WildVoice, and SD-QA), responses are scored automatically with GPT-4o-mini acting as a judge; the remaining benchmarks use task-specific evaluators, such as multiple-choice accuracy for MMSU and answer-matching accuracy for the BBH reasoning tasks.
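The actual judging prompts live in the repository; as an illustration only, a minimal sketch of GPT-4o-mini-based response scoring via the OpenAI API could look like the following (the prompt wording and the 1-5 scale are assumptions, not the repository's exact evaluator):

```python
# Illustrative sketch of LLM-as-judge scoring with GPT-4o-mini.
# Prompt wording and scoring scale are assumptions, not VoiceBench's actual evaluator.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_response(question: str, response: str) -> str:
    """Ask GPT-4o-mini to rate an assistant's answer to a spoken question."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are grading a voice assistant's answer."},
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Answer: {response}\n"
                    "Rate the answer from 1 (poor) to 5 (excellent) and reply with the number only."
                ),
            },
        ],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```

Because this evaluator calls a hosted model, every benchmark run that uses it requires API access and incurs per-token costs (see Limitations & Caveats below).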
Quick Start & Requirements
conda create -n voicebench python=3.10
conda activate voicebench
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.23 --no-deps
pip install -r requirements.txt
The benchmark datasets are hosted on Hugging Face (hlt-lab/voicebench).
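For example, individual subsets can be pulled with the `datasets` library; the config name `alpacaeval` and split `test` below are assumptions, so check the dataset card for the configs that are actually published:

```python
# Hedged sketch: load one VoiceBench subset from Hugging Face.
# The config name "alpacaeval" and split "test" are assumptions; consult the
# dataset card at https://huggingface.co/datasets/hlt-lab/voicebench.
from datasets import load_dataset

subset = load_dataset("hlt-lab/voicebench", "alpacaeval", split="test")
print(subset)            # dataset size and feature schema
print(subset[0].keys())  # fields available in a single example
```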
Highlighted Details
Maintenance & Community
The project actively encourages community contributions to keep the leaderboard up to date; submissions are made through the repository's issue tracker.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. The underlying datasets are hosted on Hugging Face, and their specific licenses should be consulted.
Limitations & Caveats
The README does not specify the license for the VoiceBench code itself. Some datasets are generated using Google TTS, which might have implications for commercial use depending on Google's terms. The evaluation process for certain benchmarks relies on GPT-4o-mini, which requires API access and incurs costs.