Benchmark for LLM-based voice assistants
VoiceBench provides a comprehensive framework for evaluating LLM-based voice assistants across various capabilities like open-ended QA, reasoning, and instruction following. It targets researchers and developers in the speech and AI fields, offering a standardized benchmark and a curated leaderboard to track progress in the rapidly evolving domain of multimodal AI.
How It Works
VoiceBench takes a multi-faceted evaluation approach, combining human-recorded and synthetic (TTS-generated) speech datasets that cover diverse tasks, including AlpacaEval, CommonEval, WildVoice, SD-QA, MMSU, and BBH. For the open-ended conversational benchmarks (AlpacaEval, CommonEval, WildVoice, and SD-QA), responses are scored automatically with GPT-4o-mini acting as a judge; the remaining benchmarks use task-specific evaluators, such as multiple-choice accuracy for MMSU and answer-matching accuracy for the BBH reasoning tasks.
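The actual judging prompts live in the repository; as an illustration only, a minimal sketch of GPT-4o-mini-based response scoring via the OpenAI API could look like the following (the prompt wording and the 1-5 scale are assumptions, not the repository's exact evaluator):

```python
# Illustrative sketch of LLM-as-judge scoring with GPT-4o-mini.
# Prompt wording and scoring scale are assumptions, not VoiceBench's actual evaluator.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_response(question: str, response: str) -> str:
    """Ask GPT-4o-mini to rate an assistant's answer to a spoken question."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are grading a voice assistant's answer."},
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Answer: {response}\n"
                    "Rate the answer from 1 (poor) to 5 (excellent) and reply with the number only."
                ),
            },
        ],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```

Because this evaluator calls a hosted model, every benchmark run that uses it requires API access and incurs per-token costs (see Limitations & Caveats below).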
Quick Start & Requirements
conda create -n voicebench python=3.10
conda activate voicebench
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.23 --no-deps
pip install -r requirements.txt
The benchmark datasets are hosted on Hugging Face (hlt-lab/voicebench).
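For example, individual subsets can be pulled with the `datasets` library; the config name `alpacaeval` and split `test` below are assumptions, so check the dataset card for the configs that are actually published:

```python
# Hedged sketch: load one VoiceBench subset from Hugging Face.
# The config name "alpacaeval" and split "test" are assumptions; consult the
# dataset card at https://huggingface.co/datasets/hlt-lab/voicebench.
from datasets import load_dataset

subset = load_dataset("hlt-lab/voicebench", "alpacaeval", split="test")
print(subset)            # dataset size and feature schema
print(subset[0].keys())  # fields available in a single example
```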
Highlighted Details
Maintenance & Community
The project actively encourages community contributions to keep the leaderboard up to date; submissions are made through the repository's issue tracker.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. The underlying datasets are hosted on Hugging Face, and their specific licenses should be consulted.
Limitations & Caveats
The README does not specify the license for the VoiceBench code itself. Some datasets are generated using Google TTS, which might have implications for commercial use depending on Google's terms. The evaluation process for certain benchmarks relies on GPT-4o-mini, which requires API access and incurs costs.