VoiceBench by MatthewCYM

Benchmark for LLM-based voice assistants

Created 9 months ago · 259 stars · Top 98.4% on SourcePulse

Project Summary

VoiceBench provides a comprehensive framework for evaluating LLM-based voice assistants across various capabilities like open-ended QA, reasoning, and instruction following. It targets researchers and developers in the speech and AI fields, offering a standardized benchmark and a curated leaderboard to track progress in the rapidly evolving domain of multimodal AI.

How It Works

VoiceBench takes a multi-faceted evaluation approach. It combines human-recorded datasets (e.g., WildVoice, BBH) with synthetic datasets (e.g., AlpacaEval, MMSU) covering diverse tasks. For the open-ended conversational benchmarks — AlpacaEval, CommonEval, WildVoice, and SD-QA — it uses GPT-4o-mini as an automated judge of response quality. The remaining benchmarks use task-specific scoring, such as multiple-choice accuracy for MMSU and OBQA and answer matching for the BBH reasoning tasks.
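The LLM-as-judge step can be sketched roughly as follows. This is a hedged illustration, not VoiceBench's actual evaluator code: the prompt wording and the `Rating: <n>` reply format are assumptions made here so that the judge's score can be parsed reliably.

```python
import re

# Hypothetical judging prompt; VoiceBench's real template may differ.
JUDGE_PROMPT = (
    "You are grading a voice assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer from 1 (poor) to 5 (excellent). "
    "Reply on the last line as 'Rating: <n>'."
)

def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the template that would be sent to GPT-4o-mini."""
    return JUDGE_PROMPT.format(question=question, answer=answer)

def parse_rating(judge_reply: str):
    """Extract the 1-5 rating from the judge's reply, or None if absent."""
    m = re.search(r"[Rr]ating:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None
```

In practice the per-sample ratings would be averaged over a benchmark to produce the leaderboard score.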

Quick Start & Requirements

  • Install:
    conda create -n voicebench python=3.10
    conda activate voicebench
    pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
    pip install xformers==0.0.23 --no-deps
    pip install -r requirements.txt
    
  • Prerequisites: Python 3.10, PyTorch 2.1.2 with CUDA 12.1 support, xformers.
  • Dataset: Available on Hugging Face (hlt-lab/voicebench).
  • Docs: VoiceBench Leaderboard
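Once the environment is set up, a subset of the benchmark can be pulled directly from the Hugging Face Hub. This is a minimal sketch: the subset name ("alpacaeval") and split ("test") are assumptions taken from the dataset card, and it requires the `datasets` package plus network access.

```python
# Sketch of loading one VoiceBench subset from the Hugging Face Hub.
DATASET_REPO = "hlt-lab/voicebench"

def load_voicebench(subset: str = "alpacaeval", split: str = "test"):
    """Return one VoiceBench subset; rows pair audio clips with text prompts."""
    from datasets import load_dataset  # imported lazily: optional dependency
    return load_dataset(DATASET_REPO, subset, split=split)

if __name__ == "__main__":
    ds = load_voicebench()
    print(ds.column_names)  # inspect the available fields, e.g. the audio column
```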

Highlighted Details

  • Supports evaluation across 10 distinct benchmarks including AlpacaEval, CommonEval, WildVoice, SD-QA, MMSU, OBQA, BBH, IFEval, AdvBench.
  • Includes a curated list of "Awesome Voice Assistants" with links to their technical reports and code.
  • The leaderboard is actively updated with new model submissions via the issue tracker.
  • Datasets include human-recorded speech for naturalness and diverse accents.

Maintenance & Community

The project actively encourages community contributions for updating the leaderboard. The primary interaction point for submissions is the issue tracker.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. The underlying datasets are hosted on Hugging Face, and their specific licenses should be consulted.

Limitations & Caveats

Some datasets are generated with Google TTS, which may have implications for commercial use under Google's terms. Evaluation of the conversational benchmarks relies on GPT-4o-mini, which requires OpenAI API access and incurs per-call costs.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 76 stars in the last 90 days
