SenseVoice by FunAudioLLM

Multilingual speech model for understanding voice

Created 1 year ago
7,335 stars

Top 7.0% on SourcePulse

Project Summary

SenseVoice is a multilingual speech foundation model offering Automatic Speech Recognition (ASR), Spoken Language Identification (LID), Speech Emotion Recognition (SER), and Audio Event Detection (AED). It targets developers and researchers who need high-accuracy, low-latency speech processing across multiple languages, and it reports better ASR benchmark results and faster inference than Whisper.
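To illustrate how the four capabilities could be consumed together, here is a minimal sketch that splits one tagged result string into language, emotion, event, and transcript fields. The tag layout `<|language|><|EMOTION|><|Event|>text`, the `SpeechResult` type, and the `parse_tagged` helper are assumptions made for illustration; check the project README for the actual output format.

```python
import re
from dataclasses import dataclass

# Hypothetical tag layout, assumed for illustration only:
#   <|language|><|EMOTION|><|Event|>transcript text
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


@dataclass
class SpeechResult:
    language: str  # LID output
    emotion: str   # SER output
    event: str     # AED output
    text: str      # ASR transcript


def parse_tagged(raw: str) -> SpeechResult:
    """Split a tagged result string into its four fields."""
    tags = TAG_PATTERN.findall(raw)
    if len(tags) < 3:
        raise ValueError("expected at least three leading tags")
    # Remove every tag and keep the remaining transcript text.
    text = TAG_PATTERN.sub("", raw).strip()
    return SpeechResult(tags[0], tags[1], tags[2], text)


result = parse_tagged("<|en|><|HAPPY|><|Speech|>hello world")
print(result)
```

Keeping the parsed fields in a small typed record makes it easier to route results, for example filtering by detected language before storing the transcript.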

How It Works

SenseVoice uses a non-autoregressive, end-to-end framework: output tokens are predicted in parallel rather than one at a time, each conditioned on the previous ones. This keeps inference latency low and makes the model suitable for real-time applications. It is trained on over 400,000 hours of multilingual data, which underpins its ASR, LID, SER, and AED capabilities.
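The summary does not say which non-autoregressive scheme SenseVoice uses. As a general illustration of the idea, the sketch below uses CTC greedy decoding, a well-known non-autoregressive technique swapped in here as an assumption: every frame is labeled independently, then repeats are collapsed and blanks dropped. The vocabulary, the scores, and the `ctc_greedy_decode` helper are made up for the example.

```python
from typing import List, Sequence

BLANK = 0  # index of the CTC blank symbol (assumed convention)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      vocab: List[str]) -> str:
    """Greedy CTC decoding: no step depends on a previously emitted token."""
    # Step 1: pick the best-scoring symbol for each frame independently.
    best = [max(range(len(frame)), key=lambda i: frame[i])
            for frame in frame_scores]
    # Step 2: collapse consecutive repeats, then drop blanks.
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


# Toy example: 5 frames over a 3-symbol vocabulary ("_" is the blank).
vocab = ["_", "h", "i"]
scores = [
    [0.10, 0.80, 0.10],  # h
    [0.10, 0.70, 0.20],  # h (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # i
    [0.20, 0.10, 0.70],  # i (repeat, collapsed)
]
decoded = ctc_greedy_decode(scores, vocab)
print(decoded)
```

Because step 1 treats frames independently, it can run as a single batched operation, which is where the latency advantage over token-by-token decoding comes from.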

Quick Start & Requirements

Highlighted Details

  • Multilingual ASR: Supports over 50 languages, outperforming Whisper on benchmark datasets.
  • Advanced SER and AED: Achieves state-of-the-art results in emotion and audio event detection.
  • Efficient Inference: SenseVoice-Small processes 10 seconds of audio in 70 ms, 15x faster than Whisper-Large.
  • Exportable: Supports ONNX and Libtorch formats for broader deployment.
  • Finetuning: Provides scripts and strategies for custom model adaptation.
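The efficiency figures above can be restated as a real-time factor (RTF, processing time divided by audio duration). The short calculation below is only arithmetic on the numbers quoted in the list; it assumes "15x faster" means Whisper-Large takes 15 times as long on the same 10-second clip, which the summary does not spell out.

```python
# Figures quoted in the summary.
audio_seconds = 10.0
sensevoice_latency_s = 0.070      # 70 ms for SenseVoice-Small
speedup_vs_whisper_large = 15.0   # "15x faster than Whisper-Large"

# Real-time factor: processing time per second of audio (lower is better).
rtf = sensevoice_latency_s / audio_seconds

# Seconds of audio processed per second of compute.
throughput = audio_seconds / sensevoice_latency_s

# Assumption: "15x faster" means 15x the latency on the same clip.
implied_whisper_latency_s = sensevoice_latency_s * speedup_vs_whisper_large

print(f"RTF: {rtf:.3f}")
print(f"Throughput: {throughput:.1f}x real time")
print(f"Implied Whisper-Large latency: {implied_whisper_latency_s:.2f} s")
```

An RTF well below 1 means the model finishes faster than the audio plays, which is the basic requirement for the real-time use described earlier.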

Maintenance & Community

  • Active development with recent updates in July and November 2024.
  • Community support via GitHub Issues and a DingTalk group.
  • Related projects include FunASR, CosyVoice, and SenseVoice.cpp.

Licensing & Compatibility

  • The README does not explicitly state a license; the project is associated with Alibaba's iic/FunAudioLLM. Verify the terms before commercial use.

Limitations & Caveats

  • Audio Event Detection performance trails specialized AED models because of training data limitations.
  • Pseudo-streaming via streaming-sensevoice sacrifices some accuracy for lower latency.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull requests (30d): 3
  • Issues (30d): 2
  • Star history: 202 stars in the last 30 days
