SenseVoice by FunAudioLLM

Multilingual speech model for understanding voice

created 1 year ago
6,274 stars

Top 8.3% on sourcepulse

Project Summary

SenseVoice is a multilingual speech foundation model that performs Automatic Speech Recognition (ASR), Spoken Language Identification (LID), Speech Emotion Recognition (SER), and Audio Event Detection (AED). It targets developers and researchers who need accurate, low-latency speech processing across many languages, and it reports better recognition accuracy and faster inference than Whisper.
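
Because one model covers four tasks, its output has to carry a language label, an emotion label, and an audio event label alongside the transcript. The sketch below shows one way to split such a combined result. It assumes the labels arrive as leading `<|...|>` markers in the order language, emotion, event; that format, the sample string, and the helper name `split_tagged_output` are illustrative assumptions, not something this summary states.

```python
import re

# Assumed marker syntax: "<|LABEL|>" tokens placed before the transcript text.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_tagged_output(raw: str) -> dict:
    """Separate <|...|> markers from the transcript text (hypothetical helper)."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw).strip()
    # Assumed positional order of the first three markers.
    fields = ["language", "emotion", "event"]
    result = {
        name: (tags[i] if i < len(tags) else None)
        for i, name in enumerate(fields)
    }
    result["extra_tags"] = tags[len(fields):]  # anything beyond the first three
    result["text"] = text
    return result


# Hypothetical sample string, written by hand for illustration.
example = "<|en|><|NEUTRAL|><|Speech|><|withitn|>Hello, world."
parsed = split_tagged_output(example)
print(parsed)
```

If the real output uses a different marker order or syntax, only `TAG_PATTERN` and the `fields` list need to change.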

How It Works

SenseVoice uses a non-autoregressive, end-to-end framework: it predicts the whole output sequence in a single forward pass instead of generating it token by token. This keeps inference latency low and makes the model suitable for real-time applications. It is trained on more than 400,000 hours of multilingual data.
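
To make "non-autoregressive" concrete, here is a minimal sketch of greedy CTC-style decoding, one common non-autoregressive scheme. This summary does not say which decoder SenseVoice uses, so treat this as a generic illustration: the vocabulary, the blank index `BLANK`, and the toy scores are all made up. The key point is that every frame's label is chosen independently of the others, so no step has to wait for a previously generated token.

```python
from itertools import groupby

BLANK = 0  # hypothetical index reserved for the CTC blank symbol


def greedy_ctc_decode(frame_scores, vocab):
    """Decode per-frame scores without any token-by-token dependency."""
    # Step 1: pick the best label for each frame independently.
    best = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Step 2: collapse runs of the same label into one.
    collapsed = [label for label, _ in groupby(best)]
    # Step 3: drop blank symbols.
    return "".join(vocab[i] for i in collapsed if i != BLANK)


# Toy data: 5 frames, vocabulary of blank + two characters.
vocab = ["_", "h", "i"]
frame_scores = [
    [0.10, 0.80, 0.10],  # best label: "h"
    [0.20, 0.70, 0.10],  # best label: "h" (repeat, collapsed)
    [0.90, 0.05, 0.05],  # best label: blank
    [0.10, 0.10, 0.80],  # best label: "i"
    [0.10, 0.20, 0.70],  # best label: "i" (repeat, collapsed)
]
decoded = greedy_ctc_decode(frame_scores, vocab)
print(decoded)  # prints "hi"
```

An autoregressive decoder would instead loop once per output token, feeding each prediction back in, which is why its latency grows with transcript length.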

Quick Start & Requirements

Highlighted Details

  • Multilingual ASR: Supports over 50 languages, outperforming Whisper on benchmark datasets.
  • Advanced SER and AED: Achieves state-of-the-art results in emotion and audio event detection.
  • Efficient Inference: SenseVoice-Small processes 10 seconds of audio in 70 ms, 15x faster than Whisper-Large.
  • Exportable: Supports ONNX and Libtorch formats for broader deployment.
  • Finetuning: Provides scripts and strategies for custom model adaptation.
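
The inference figures above can be turned into a real-time factor (RTF, processing time divided by audio duration) with a few lines of arithmetic. The sketch assumes that "15x faster" means 15x lower latency on the same 10-second clip; the Whisper-Large latency it prints is derived from that assumption, not a measured value.

```python
# Figures quoted in the list above.
audio_seconds = 10.0
sensevoice_latency_s = 0.070       # 70 ms
speedup_vs_whisper_large = 15.0    # assumed to be a latency ratio

# Real-time factor: values below 1.0 mean faster than real time.
rtf = sensevoice_latency_s / audio_seconds

# Seconds of audio processed per second of compute.
throughput = audio_seconds / sensevoice_latency_s

# Latency implied for Whisper-Large under the stated assumption.
implied_whisper_latency_s = sensevoice_latency_s * speedup_vs_whisper_large

print(f"real-time factor: {rtf:.3f}")
print(f"audio seconds per compute second: {throughput:.1f}")
print(f"implied Whisper-Large latency: {implied_whisper_latency_s * 1000:.0f} ms")
```

Under these assumptions the RTF is 0.007, and the implied Whisper-Large latency for the same clip is about 1050 ms.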

Maintenance & Community

  • Active development with recent updates in July and November 2024.
  • Community support via GitHub Issues and DingTalk group.
  • Related projects include FunASR, CosyVoice, and SenseVoice.cpp.

Licensing & Compatibility

  • The README does not explicitly state a license; the project is associated with Alibaba's iic/FunAudioLLM. Verify the license terms before any commercial use.

Limitations & Caveats

  • Audio Event Detection lags behind specialized AED models because of limited training data.
  • Pseudo-streaming via streaming-sensevoice sacrifices some accuracy for lower latency.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 5
  • Star history: 800 stars in the last 90 days
