SenseVoice by FunAudioLLM

Multilingual speech model for understanding voice

Created 1 year ago

7,553 stars

Top 6.8% on SourcePulse

1 Expert Loves This Project

osanseviero

Omar Sanseviero

DevRel at Google DeepMind

Project Summary

SenseVoice is a multilingual speech foundation model offering Automatic Speech Recognition (ASR), Spoken Language Identification (LID), Speech Emotion Recognition (SER), and Audio Event Detection (AED). It targets developers and researchers who need high-accuracy, low-latency speech processing across multiple languages, and it reports better benchmark accuracy and faster inference than Whisper.

How It Works

SenseVoice employs a non-autoregressive end-to-end framework for efficient inference. It is trained on over 400,000 hours of multilingual data, enabling robust performance across its diverse speech understanding capabilities. The model architecture is designed for low latency, making it suitable for real-time applications.
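
To make "non-autoregressive" concrete, here is a minimal sketch of one common non-autoregressive decoding scheme, CTC greedy decoding. The summary above does not say which scheme SenseVoice uses, so treat the choice of CTC, the blank index, and the function name `ctc_greedy_decode` as assumptions made for illustration. The point is that every frame picks its token independently in a single pass, rather than generating tokens one at a time conditioned on earlier outputs.

```python
from typing import List, Sequence

BLANK = 0  # index of the CTC blank symbol (assumed convention, not from the source)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], blank: int = BLANK) -> List[int]:
    """Decode all frames in one pass: argmax per frame, merge repeats, drop blanks."""
    # Step 1: each frame picks its best token independently of every other frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]

    # Step 2: collapse consecutive duplicates and remove blank symbols.
    decoded: List[int] = []
    prev = None
    for token in best:
        if token != prev and token != blank:
            decoded.append(token)
        prev = token
    return decoded


# Five frames over a toy vocabulary of three symbols (0 is the blank).
frames = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
tokens = ctc_greedy_decode(frames)
```

Because step 1 has no dependency between frames, it can be computed for the whole utterance at once, which is where the latency advantage over token-by-token decoding comes from.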

Quick Start & Requirements

Install via pip: pip install -r requirements.txt
Requires Python and potentially CUDA for GPU acceleration.
Official Docs: Install, Usage
Model Zoo: ModelScope, Hugging Face
Online Demos: ModelScope, Hugging Face
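
After installation, a first transcription might look like the sketch below. It assumes the model is loaded through the related FunASR toolkit's `AutoModel` class with the model ID `iic/SenseVoiceSmall`, and that the raw output carries its language, emotion, and event labels as `<|...|>` tags in front of the text. The exact arguments and tag names can differ between releases, and the helper names `split_tags` and `transcribe` are hypothetical, so check the official usage docs before relying on any of this.

```python
import re
from typing import Dict, List, Union

# Matches tags of the assumed form <|label|>.
TAG_PATTERN = re.compile(r"<\|([^|]+)\|>")


def split_tags(raw: str) -> Dict[str, Union[List[str], str]]:
    """Separate <|...|> tags from the transcript text."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw).strip()
    return {"tags": tags, "text": text}


def transcribe(audio_path: str, device: str = "cpu") -> Dict[str, Union[List[str], str]]:
    """Run SenseVoice-Small on one file (requires FunASR and a model download)."""
    # Imported lazily so split_tags can be used without FunASR installed.
    from funasr import AutoModel

    model = AutoModel(model="iic/SenseVoiceSmall", device=device)
    result = model.generate(input=audio_path, cache={}, language="auto", use_itn=True)
    return split_tags(result[0]["text"])


# Illustrative raw string only; the tag names here are assumptions.
example = split_tags("<|en|><|NEUTRAL|><|Speech|><|withitn|>Hello world.")
```

Pass `device="cuda:0"` to `transcribe` to run on a GPU if CUDA is available.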

Highlighted Details

Multilingual ASR: Supports over 50 languages, outperforming Whisper on benchmark datasets.
Advanced SER and AED: Achieves state-of-the-art results in emotion and audio event detection.
Efficient Inference: SenseVoice-Small processes 10 seconds of audio in 70ms, 15x faster than Whisper-Large.
Exportable: Supports ONNX and Libtorch formats for broader deployment.
Finetuning: Provides scripts and strategies for custom model adaptation.
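
The inference figures above can be restated as a real-time factor (processing time divided by audio duration). The short calculation below is only arithmetic on the quoted numbers. It assumes the 15x comparison refers to the same 10-second clip, which the summary does not state explicitly.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time per second of audio; values below 1 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Quoted figure: 10 seconds of audio processed in 70 ms.
rtf = real_time_factor(0.070, 10.0)      # about 0.007
times_faster_than_real_time = 1.0 / rtf  # about 143

# Quoted ratio: 15x faster than Whisper-Large (assumed to be on the same clip).
implied_whisper_large_seconds = 0.070 * 15  # about 1.05 s

print(f"RTF: {rtf:.3f}")
print(f"Speed relative to real time: {times_faster_than_real_time:.0f}x")
print(f"Implied Whisper-Large latency: {implied_whisper_large_seconds:.2f} s")
```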

Maintenance & Community

Active development with recent updates in July and November 2024.
Community support via GitHub Issues and DingTalk group.
Related projects include FunASR, CosyVoice, and SenseVoice.cpp.

Licensing & Compatibility

The README does not explicitly state a license. The project is associated with Alibaba's iic/FunAudioLLM, so verify the license terms before any commercial use.

Limitations & Caveats

Audio Event Detection performance lags behind specialized AED models because of training data limitations.
Pseudo-streaming via streaming-sensevoice sacrifices some accuracy for lower latency.
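
The accuracy and latency trade-off of chunked processing can be illustrated with a generic sketch. This is not the streaming-sensevoice implementation: the function name `chunk_spans`, its parameters, and the fixed-size chunking scheme are all assumptions chosen for illustration. Each chunk is decoded with only a limited window of past audio and no future audio, which bounds latency by the chunk length but gives the model less context than a full-utterance pass.

```python
from typing import Iterator, Tuple


def chunk_spans(num_samples: int, chunk: int, left_context: int = 0) -> Iterator[Tuple[int, int, int]]:
    """Yield (context_start, chunk_start, chunk_end) sample indices for each chunk."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    if left_context < 0:
        raise ValueError("left_context must be non-negative")
    for start in range(0, num_samples, chunk):
        end = min(start + chunk, num_samples)
        # The model would see samples [context_start, end) but emit output only for [start, end).
        yield max(0, start - left_context), start, end


# Toy example: 10 samples, chunks of 4, with 2 samples of left context.
spans = list(chunk_spans(10, 4, left_context=2))
```

A larger `left_context` gives each chunk more history to work with at the cost of more computation per chunk, while a smaller `chunk` lowers latency but leaves less audio per decoding step.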

Health Check

Last Commit

1 month ago

Responsiveness

1 week

Pull Requests (30d)

1

Issues (30d)

0

Star History

141 stars in the last 30 days

Explore Similar Projects

praises by ElmTran

Text-to-speech tool for easy reading

Created 1 year ago

Updated 7 months ago

speech-recognition-uk by egorsmkv

Resource collection for Ukrainian speech AI

Created 5 years ago

Updated 5 months ago

Voila by maitrix-org

Voice-language foundation models for real-time human-AI interaction

Created 11 months ago

Updated 9 months ago

Starred by Travis Fischer (Founder of Agentic).

ollama-voice-mac by apeatling

Offline voice assistant for macOS

Created 2 years ago

Updated 6 months ago

vits-simple-api by Artrajz

HTTP API for VITS-based text-to-speech and voice conversion

Created 3 years ago

Updated 4 months ago

Qwen3-ASR by QwenLM

Advanced multilingual speech recognition and alignment

Created 4 weeks ago

Updated 3 weeks ago

Starred by Dan Guido (Cofounder of Trail of Bits), Luis Capelo (Cofounder of Lightning AI), and 2 more.

ichigo by janhq

Speech package for local, real-time voice AI development

Created 1 year ago

Updated 3 months ago

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen).

Qwen2-Audio by QwenLM

Audio-language model for audio analysis and voice chat

Created 1 year ago

Updated 10 months ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michael Han (Cofounder of Unsloth), and 1 more.

Orpheus-TTS by canopyai

Open-source TTS for human-sounding speech, built on Llama-3b

Created 11 months ago

Updated 2 months ago

sherpa-onnx by k2-fsa

Speech toolkit for local, offline speech AI tasks via ONNX

Created 3 years ago

Updated 1 day ago

FunASR by modelscope

Speech recognition toolkit for bridging research and industrial applications

Created 3 years ago

Updated 3 weeks ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Piotr Dąbkowski (Cofounder of ElevenLabs), and 2 more.

PaddleSpeech by PaddlePaddle

Speech toolkit for ASR, TTS, speaker verification, translation, and keyword spotting

Created 8 years ago

Updated 2 weeks ago
