Open Speech Understanding Model research paper
OSUM (Open Speech Understanding Model) addresses the shortage of advanced speech understanding models available to academic researchers, who often lack the extensive resources of industry. It provides a transparent, multi-task framework that researchers can build and innovate upon, enabling comprehensive speech-based interactions.
How It Works
OSUM combines a Whisper encoder with a Qwen2 LLM, leveraging an "ASR+X" training strategy. This approach efficiently optimizes Automatic Speech Recognition (ASR) alongside other speech tasks like vocal event detection, emotion recognition, and speaker classification. This multi-task optimization allows for stable training and strong performance even with limited academic resources.
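To make the "ASR+X" idea concrete, the sketch below shows one plausible way to build a combined training target in which the transcript comes first and the label for the extra task (the "X") follows it. This is an illustration under stated assumptions, not OSUM's actual data format: the tag strings, the `Example` class, and the `build_asr_x_target` function are all hypothetical names introduced here.

```python
from dataclasses import dataclass

# Hypothetical task tags (assumption): the real label vocabulary and
# prompt format used by OSUM are not described in this summary.
TASK_TAGS = {
    "emotion": "<EMOTION>",
    "event": "<EVENT>",
    "speaker": "<SPEAKER>",
}


@dataclass
class Example:
    transcript: str  # ASR ground-truth text
    task: str        # which "X" task this example carries
    label: str       # label for the "X" task


def build_asr_x_target(ex: Example) -> str:
    """Build one target string: transcript first, then task tag and label."""
    if ex.task not in TASK_TAGS:
        raise ValueError(f"unknown task: {ex.task}")
    return f"{ex.transcript.strip()} {TASK_TAGS[ex.task]} {ex.label}"


if __name__ == "__main__":
    ex = Example("i am so happy today", "emotion", "happy")
    print(build_asr_x_target(ex))
```

With a target of this shape, a single sequence loss covers both the transcription and the auxiliary label, which is one way the ASR objective and the extra task could be optimized together.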
Quick Start & Requirements
pip install -r requirements.txt
Highlighted Details
Maintenance & Community
The project is open-sourced by ASLP@NPU. Contact xlgeng@mail.nwpu.edu.cn for inquiries.
Licensing & Compatibility
Licensed under Apache 2.0, permitting free use for research and commercial purposes.
Limitations & Caveats
The project is distributed as a technical report (v2.0) with released checkpoints, which suggests it is still an active research effort. Performance in diverse real-world scenarios beyond those covered in the report is not detailed.