OSUM by ASLP-lab

Open Speech Understanding Model research paper

Created 8 months ago
423 stars

Top 69.6% on SourcePulse

View on GitHub
Project Summary

OSUM (Open Speech Understanding Model) addresses the gap in advanced speech understanding models for academia, where researchers often lack the extensive resources available to industry. It provides a transparent, multi-task framework that researchers can build on and extend, enabling comprehensive speech-based interaction.

How It Works

OSUM combines a Whisper encoder with a Qwen2 LLM and is trained with an "ASR+X" strategy: Automatic Speech Recognition (ASR) is optimized jointly with a secondary target task X, such as vocal event detection, emotion recognition, or speaker classification. This multi-task setup yields stable training and strong performance even with limited academic resources.
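A minimal sketch of the implied data flow is below: features from a Whisper-style encoder are downsampled and projected into the LLM's embedding space before being consumed by a Qwen2-style decoder. The module name, dimensions, and downsampling stride are illustrative assumptions, not the repository's actual code.

```python
# Sketch of an OSUM-style speech-to-LLM pipeline (assumed, not the repo's code).
import torch
import torch.nn as nn

class SpeechToLLMAdapter(nn.Module):
    """Downsamples encoder frames and projects them into the LLM embedding space."""
    def __init__(self, enc_dim=1280, llm_dim=3584, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, llm_dim, kernel_size=stride, stride=stride)

    def forward(self, enc_states):                 # (B, T, enc_dim)
        x = self.conv(enc_states.transpose(1, 2))  # (B, llm_dim, T // stride)
        return x.transpose(1, 2)                   # (B, T // stride, llm_dim)

# In the real model, `encoder` is a Whisper encoder and the downstream LLM is Qwen2;
# here an identity stand-in just shows how adapter outputs would feed the LLM.
encoder = nn.Identity()
adapter = SpeechToLLMAdapter()
speech_feats = torch.randn(1, 100, 1280)        # pretend Whisper encoder output
llm_inputs = adapter(encoder(speech_feats))     # speech embeddings for the LLM
print(llm_inputs.shape)                         # torch.Size([1, 25, 3584])
```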

Quick Start & Requirements

  • Install dependencies with pip install -r requirements.txt.
  • Refer to the official documentation for inference and training instructions.

Highlighted Details

  • Supports a wide range of speech tasks: ASR, SRWT (speech recognition with timestamps), VED (vocal event detection), SER (speech emotion recognition), SSR (speaking style recognition), SGC (speaker gender classification), SAP (speaker age prediction), and STTC (speech-to-text chat); a hedged sketch of the ASR+X target format follows this list.
  • Achieves competitive performance against models like Qwen2-Audio with fewer resources.
  • Technical report v2.0 details increased training data (50.5K hours) and model improvements.
  • Offers an online test page for immediate evaluation.
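
The "ASR+X" strategy pairs each utterance's transcript with the label of a secondary task X in a single target sequence. The sketch below shows one way such paired targets could be assembled; the task tags and concatenation format are assumptions for illustration, not the repository's actual prompt or token scheme.

```python
# Hedged sketch of "ASR+X" supervision targets: each target pairs the ASR
# transcript with the label of a secondary task X. Tag names are hypothetical.
TASK_TAGS = {
    "ASR": "<TRANSCRIBE>",  # speech recognition only
    "SER": "<EMOTION>",     # speech emotion recognition
    "VED": "<EVENT>",       # vocal event detection
    "SGC": "<GENDER>",      # speaker gender classification
    "SAP": "<AGE>",         # speaker age prediction
}

def build_asr_x_target(transcript, task, label=None):
    """Concatenate the transcript with the secondary task's label."""
    tag = TASK_TAGS[task]
    if task == "ASR" or label is None:
        return f"{tag} {transcript}"
    return f"{tag} {transcript} {label}"

# Example: emotion recognition supervised jointly with ASR.
print(build_asr_x_target("I can't believe we won the game", "SER", "happy"))
# -> "<EMOTION> I can't believe we won the game happy"
```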

Maintenance & Community

The project is open-sourced by ASLP@NPU. Contact xlgeng@mail.nwpu.edu.cn for inquiries.

Licensing & Compatibility

Licensed under Apache 2.0, permitting free use for research and commercial purposes.

Limitations & Caveats

The project is presented as a technical report (v2.0) with released checkpoints, indicating an active, evolving research effort. Performance on diverse real-world scenarios beyond the benchmarks covered in the report is not detailed.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 6
Star History
41 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Jinze Bai (Research Scientist at Alibaba Qwen), and 1 more.

Qwen-Audio by QwenLM

Top 0.4% on SourcePulse · 2k stars
Audio-language model for audio understanding and chat
Created 1 year ago
Updated 1 year ago