OSUM by ASLP-lab

Open Speech Understanding Model research paper

created 6 months ago
378 stars

Top 76.3% on sourcepulse

View on GitHub
Project Summary

OSUM (Open Speech Understanding Model) addresses academia's lack of the extensive resources that industry enjoys for building advanced speech understanding models. It provides a transparent, multi-task framework for researchers to build on and innovate with, enabling comprehensive speech-based interactions.

How It Works

OSUM combines a Whisper encoder with a Qwen2 LLM, leveraging an "ASR+X" training strategy. This approach efficiently optimizes Automatic Speech Recognition (ASR) alongside other speech tasks like vocal event detection, emotion recognition, and speaker classification. This multi-task optimization allows for stable training and strong performance even with limited academic resources.

Quick Start & Requirements

  • Install dependencies via pip install -r requirements.txt.
  • Refer to the official documentation for inference and training instructions.

Highlighted Details

  • Supports a wide range of speech tasks: ASR, speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC).
  • Achieves competitive performance against models like Qwen2-Audio with fewer resources.
  • Technical report v2.0 details increased training data (50.5K hours) and model improvements.
  • Offers an online test page for immediate evaluation.

Maintenance & Community

The project is open-sourced by ASLP@NPU. Contact xlgeng@mail.nwpu.edu.cn for inquiries.

Licensing & Compatibility

Licensed under Apache 2.0, permitting free use for research and commercial purposes.

Limitations & Caveats

The project is presented as a technical report (v2.0) and released checkpoints, indicating it is an active research project. Specific performance benchmarks on diverse real-world scenarios beyond those in the report are not detailed.

Health Check
Last commit

4 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
1
Star History
19 stars in the last 90 days

Explore Similar Projects

Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), and 19 more.

whisper by openai

Top 0.4% on sourcepulse · 86k stars
Speech recognition model for multilingual transcription/translation
created 2 years ago · updated 1 month ago