OSUM by ASLP-lab

Open Speech Understanding Model research paper

created 6 months ago
378 stars

Top 76.3% on sourcepulse

View on GitHub
Project Summary

OSUM (Open Speech Understanding Model) addresses academia's lack of the extensive resources that industry enjoys for building advanced speech understanding models. It provides a transparent, multi-task framework for researchers to build on and innovate with, enabling comprehensive speech-based interactions.

How It Works

OSUM combines a Whisper encoder with a Qwen2 LLM, leveraging an "ASR+X" training strategy. This approach efficiently optimizes Automatic Speech Recognition (ASR) alongside other speech tasks like vocal event detection, emotion recognition, and speaker classification. This multi-task optimization allows for stable training and strong performance even with limited academic resources.

Quick Start & Requirements

  • Install dependencies via pip install -r requirements.txt.
  • Refer to the official documentation for inference and training instructions.

Highlighted Details

  • Supports a wide range of speech tasks: ASR, speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC).
  • Achieves competitive performance against models like Qwen2-Audio with fewer resources.
  • Technical report v2.0 details increased training data (50.5K hours) and model improvements.
  • Offers an online test page for immediate evaluation.

Maintenance & Community

The project is open-sourced by ASLP@NPU. Contact xlgeng@mail.nwpu.edu.cn for inquiries.

Licensing & Compatibility

Licensed under Apache 2.0, permitting free use for research and commercial purposes.

Limitations & Caveats

The project is presented as a technical report (v2.0) and released checkpoints, indicating it is an active research project. Specific performance benchmarks on diverse real-world scenarios beyond those in the report are not detailed.

Health Check
Last commit

4 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
1
Star History
19 stars in the last 90 days

Explore Similar Projects

Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), and 19 more.

whisper by openai

Top 0.4% on sourcepulse · 86k stars
Speech recognition model for multilingual transcription/translation
created 2 years ago · updated 1 month ago