MOSS-Audio by OpenMOSS

Foundation model for unified audio understanding

Created 3 months ago

596 stars

Top 54.0% on SourcePulse

Project Summary

Summary

MOSS-Audio is an open-source foundation model for unified audio understanding, designed to process complex real-world audio. It targets researchers and engineers who work with speech, environmental sounds, and music, and it supports captioning, question answering, and complex reasoning within a single framework.

How It Works

The model uses a modular architecture with three components: a custom MOSS-Audio-Encoder, a modality adapter, and an LLM. Key innovations include DeepStack cross-layer feature injection, which integrates features from multiple encoder layers into the LLM's early layers to preserve fine-grained acoustic details alongside high-level semantics. A time-marker insertion strategy adds temporal awareness, enabling explicit "what happened when" understanding for time-aware QA and event localization.
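The two mechanisms above can be illustrated with a toy sketch. This is not the MOSS-Audio implementation: the marker format "<t=1.0s>", the use of element-wise addition for injection, and every function name below are assumptions chosen for illustration, since the summary does not specify them.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def insert_time_markers(frames: Sequence[str], frame_seconds: float,
                        frames_per_marker: int) -> List[str]:
    """Interleave time-marker tokens with audio frame tokens.

    A marker of the (hypothetical) form "<t=1.0s>" is placed before every
    frames_per_marker-th frame, labeled with that frame's start time.
    """
    if frames_per_marker <= 0:
        raise ValueError("frames_per_marker must be positive")
    out: List[str] = []
    for i, frame in enumerate(frames):
        if i % frames_per_marker == 0:
            out.append(f"<t={i * frame_seconds:.1f}s>")
        out.append(frame)
    return out


def inject_features(hidden: List[Vector], audio_positions: Sequence[int],
                    features: Sequence[Vector]) -> List[Vector]:
    """Return a copy of hidden with features[j] added at audio_positions[j].

    Element-wise addition is an assumption; the real model may combine
    encoder features with LLM hidden states differently.
    """
    if len(audio_positions) != len(features):
        raise ValueError("one feature vector is needed per audio position")
    out = [list(vec) for vec in hidden]
    for pos, feat in zip(audio_positions, features):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out


def forward_with_injection(hidden: List[Vector],
                           layers: Sequence[Callable[[List[Vector]], List[Vector]]],
                           per_layer_features: Sequence[Sequence[Vector]],
                           audio_positions: Sequence[int]) -> List[Vector]:
    """Run the layers in order, injecting encoder features into the early ones.

    per_layer_features[k] holds features taken from one encoder layer and is
    injected before LLM layer k; later layers run without injection.
    """
    for k, layer in enumerate(layers):
        if k < len(per_layer_features):
            hidden = inject_features(hidden, audio_positions, per_layer_features[k])
        hidden = layer(hidden)
    return hidden


if __name__ == "__main__":
    print(insert_time_markers(["a0", "a1", "a2", "a3", "a4"], 0.5, 2))
```

In this sketch, only the first len(per_layer_features) layers receive encoder features, which mirrors the idea of injecting into the LLM's early layers while leaving deeper layers untouched.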

Quick Start & Requirements

Requires Python 3.12 and Conda. Installation involves cloning the repository, creating a Conda environment, installing ffmpeg, and running pip install -e ".[torch-runtime]" with CUDA 12.8. FlashAttention 2 is optional. Models (4B and 8B parameters, in Instruct and Thinking variants) are downloaded via hf download. Basic inference runs with python infer.py; a Gradio demo is available via python app.py. Fine-tuning scripts for LoRA and full-parameter training are provided.
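Before installing, it can help to confirm that the stated prerequisites are present. The following is a minimal sketch, not part of the project: the function name missing_prerequisites is hypothetical, and it only checks the Python version and whether conda and ffmpeg are on PATH, using the standard library.

```python
import shutil
import sys
from typing import Callable, List, Optional, Tuple


def missing_prerequisites(
    python_version: Tuple[int, int] = (sys.version_info.major, sys.version_info.minor),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed.

    The arguments are injectable so the check can be exercised without
    changing the real environment.
    """
    problems: List[str] = []
    if python_version != (3, 12):
        problems.append(
            f"Python 3.12 expected, found {python_version[0]}.{python_version[1]}"
        )
    for tool in ("conda", "ffmpeg"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

Run it inside the activated Conda environment so that an ffmpeg installed there is visible on PATH. It does not check CUDA or FlashAttention 2.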

Highlighted Details

MOSS-Audio-8B-Thinking leads open-source models in general audio understanding (71.08 average accuracy). The 8B-Instruct variant achieves the top score (3.7252) in speech captioning. For ASR, MOSS-Audio reports the lowest overall CER (11.30) on a diverse benchmark, excelling in challenging conditions like code-switching. Timestamp ASR performance is notable, with the 8B-Instruct model significantly outperforming competitors like Qwen3-Omni and Gemini-3.1-Pro.
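For readers unfamiliar with the ASR metric, character error rate (CER) is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that standard definition, returned as a percentage. It is an illustration only: whether the reported 11.30 is a percentage, and which text normalization the benchmark applies before scoring, are not stated in the summary.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete a reference char
                current[j - 1] + 1,                   # insert a hypothesis char
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

Because the denominator is the reference length, a hypothesis with many inserted characters can score above 100.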

Maintenance & Community

Developed by the OpenMOSS team and MOSI.AI. Specific community channels, active development, or notable contributors are not detailed in the provided README.

Licensing & Compatibility

Models are released under the Apache License 2.0, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The README does not explicitly detail known limitations, alpha/beta status, or specific unsupported platforms. A technical report and blog post are noted as "coming soon."

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Star History

30 stars in the last 30 days