Discover and explore top open-source AI tools and projects—updated daily.
OpenMOSSFoundation model for unified audio understanding
Top 63.6% on SourcePulse
Summary
MOSS-Audio is an open-source foundation model for unified audio understanding, designed to process complex real-world audio. It targets researchers and engineers, enabling comprehensive analysis including speech, environmental sounds, music, captioning, question answering, and complex reasoning within a single framework.
How It Works
The model employs a modular architecture: a custom MOSS-Audio-Encoder, modality adapter, and LLM. Key innovations include DeepStack cross-layer feature injection, integrating features from multiple encoder layers into the LLM's early layers to preserve fine-grained acoustic details alongside high-level semantics. A time-marker insertion strategy enhances temporal awareness, enabling explicit "what happened when" understanding for time-aware QA and event localization.
Quick Start & Requirements
Requires Python 3.12 and Conda. Installation involves cloning, creating a Conda env, installing ffmpeg, and pip install -e ".[torch-runtime]" with CUDA 12.8. FlashAttention 2 is optional. Models (4B/8B parameter, Instruct/Thinking) are downloaded via hf download. Basic inference runs with python infer.py; a Gradio demo is available via python app.py. Fine-tuning scripts for LoRA and full-parameter training are provided.
Highlighted Details
MOSS-Audio-8B-Thinking leads open-source models in general audio understanding (71.08 avg accuracy). The 8B-Instruct variant achieves the top score (3.7252) in speech captioning. For ASR, it reports the lowest overall CER (11.30) on a diverse benchmark, excelling in challenging conditions like code-switching. Timestamp ASR performance is notable, with the 8B-Instruct model significantly outperforming competitors like Qwen3-Omni and Gemini-3.1-Pro.
Maintenance & Community
Developed by the OpenMOSS team and MOSI.AI. Specific community channels, active development, or notable contributors are not detailed in the provided README.
Licensing & Compatibility
Models are released under the Apache License 2.0, which is permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
The README does not explicitly detail known limitations, alpha/beta status, or specific unsupported platforms. A technical report and blog post are noted as "coming soon."
1 week ago
Inactive
lucidrains