dasheng-lm by xiaomi-research

Efficient audio understanding with general audio captions

Created 1 month ago
356 stars

Top 78.3% on SourcePulse

Project Summary

MiDashengLM is an efficient audio understanding model designed for general audio captioning and analysis. It targets researchers and developers needing to process diverse audio content, offering state-of-the-art performance and significant speedups over existing models.

How It Works

MiDashengLM integrates the Dasheng audio encoder with the Qwen2.5-Omni-7B Thinker decoder. It uniquely employs general audio captions, rather than ASR transcripts, for training. This caption-based alignment strategy allows the model to holistically understand speech, environmental sounds, and music, providing a richer learning signal and improved efficiency.
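
As a rough mental model of this caption-based alignment, the sketch below wires an audio encoder into a causal language model and trains it on caption tokens only. This is a conceptual PyTorch sketch, not the repository's actual code: the class name, the linear projector, and the dimensions are all illustrative.

import torch
import torch.nn as nn

class CaptionAlignedAudioLM(nn.Module):
    """Illustrative only: audio encoder -> linear projector -> causal LM."""

    def __init__(self, audio_encoder, decoder, audio_dim=768, text_dim=3584):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g. a Dasheng-style encoder returning (B, T, audio_dim)
        self.projector = nn.Linear(audio_dim, text_dim)  # map audio features into the LM embedding space
        self.decoder = decoder  # e.g. a Hugging Face causal LM (Qwen2.5-style)

    def forward(self, audio, caption_ids):
        # Project frame-level audio features into the decoder's embedding space.
        audio_embeds = self.projector(self.audio_encoder(audio))
        # Embed the caption tokens and prepend the audio embeddings.
        text_embeds = self.decoder.get_input_embeddings()(caption_ids)
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        # Next-token loss on caption positions only; audio positions are masked with -100.
        ignore = torch.full(audio_embeds.shape[:2], -100, dtype=torch.long, device=caption_ids.device)
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds, labels=labels)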

Quick Start & Requirements

  • Install via Hugging Face Transformers:
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b"

# trust_remote_code=True is required because the model ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
  • Requires PyTorch and the Hugging Face transformers library.
  • Inference requires sufficient GPU VRAM for the 7B model (the weights alone are roughly 14 GB in 16-bit precision); see the captioning sketch below.
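
The snippet below extends the quick start into a full captioning call, reusing the model, processor, and tokenizer loaded above. It is a hedged sketch: the chat-message schema with an "audio" content item follows the convention used by recent Hugging Face audio-text models, but the exact field names and prompt format should be verified against the model card.

import torch

# Hypothetical message schema; {"type": "audio", "path": ...} is an assumption.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this audio clip."},
            {"type": "audio", "path": "example.wav"},  # illustrative path
        ],
    }
]

model_inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)  # move tensors to the model's device

with torch.no_grad():
    generated = model.generate(**model_inputs, max_new_tokens=128)

print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])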

Highlighted Details

  • Achieves state-of-the-art performance on audio captioning, classification, and QA tasks, outperforming Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B.
  • Offers up to a 3.2x throughput speedup at comparable batch sizes, and up to 20x at larger batches (batch size 512, 30 s clips, 80 GB GPU); a rough timing harness is sketched after this list.
  • Trained on the novel ACAVCaps dataset: 38,662 hours of general audio captions covering speech, sound, music, and mixed content.
  • Full transparency: publicly sourced training data and a reproducible training pipeline.
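
The throughput figures above come from the authors' own benchmark. For a local sanity check, a minimal timing harness like the one below (an assumption, not the project's benchmark script) estimates samples per second for batched generation on a CUDA device.

import time
import torch

def samples_per_second(model, model_inputs, n_warmup=2, n_runs=5, max_new_tokens=64):
    """Rough wall-clock generation throughput (CUDA assumed)."""
    for _ in range(n_warmup):  # warm up kernels and allocator caches
        model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    batch_size = next(iter(model_inputs.values())).shape[0]
    return n_runs * batch_size / elapsed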

Maintenance & Community

  • Developed by Xiaomi Inc. (Horizon Team, MiLM Plus).
  • Cites an arXiv paper (2508.03983) with contributors listed alphabetically.

Licensing & Compatibility

  • Licensed under Apache License 2.0, permitting both research and commercial use.

Limitations & Caveats

  • The ACAVCaps dataset will be released after the ICASSP 2026 review process.
  • While the model is efficient, large batch sizes still require substantial GPU memory (e.g., 80 GB for batch size 512); a lower-memory loading option is sketched below.
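
If 80 GB-class hardware is unavailable, standard transformers loading options cut weight memory roughly in half versus float32. This is generic Hugging Face usage, not a recipe from the repository; device_map="auto" additionally requires the accelerate package.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b",
    torch_dtype=torch.bfloat16,  # 16-bit weights: ~14 GB for 7B parameters
    device_map="auto",           # shard across available GPUs via accelerate
    trust_remote_code=True,
)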

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 6

Star History

  • 33 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Jinze Bai (Research Scientist at Alibaba Qwen), and 1 more.

  • Qwen-Audio by QwenLM (Top 0.4%, 2k stars): Audio-language model for audio understanding and chat. Created 1 year ago; last updated 1 year ago.