dasheng-lm by xiaomi-research

Efficient audio understanding with general audio captions

Created 1 month ago
356 stars

Top 78.3% on SourcePulse

Project Summary

MiDashengLM is an efficient audio understanding model designed for general audio captioning and analysis. It targets researchers and developers needing to process diverse audio content, offering state-of-the-art performance and significant speedups over existing models.

How It Works

MiDashengLM integrates the Dasheng audio encoder with the Qwen2.5-Omni-7B Thinker decoder. It uniquely employs general audio captions, rather than ASR transcripts, for training. This caption-based alignment strategy allows the model to holistically understand speech, environmental sounds, and music, providing a richer learning signal and improved efficiency.
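
As a rough mental model of this caption-based alignment, the sketch below wires an audio encoder into a causal language model and trains it on caption tokens only. This is a conceptual PyTorch sketch, not the repository's actual code: the class name, the linear projector, and the dimensions are all illustrative.

import torch
import torch.nn as nn

class CaptionAlignedAudioLM(nn.Module):
    """Illustrative only: audio encoder -> linear projector -> causal LM."""

    def __init__(self, audio_encoder, decoder, audio_dim=768, text_dim=3584):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g. a Dasheng-style encoder returning (B, T, audio_dim)
        self.projector = nn.Linear(audio_dim, text_dim)  # map audio features into the LM embedding space
        self.decoder = decoder  # e.g. a Hugging Face causal LM (Qwen2.5-style)

    def forward(self, audio, caption_ids):
        # Project frame-level audio features into the decoder's embedding space.
        audio_embeds = self.projector(self.audio_encoder(audio))
        # Embed the caption tokens and prepend the audio embeddings.
        text_embeds = self.decoder.get_input_embeddings()(caption_ids)
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        # Next-token loss on caption positions only; audio positions are masked with -100.
        ignore = torch.full(audio_embeds.shape[:2], -100, dtype=torch.long, device=caption_ids.device)
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds, labels=labels)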

Quick Start & Requirements

  • Install via Hugging Face Transformers:
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b"

# trust_remote_code=True is required because the model ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
  • Requires PyTorch and the Hugging Face transformers library.
  • Inference requires sufficient GPU VRAM for the 7B model (the weights alone are roughly 14 GB in 16-bit precision); see the captioning sketch below.
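
The snippet below extends the quick start into a full captioning call, reusing the model, processor, and tokenizer loaded above. It is a hedged sketch: the chat-message schema with an "audio" content item follows the convention used by recent Hugging Face audio-text models, but the exact field names and prompt format should be verified against the model card.

import torch

# Hypothetical message schema; {"type": "audio", "path": ...} is an assumption.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this audio clip."},
            {"type": "audio", "path": "example.wav"},  # illustrative path
        ],
    }
]

model_inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)  # move tensors to the model's device

with torch.no_grad():
    generated = model.generate(**model_inputs, max_new_tokens=128)

print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])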

Highlighted Details

  • Achieves state-of-the-art performance on audio captioning, classification, and QA tasks, outperforming Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B.
  • Offers up to a 3.2x throughput speedup at comparable batch sizes, and up to 20x at larger batches (batch size 512, 30 s clips, 80 GB GPU); a rough timing harness is sketched after this list.
  • Trained on the novel ACAVCaps dataset: 38,662 hours of general audio captions covering speech, sound, music, and mixed content.
  • Full transparency: publicly sourced training data and a reproducible training pipeline.
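
The throughput figures above come from the authors' own benchmark. For a local sanity check, a minimal timing harness like the one below (an assumption, not the project's benchmark script) estimates samples per second for batched generation on a CUDA device.

import time
import torch

def samples_per_second(model, model_inputs, n_warmup=2, n_runs=5, max_new_tokens=64):
    """Rough wall-clock generation throughput (CUDA assumed)."""
    for _ in range(n_warmup):  # warm up kernels and allocator caches
        model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    batch_size = next(iter(model_inputs.values())).shape[0]
    return n_runs * batch_size / elapsed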

Maintenance & Community

  • Developed by Xiaomi Inc. (Horizon Team, MiLM Plus).
  • Cites an arXiv paper (2508.03983) with contributors listed alphabetically.

Licensing & Compatibility

  • Licensed under Apache License 2.0, permitting both research and commercial use.

Limitations & Caveats

  • The ACAVCaps dataset will be released after the ICASSP 2026 review process.
  • While the model is efficient, large batch sizes still require substantial GPU memory (e.g., 80 GB for batch size 512); a lower-memory loading option is sketched below.
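
If 80 GB-class hardware is unavailable, standard transformers loading options cut weight memory roughly in half versus float32. This is generic Hugging Face usage, not a recipe from the repository; device_map="auto" additionally requires the accelerate package.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b",
    torch_dtype=torch.bfloat16,  # 16-bit weights: ~14 GB for 7B parameters
    device_map="auto",           # shard across available GPUs via accelerate
    trust_remote_code=True,
)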

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 6

Star History

  • 33 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Jinze Bai (Research Scientist at Alibaba Qwen), and 1 more.

  • Qwen-Audio by QwenLM (Top 0.4%, 2k stars): Audio-language model for audio understanding and chat. Created 1 year ago; last updated 1 year ago.