MiDashengLM: Efficient audio understanding with general audio captions
MiDashengLM is an efficient audio understanding model designed for general audio captioning and analysis. It targets researchers and developers who need to process diverse audio content, offering state-of-the-art performance and significantly faster inference than comparable models.
How It Works
MiDashengLM integrates the Dasheng audio encoder with the Qwen2.5-Omni-7B Thinker decoder. It uniquely employs general audio captions, rather than ASR transcripts, for training. This caption-based alignment strategy allows the model to holistically understand speech, environmental sounds, and music, providing a richer learning signal and improved efficiency.
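The coupling can be pictured as a thin projection layer between the two components: audio features are mapped into the decoder's embedding space, prepended to the caption tokens, and the decoder is trained with ordinary next-token prediction on the caption. The toy sketch below illustrates only that training signal; every class, dimension, and layer count is an illustrative assumption, not MiDashengLM's actual architecture.

# Toy sketch of caption-based alignment. All names and sizes are illustrative
# assumptions; this is not MiDashengLM's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAudioEncoder(nn.Module):
    # Stand-in for the Dasheng encoder: maps a raw waveform to frame-level features.
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(1, feat_dim, kernel_size=400, stride=160)

    def forward(self, wav):                      # wav: (batch, samples)
        feats = self.conv(wav.unsqueeze(1))      # (batch, feat_dim, frames)
        return feats.transpose(1, 2)             # (batch, frames, feat_dim)

class CaptionAlignedLM(nn.Module):
    # Audio features are projected into the decoder's embedding space, prepended
    # to the caption embeddings, and the decoder learns next-token prediction on
    # the caption itself (no ASR transcript involved).
    def __init__(self, vocab=1000, feat_dim=64, model_dim=128):
        super().__init__()
        self.encoder = ToyAudioEncoder(feat_dim)
        self.projector = nn.Linear(feat_dim, model_dim)
        self.embed = nn.Embedding(vocab, model_dim)
        layer = nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # toy stand-in for the Thinker decoder
        self.lm_head = nn.Linear(model_dim, vocab)

    def forward(self, wav, caption_ids):
        audio_tokens = self.projector(self.encoder(wav))   # (batch, T_a, model_dim)
        text_tokens = self.embed(caption_ids[:, :-1])      # teacher forcing
        seq = torch.cat([audio_tokens, text_tokens], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.decoder(seq, mask=mask)              # causal decoding over audio + text
        logits = self.lm_head(hidden[:, audio_tokens.size(1):])  # score caption positions only
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption_ids[:, 1:].reshape(-1))

toy = CaptionAlignedLM()
wav = torch.randn(2, 16000)                # two fake 1 s clips at 16 kHz
caption = torch.randint(0, 1000, (2, 12))  # fake caption token ids
print(toy(wav, caption))                   # scalar cross-entropy training loss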
Quick Start & Requirements
# Load MiDashengLM and its preprocessing pipeline from the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)  # custom architecture, so trust_remote_code is required
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)     # handles audio preprocessing
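Once loaded, captioning an audio clip follows the usual Transformers generate pattern. The snippet below is a hedged sketch: the prompt text and the processor's audio and sampling_rate arguments are assumptions, since the real interface is defined by the model's remote code; consult the model card for the authoritative example.

# Hypothetical inference sketch; argument names are assumptions, not the
# model's documented API.
import soundfile as sf

waveform, sr = sf.read("example.wav")  # mono waveform as a NumPy array
inputs = processor(
    text="Describe this audio.",       # assumed prompt field
    audio=waveform,                    # assumed audio field
    sampling_rate=sr,
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])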
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats