Multimodal LLM for speech, audio events, and music inputs
Top 31.5% on sourcepulse
SALMONN is a family of multimodal Large Language Models (LLMs) designed to perceive and understand speech, audio events, and music. Developed by Tsinghua University and ByteDance, it aims to equip LLMs with "ears" and cognitive hearing abilities, yielding emergent capabilities such as multilingual speech recognition, speech translation, and audio-speech co-reasoning, as a step towards hearing-enabled artificial general intelligence.
How It Works
SALMONN integrates audio perception by using a window-level Q-Former to fuse the outputs of a Whisper speech encoder and a BEATs audio encoder; the fused audio representations are then projected into the LLM's input space. A LoRA adapter further aligns the LLM's input and output spaces, allowing the model to respond to audio based on textual or even spoken instructions. This design lets a single model cover a broader range of speech, audio-event, and music tasks than conventional single-task audio systems.
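A minimal sketch of this fusion path is shown below, assuming pre-computed, time-aligned encoder features. The dimensions, window size, and number of query tokens are illustrative placeholders rather than SALMONN's actual configuration, and a single cross-attention layer stands in for the full Q-Former stack.

```python
import torch
import torch.nn as nn


class WindowLevelQFormer(nn.Module):
    """Cross-attends learned query tokens to each window of concatenated
    Whisper + BEATs features, then projects the result to the LLM width."""

    def __init__(self, audio_dim=1280 + 768, qformer_dim=768,
                 llm_dim=4096, num_queries=1, window_size=17):
        super().__init__()
        self.window_size = window_size
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        self.in_proj = nn.Linear(audio_dim, qformer_dim)
        # One cross-attention layer stands in for the full Q-Former stack.
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=8,
                                                batch_first=True)
        self.out_proj = nn.Linear(qformer_dim, llm_dim)  # into the LLM input space

    def forward(self, whisper_feats, beats_feats):
        # whisper_feats: (B, T, 1280), beats_feats: (B, T, 768), time-aligned.
        fused = self.in_proj(torch.cat([whisper_feats, beats_feats], dim=-1))
        B, T, D = fused.shape
        pad = (-T) % self.window_size                      # pad the tail window
        fused = nn.functional.pad(fused, (0, 0, 0, pad))
        windows = fused.view(B * (T + pad) // self.window_size,
                             self.window_size, D)
        q = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)
        out, _ = self.cross_attn(q, windows, windows)      # one token set per window
        out = out.reshape(B, -1, out.size(-1))
        return self.out_proj(out)   # audio tokens to prepend to the text embeddings


audio_tokens = WindowLevelQFormer()(torch.randn(2, 100, 1280),
                                    torch.randn(2, 100, 768))
print(audio_tokens.shape)          # -> torch.Size([2, 6, 4096])
```

The key idea is that each fixed-size window of encoder frames is compressed into a small number of tokens, which keeps the audio sequence presented to the LLM short while preserving temporal locality.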
Quick Start & Requirements
pip install -r requirements.txt
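Before downloading the large checkpoints, it can help to confirm the environment matches the stated requirements. The snippet below is an optional sketch (not part of the repository): it reports available GPU memory against the A100-SXM-80GB guidance and resamples a clip to 16 kHz mono, the rate Whisper-style encoders expect; the file paths are placeholders.

```python
import torch
import torchaudio

# Report available GPU memory against the README's A100-SXM-80GB guidance.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.0f} GB")
else:
    print("No CUDA GPU detected; the README assumes an A100-SXM-80GB class device.")

# Convert an input clip to 16 kHz mono, the rate Whisper-style encoders expect.
wav, sr = torchaudio.load("example.wav")              # placeholder path
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 16000)
torchaudio.save("example_16k.wav", wav, 16000)
```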
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README specifies significant hardware requirements (A100-SXM-80GB) for both training and inference, and setup involves downloading several large pre-trained models, so the resource footprint and setup complexity are substantial. No license is specified, which may restrict use in some settings.