SALMONN by bytedance

Multimodal LLM for speech, audio events, and music inputs

created 2 years ago
1,291 stars

Top 31.5% on sourcepulse

View on GitHub
Project Summary

SALMONN is a family of multimodal Large Language Models (LLMs) designed to process and understand speech, audio events, and music. Developed by Tsinghua University and ByteDance, it aims to equip LLMs with "ears" and cognitive hearing abilities. This unlocks emergent capabilities such as multilingual speech recognition, speech translation, and audio-speech co-reasoning, as a step toward hearing-enabled artificial general intelligence.

How It Works

SALMONN integrates audio perception by using a window-level Q-Former to fuse the outputs of a Whisper speech encoder and a BEATs audio encoder. The fused audio representations are then projected into the LLM's input embedding space, and a LoRA adaptor aligns the LLM's input and output spaces, allowing the model to respond to audio inputs based on textual or even spoken commands. This approach enables far richer tasks than traditional audio processing pipelines.
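The fusion path above can be sketched in a few lines of NumPy. This is an illustrative approximation, not the project's code: the window size, query count, and feature dimensions are assumptions, and mean-pooling stands in for the Q-Former's learned cross-attention queries.

```python
import numpy as np

# Hypothetical shapes (assumptions for illustration only).
T = 100                              # encoded audio frames
d_whisper, d_beats, d_llm = 1280, 768, 4096
window = 17                          # frames per Q-Former window (assumed)

rng = np.random.default_rng(0)
whisper_feats = rng.standard_normal((T, d_whisper))  # speech encoder output
beats_feats = rng.standard_normal((T, d_beats))      # audio-event encoder output

# 1. Fuse the two encoders' frame-level features by concatenation.
fused = np.concatenate([whisper_feats, beats_feats], axis=-1)  # (T, 2048)

# 2. Window-level pooling: compress each window to one vector. The real
#    model uses a Q-Former (cross-attention with trainable queries) here.
n_win = -(-T // window)                      # ceil division -> 6 windows
fused = np.pad(fused, ((0, n_win * window - T), (0, 0)))
windows = fused.reshape(n_win, window, -1).mean(axis=1)        # (6, 2048)

# 3. Linear projection into the LLM's input embedding space.
W_proj = rng.standard_normal((windows.shape[-1], d_llm)) * 0.01
audio_tokens = windows @ W_proj                                # (6, 4096)

print(audio_tokens.shape)  # one pseudo-token per audio window
```

The resulting `audio_tokens` would be prepended to the text-prompt embeddings before the (LoRA-adapted) LLM forward pass.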

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.9.17, Whisper large v2, Fine-tuned BEATs_iter3+ (AS2M) (cpt2), Vicuna 13B v1.1. Requires A100-SXM-80GB GPU for training and inference.
  • Resources: Requires downloading multiple large model checkpoints (Whisper, BEATs, Vicuna, SALMONN checkpoints).
  • Docs: https://github.com/bytedance/SALMONN

Highlighted Details

  • Supports multilingual speech recognition and translation.
  • Achieves emergent capabilities like audio-speech co-reasoning and understanding spoken commands.
  • Paper accepted by ICLR 2024.
  • Released video-SALMONN and data processing scripts for speech quality assessment.

Maintenance & Community

  • Developed by teams from Tsinghua University and ByteDance.
  • Key contributors listed in the README.
  • Paper citations available for SALMONN and video-SALMONN.

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The README specifies significant hardware requirements (an A100-SXM-80GB GPU) for both training and inference, and setup requires downloading multiple large pre-trained checkpoints, so expect a substantial resource footprint and setup effort. The license is not specified, which may restrict commercial or closed-source use.

Health Check
Last commit

3 weeks ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
2
Star History
79 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai

0.4%
4k
Multimodal LLM for real-time voice interactions
created 1 year ago
updated 4 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

AudioGPT by AIGC-Audio

0.1%
10k
Audio processing and generation research project
created 2 years ago
updated 1 year ago