Multimodal LLM for speech, audio events, and music inputs
Top 31.5% on sourcepulse
SALMONN is a family of multimodal Large Language Models (LLMs) designed to perceive and understand speech, audio events, and music. Developed by Tsinghua University and ByteDance, it aims to equip LLMs with "ears" and cognitive hearing abilities, yielding emergent capabilities such as multilingual speech recognition, speech translation, and audio-speech co-reasoning, as a step towards hearing-enabled artificial general intelligence.
How It Works
SALMONN integrates audio perception by using a window-level Q-Former to fuse the outputs of a Whisper speech encoder and a BEATs audio encoder; the fused audio representations are then projected into the LLM's input space. A LoRA adapter further aligns the LLM's input and output spaces, allowing the model to respond to audio based on textual or even spoken instructions. This design lets a single model cover a broader range of speech, audio-event, and music tasks than conventional single-task audio systems.
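A minimal sketch of this fusion path is shown below, assuming pre-computed, time-aligned encoder features. The dimensions, window size, and number of query tokens are illustrative placeholders rather than SALMONN's actual configuration, and a single cross-attention layer stands in for the full Q-Former stack.

```python
import torch
import torch.nn as nn


class WindowLevelQFormer(nn.Module):
    """Cross-attends learned query tokens to each window of concatenated
    Whisper + BEATs features, then projects the result to the LLM width."""

    def __init__(self, audio_dim=1280 + 768, qformer_dim=768,
                 llm_dim=4096, num_queries=1, window_size=17):
        super().__init__()
        self.window_size = window_size
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        self.in_proj = nn.Linear(audio_dim, qformer_dim)
        # One cross-attention layer stands in for the full Q-Former stack.
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=8,
                                                batch_first=True)
        self.out_proj = nn.Linear(qformer_dim, llm_dim)  # into the LLM input space

    def forward(self, whisper_feats, beats_feats):
        # whisper_feats: (B, T, 1280), beats_feats: (B, T, 768), time-aligned.
        fused = self.in_proj(torch.cat([whisper_feats, beats_feats], dim=-1))
        B, T, D = fused.shape
        pad = (-T) % self.window_size                      # pad the tail window
        fused = nn.functional.pad(fused, (0, 0, 0, pad))
        windows = fused.view(B * (T + pad) // self.window_size,
                             self.window_size, D)
        q = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)
        out, _ = self.cross_attn(q, windows, windows)      # one token set per window
        out = out.reshape(B, -1, out.size(-1))
        return self.out_proj(out)   # audio tokens to prepend to the text embeddings


audio_tokens = WindowLevelQFormer()(torch.randn(2, 100, 1280),
                                    torch.randn(2, 100, 768))
print(audio_tokens.shape)          # -> torch.Size([2, 6, 4096])
```

The key idea is that each fixed-size window of encoder frames is compressed into a small number of tokens, which keeps the audio sequence presented to the LLM short while preserving temporal locality.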
Quick Start & Requirements
pip install -r requirements.txt
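Before downloading the large checkpoints, it can help to confirm the environment matches the stated requirements. The snippet below is an optional sketch (not part of the repository): it reports available GPU memory against the A100-SXM-80GB guidance and resamples a clip to 16 kHz mono, the rate Whisper-style encoders expect; the file paths are placeholders.

```python
import torch
import torchaudio

# Report available GPU memory against the README's A100-SXM-80GB guidance.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.0f} GB")
else:
    print("No CUDA GPU detected; the README assumes an A100-SXM-80GB class device.")

# Convert an input clip to 16 kHz mono, the rate Whisper-style encoders expect.
wav, sr = torchaudio.load("example.wav")              # placeholder path
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 16000)
torchaudio.save("example_16k.wav", wav, 16000)
```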
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README specifies significant hardware requirements (A100-SXM-80GB) for both training and inference, and setup involves downloading several large pre-trained models, so the resource footprint and setup complexity are substantial. No license is specified, which may restrict use in some settings.