MLLM toolkit for speech, language, audio, and music processing
SLAM-LLM is a deep learning toolkit for training custom multimodal large language models (MLLMs) focused on speech, language, audio, and music processing. It provides researchers and developers with detailed training recipes and high-performance checkpoints for inference, aiming to simplify and advance research in these areas.
How It Works
SLAM-LLM leverages a unified data format and dynamic prompt selection for multi-task training, supporting diverse audio and speech tasks. It incorporates advanced training techniques like DeepSpeed for reduced memory usage and dynamic frame batching for significant time savings on large datasets. The architecture is designed for extensibility, allowing easy integration of new models and tasks.
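As a rough illustration of the unified-format and dynamic-prompt idea, the sketch below attaches a randomly selected, task-specific prompt to each training sample. The field names, task tags, and prompt texts are hypothetical placeholders, not SLAM-LLM's actual schema or API.

```python
import random

# Hypothetical unified-format samples: each carries an audio path, a task tag,
# and a target text; the prompt is drawn per-sample from a task-specific pool.
PROMPT_POOLS = {
    "asr": ["Transcribe the speech.", "Write down what is said."],
    "aac": ["Describe the audio clip.", "Caption this sound."],
    "music": ["Describe the music.", "What instruments and mood do you hear?"],
}

def build_training_example(sample: dict) -> dict:
    """Attach a dynamically selected prompt to a unified-format sample."""
    prompt = random.choice(PROMPT_POOLS[sample["task"]])
    return {
        "audio_path": sample["audio_path"],
        "prompt": prompt,
        "target": sample["target"],
    }

if __name__ == "__main__":
    sample = {"task": "asr", "audio_path": "clip_0001.wav", "target": "hello world"}
    print(build_training_example(sample))
```

Sampling the prompt per example (rather than fixing one prompt per task) is one simple way to let a single model handle several speech, audio, and music tasks in the same training run.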
Quick Start & Requirements
Setup involves cloning the transformers, peft, and SLAM-LLM repositories, checking out specific tags, and installing dependencies via pip. PyTorch 2.0.1 with CUDA 11.8 is recommended, and Fairseq may be needed for some examples.
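A minimal environment sanity check, assuming the recommended stack above (PyTorch 2.0.1 built against CUDA 11.8); it only inspects the local install and is not part of SLAM-LLM's own tooling.

```python
import torch

# Report the installed PyTorch/CUDA versions and GPU availability.
print("torch:", torch.__version__)      # expected to start with "2.0.1"
print("cuda:", torch.version.cuda)      # expected "11.8" (None for CPU-only builds)
print("gpu available:", torch.cuda.is_available())

if not torch.__version__.startswith("2.0.1"):
    print("Note: other PyTorch versions may work, but 2.0.1 is the recommended one.")
```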
Maintenance & Community
The project actively calls for contributions and examples from developers and researchers. Community updates and Q&A are synced via Slack or WeChat groups.
Licensing & Compatibility
The project appears to rely on dependencies from Hugging Face (transformers, peft) and PyTorch, which carry permissive licenses (Apache 2.0, BSD-style). However, the specific license for SLAM-LLM itself is not explicitly stated in the provided README snippet.
Limitations & Caveats
DeepSpeed support is noted as still needing improvement. The project relies on specific versions of its dependencies, which may require careful management to keep the environment compatible.