SLAM-LLM by X-LANCE

MLLM toolkit for speech, language, audio, and music processing

created 1 year ago
863 stars

Top 42.4% on sourcepulse

Project Summary

SLAM-LLM is a deep learning toolkit for training custom multimodal large language models (MLLMs) focused on speech, language, audio, and music processing. It provides researchers and developers with detailed training recipes and high-performance checkpoints for inference, aiming to simplify and advance research in these areas.

How It Works

SLAM-LLM leverages a unified data format and dynamic prompt selection for multi-task training, supporting diverse audio and speech tasks. It incorporates advanced training techniques like DeepSpeed for reduced memory usage and dynamic frame batching for significant time savings on large datasets. The architecture is designed for extensibility, allowing easy integration of new models and tasks.
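The dynamic prompt selection described above can be sketched in a few lines: each unified-format sample carries a task tag, and one prompt template is drawn per sample at training time. This is a minimal illustrative sketch, not SLAM-LLM's actual API; the prompt texts, task names, and field names below are assumptions.

```python
import random

# Hypothetical prompt pool: SLAM-LLM's real templates and task tags differ.
PROMPT_POOL = {
    "asr": [
        "Transcribe the speech into text.",
        "Write down what is said in the recording.",
    ],
    "aac": [
        "Describe the sounds in this audio clip.",
        "Caption the audio events you hear.",
    ],
}

def select_prompt(task: str, rng: random.Random) -> str:
    """Draw one prompt template for the given task (dynamic prompt selection)."""
    return rng.choice(PROMPT_POOL[task])

def build_example(sample: dict, rng: random.Random) -> dict:
    """Attach a task-specific prompt to a unified-format sample."""
    return {
        "audio": sample["audio"],
        "prompt": select_prompt(sample["task"], rng),
        "target": sample["target"],
    }
```

Because prompts are resampled per example rather than fixed per task, a multi-task model sees varied instructions for the same objective, which helps it generalize across phrasings.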

Quick Start & Requirements

  • Installation: Requires cloning transformers, peft, and SLAM-LLM repositories, checking out specific tags, and installing dependencies via pip. PyTorch 2.0.1 with CUDA 11.8 is recommended. Fairseq may be needed for some examples.
  • Docker: A Docker image is available for easier setup.
  • Prerequisites: PyTorch, transformers, peft, fairseq (optional), and CUDA-enabled GPUs are recommended.
  • Resources: Large datasets (e.g., 100,000 hours) are supported, implying significant storage and compute requirements for training.
  • Links: SLAM-Omni Paper, SLAM-Omni Demo, Slack/WeChat
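The install flow above can be sketched roughly as follows. This is a hypothetical recipe, not the project's verbatim instructions: the `<tag>` placeholders stand in for the pinned versions listed in the SLAM-LLM README, which should be consulted for the exact tags.

```shell
# Sketch only -- replace <tag> with the versions pinned in the SLAM-LLM README.
git clone https://github.com/huggingface/transformers.git && cd transformers
git checkout <tag>
pip install -e . && cd ..

git clone https://github.com/huggingface/peft.git && cd peft
git checkout <tag>
pip install -e . && cd ..

git clone https://github.com/X-LANCE/SLAM-LLM.git && cd SLAM-LLM
pip install -e .

# Recommended PyTorch build (2.0.1 with CUDA 11.8):
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118
```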

Highlighted Details

  • Supports large-scale industrial training for datasets up to 100,000 hours.
  • Offers full reproduction for SLAM-Omni, a timbre-controllable voice interaction system.
  • Includes recipes for various tasks: ASR, Contextual ASR, VSR, Speech-to-Text Translation, TTS, SEC, AAC, Spatial Audio Understanding, and Music Captioning.
  • Features mixed-precision training and multi-GPU support (DDP, FSDP, DeepSpeed).
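The dynamic frame batching mentioned under How It Works can be illustrated with a simple packing scheme: instead of a fixed batch size, utterances are grouped until a total-frame budget is reached, so short clips share a batch and padding waste stays low on large datasets. The function name and the frame budget below are illustrative assumptions, not SLAM-LLM's implementation.

```python
def dynamic_frame_batches(lengths, max_frames=2000):
    """Yield lists of utterance indices whose summed frame count fits the budget.

    Sorting by length first groups similar-duration clips together,
    which further reduces padding within each batch.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batch, total = [], 0
    for i in order:
        # Start a new batch when adding this utterance would exceed the budget.
        if batch and total + lengths[i] > max_frames:
            yield batch
            batch, total = [], 0
        batch.append(i)
        total += lengths[i]
    if batch:
        yield batch
```

With a fixed batch size, a batch containing one long and several short utterances is padded to the longest member; budget-based packing instead lets batch size shrink for long clips and grow for short ones.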

Maintenance & Community

The project actively invites contributions and examples from developers and researchers. Community updates and Q&A take place in Slack and WeChat groups.

Licensing & Compatibility

The project appears to rely on dependencies from Hugging Face (transformers, peft) and PyTorch, which have permissive licenses (Apache 2.0, BSD-style). However, the specific license for SLAM-LLM itself is not explicitly stated in the provided README snippet.

Limitations & Caveats

DeepSpeed support is noted by the maintainers as still needing improvement. The project pins specific versions of its dependencies (transformers, peft, PyTorch), which may require careful management to keep environments compatible.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 72 stars in the last 90 days
