LLaSM  by LinkSoul-AI

Open-source speech-language assistant for multimodal conversation

created 2 years ago
557 stars

Top 58.4% on sourcepulse

Project Summary

LLaSM is an open-source, commercially usable conversational model that supports bilingual (Chinese/English) speech-text multimodal dialogue. It lets users talk to a large language model by voice directly, avoiding the extra complexity and transcription errors of a separate Automatic Speech Recognition (ASR) stage.

How It Works

LLaSM pairs a large language model (LLM) with a speech encoder so that conversations can run end to end on voice input. Audio is processed directly and fused with the text input rather than being transcribed first. The LLM backbone is a pre-trained model such as Chinese-Llama-2-7B or Baichuan-7B; the speech encoder appears to be Whisper, since Whisper large v2 is listed as a prerequisite.
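The fusion step described above can be illustrated with a toy sketch. It is not taken from the LLaSM code: the placeholder token, the function names, and the single linear adaptor are assumptions chosen to show one common way of splicing audio features into an LLM's input sequence.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical marker for where the audio goes; not taken from the LLaSM code.
AUDIO_PLACEHOLDER = "<audio>"


def project(frame: Vector, weights: List[Vector]) -> Vector:
    """Linear map from the audio-encoder dimension to the LLM embedding dimension."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def fuse(tokens: List[str],
         text_embeddings: Dict[str, Vector],
         audio_frames: List[Vector],
         adaptor_weights: List[Vector]) -> List[Vector]:
    """Build one input sequence: text tokens are looked up in the embedding
    table, and the placeholder is replaced by the projected audio frames."""
    fused: List[Vector] = []
    for token in tokens:
        if token == AUDIO_PLACEHOLDER:
            fused.extend(project(f, adaptor_weights) for f in audio_frames)
        else:
            fused.append(text_embeddings[token])
    return fused


# Toy sizes: 3-dimensional audio features, 2-dimensional LLM embeddings.
text_embeddings = {"User:": [1.0, 0.0], "Assistant:": [0.0, 1.0]}
audio_frames = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
adaptor_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # shape: 2 x 3

sequence = fuse(["User:", AUDIO_PLACEHOLDER, "Assistant:"],
                text_embeddings, audio_frames, adaptor_weights)
print(len(sequence))  # one vector per text token plus one per audio frame
```

In a real system the audio frames would come from the speech encoder and the sequence would be passed to the LLM as input embeddings; the sketch only shows how the two modalities end up in one sequence without an intermediate transcript.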

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using conda and pip.
    git clone https://github.com/LinkSoul-AI/LLaSM
    cd LLaSM
    conda create -n llasm python=3.10 -y
    conda activate llasm
    pip install --upgrade pip
    pip install -e .
    
  • Prerequisites: CUDA-enabled GPU (for LLASM_DEVICE="cuda:0"), Python 3.10, Whisper large v2 model.
  • Demo: Available at Hugging Face Spaces.
  • Paper: arXiv:2308.15930
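
The prerequisites mention LLASM_DEVICE="cuda:0". Assuming this is an environment variable naming the compute device, a script might read it as in the minimal sketch below. The helper name resolve_device, the CPU fallback, and the validation rule are all hypothetical and not taken from the repository.

```python
import os
from typing import Mapping


def resolve_device(env: Mapping[str, str] = os.environ, default: str = "cpu") -> str:
    """Return the device named by LLASM_DEVICE, or `default` when it is unset or blank."""
    value = env.get("LLASM_DEVICE", "").strip()
    if not value:
        return default
    # Accept "cpu" or "cuda", optionally followed by ":<index>"; reject anything else early.
    kind, sep, index = value.partition(":")
    if kind not in {"cpu", "cuda"} or (sep and not index.isdigit()):
        raise ValueError(f"Unrecognized LLASM_DEVICE value: {value!r}")
    return value


device = resolve_device({"LLASM_DEVICE": "cuda:0"})  # explicit GPU selection
fallback = resolve_device({})                        # variable not set
```

Passing the environment as a mapping keeps the helper easy to test; called with no arguments it reads the real process environment.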

Highlighted Details

  • Described by its authors as the first open-source, commercially usable model for bilingual Chinese/English speech-text multimodal dialogue.
  • Offers direct voice input, avoiding separate ASR steps.
  • Supports Chinese-Llama-2-7B and Baichuan-7B LLMs.
  • Includes a Chinese/English speech SFT dataset (LLaSM-Audio-Instructions).

Maintenance & Community

  • The accompanying paper (arXiv:2308.15930) dates from August 2023; the last recorded commit was about a year ago.
  • The README mentions a WeChat group for community discussion.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: Permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The README lists int4 quantization and Docker deployment as TODO items, so these features are probably not yet implemented or documented.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
