Lyra by dvlab-research

Omni-cognition framework for speech, image, and video understanding/generation

created 8 months ago
287 stars

Top 92.3% on sourcepulse

View on GitHub
Project Summary

Lyra is an open-source framework for speech-centric omni-cognition, designed to achieve state-of-the-art performance across a variety of speech and multi-modal tasks. It targets researchers and developers working with speech understanding, generation, and multi-modal AI, offering enhanced versatility and efficiency over existing models.

How It Works

Lyra employs a multi-modal architecture where data from different modalities (image, video, speech) are processed through encoders and projectors before entering a Large Language Model (LLM). Within the LLM, multi-modality LoRA and latent multi-modality extraction modules work together to enable simultaneous speech and text generation. This approach leverages a latent cross-modality regularizer to bridge speech and language tokens, facilitating efficient training and inference.
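
The encode-project-adapt flow can be pictured with a short sketch. This is illustrative only: the class names, feature dimensions, and LoRA placement below are assumptions made for exposition and do not mirror Lyra's actual code.

```python
# Illustrative only: a toy version of the encode -> project -> LLM-with-LoRA flow.
# Module names, sizes, and the LoRA placement are assumptions, not Lyra's real code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a low-rank (LoRA) update, standing in for an adapted LLM layer."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)          # frozen pretrained weight
        self.lora_a = nn.Linear(dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))

class ToyOmniModel(nn.Module):
    """Per-modality encoders/projectors map features into a shared token space for the LLM."""
    def __init__(self, llm_dim=1024):
        super().__init__()
        self.speech_proj = nn.Linear(128, llm_dim)   # stands in for a Whisper-style encoder output
        self.vision_proj = nn.Linear(256, llm_dim)   # stands in for an image/video encoder output
        self.llm_block = LoRALinear(llm_dim)         # stands in for one LoRA-adapted LLM layer

    def forward(self, speech_feats, vision_feats, text_embeds):
        tokens = torch.cat(
            [self.speech_proj(speech_feats), self.vision_proj(vision_feats), text_embeds], dim=1
        )
        return self.llm_block(tokens)                # joint sequence the LLM reasons over

model = ToyOmniModel()
out = model(torch.randn(1, 50, 128), torch.randn(1, 20, 256), torch.randn(1, 10, 1024))
print(out.shape)  # torch.Size([1, 80, 1024])
```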

Quick Start & Requirements

  • Install: Clone the repository and install via pip install -e . within a Python 3.10 Conda environment. Additional packages like fairseq are needed for text-speech generation.
  • Prerequisites: Python 3.10, Conda, fairseq (optional). A GPU is highly recommended for training and inference; a rough environment check is sketched after this list.
  • Resources: Requires downloading various datasets and pre-trained models (Qwen2-VL, Whisper, etc.). Training is demonstrated on 8 A100 GPUs.
  • Links: Demo, Project Page, Code
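
Before downloading the larger datasets and checkpoints, a quick sanity check of the environment can save time. The snippet below is an assumed helper based on the prerequisites above, not a script shipped with the repository.

```python
# Rough environment sanity check based on the prerequisites above (assumed, not official).
import importlib.util
import sys

def check():
    major, minor = sys.version_info[:2]
    print(f"Python {major}.{minor}", "(ok)" if (major, minor) == (3, 10) else "(3.10 recommended)")

    if importlib.util.find_spec("torch") is not None:
        import torch
        print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    else:
        print("torch: not installed")

    # fairseq is only needed for the text-speech generation path.
    print("fairseq:", "found" if importlib.util.find_spec("fairseq") else "not installed (optional)")

if __name__ == "__main__":
    check()
```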

Highlighted Details

  • Achieves SOTA results on speech-centric tasks and supports image, video, speech understanding, and speech generation.
  • Offers three model sizes: Lyra_Mini_3B, Lyra_Base_9B, and Lyra_Pro_74B.
  • Supports long speech input (up to 2-3 hours) and streaming text-speech generation.
  • Provides a Gradio Web UI for user-friendly multi-modal interaction (a minimal illustrative launch script is sketched after this list).
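
The Web UI is built on Gradio. A minimal stand-in looks roughly like the sketch below; the respond function is a placeholder, and Lyra's actual interface and inference pipeline differ.

```python
# Minimal Gradio sketch of a speech+text interface.
# respond() is a placeholder; Lyra's real UI and model pipeline are more involved.
import gradio as gr

def respond(audio_path, prompt):
    # A real handler would run the audio and prompt through Lyra's model here.
    return f"(placeholder) received audio={audio_path!r}, prompt={prompt!r}"

demo = gr.Interface(
    fn=respond,
    inputs=[gr.Audio(type="filepath", label="Speech input"),
            gr.Textbox(label="Text prompt")],
    outputs=gr.Textbox(label="Model response"),
    title="Lyra-style multi-modal demo (illustrative)",
)

if __name__ == "__main__":
    demo.launch()
```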

Maintenance & Community

The project is maintained by dvlab-research. The README does not explicitly list community channels or a roadmap.

Licensing & Compatibility

The data and checkpoints are licensed for research use only. Usage must also comply with the licenses of the underlying models (LLaVA, Qwen, LLaMA, Whisper, GPT-4o). The dataset is released under CC BY-NC 4.0, which prohibits commercial use.

Limitations & Caveats

The online demo does not support long-speech functionality due to computational costs. The licensing explicitly restricts use to non-commercial, research purposes.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days
