Lyra by dvlab-research

Omni-cognition framework for speech, image, and video understanding/generation

Created 9 months ago
295 stars

Top 89.8% on SourcePulse

View on GitHub
Project Summary

Lyra is an open-source framework for speech-centric omni-cognition, designed to achieve state-of-the-art performance across a variety of speech and multi-modal tasks. It targets researchers and developers working with speech understanding, generation, and multi-modal AI, offering enhanced versatility and efficiency over existing models.

How It Works

Lyra employs a multi-modal architecture where data from different modalities (image, video, speech) are processed through encoders and projectors before entering a Large Language Model (LLM). Within the LLM, multi-modality LoRA and latent multi-modality extraction modules work together to enable simultaneous speech and text generation. This approach leverages a latent cross-modality regularizer to bridge speech and language tokens, facilitating efficient training and inference.
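
The sketch below illustrates that flow with toy PyTorch tensors: per-modality projectors map encoder features into the LLM embedding space, the projected tokens are concatenated with text embeddings, and a simple alignment loss stands in for the latent cross-modality regularizer. The module names, dimensions, and loss form are assumptions made for illustration, not Lyra's actual implementation.

    # Hypothetical sketch of the described pipeline (not Lyra's code):
    # modality encoders -> projectors -> shared LLM token space, with a
    # latent regularizer pulling speech tokens toward their text counterparts.
    import torch
    import torch.nn as nn

    class ModalityProjector(nn.Module):
        """Maps encoder features from one modality into the LLM embedding space."""
        def __init__(self, in_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
            )

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.proj(feats)

    def latent_cross_modality_loss(speech_tokens: torch.Tensor,
                                   text_tokens: torch.Tensor) -> torch.Tensor:
        """Illustrative regularizer: align pooled speech and text representations."""
        speech_mean = speech_tokens.mean(dim=1)
        text_mean = text_tokens.mean(dim=1)
        return 1.0 - nn.functional.cosine_similarity(speech_mean, text_mean, dim=-1).mean()

    # Toy usage with random features standing in for speech/vision encoder outputs.
    llm_dim = 1024
    speech_proj = ModalityProjector(in_dim=768, llm_dim=llm_dim)
    vision_proj = ModalityProjector(in_dim=1152, llm_dim=llm_dim)

    speech_feats = torch.randn(2, 50, 768)     # (batch, speech frames, encoder dim)
    vision_feats = torch.randn(2, 196, 1152)   # (batch, image patches, encoder dim)
    text_embeds = torch.randn(2, 32, llm_dim)  # embeddings from the LLM's embedding table

    speech_tokens = speech_proj(speech_feats)
    vision_tokens = vision_proj(vision_feats)

    # Concatenate modality tokens with text embeddings before the (LoRA-adapted) LLM.
    llm_input = torch.cat([vision_tokens, speech_tokens, text_embeds], dim=1)
    reg_loss = latent_cross_modality_loss(speech_tokens, text_embeds)
    print(llm_input.shape, reg_loss.item())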

Quick Start & Requirements

  • Install: Clone the repository and install via pip install -e . within a Python 3.10 Conda environment. Additional packages like fairseq are needed for text-speech generation.
  • Prerequisites: Python 3.10, Conda, and (optionally) fairseq for text-speech generation. A GPU is highly recommended for training and inference; a quick environment check is sketched after this list.
  • Resources: Requires downloading various datasets and pre-trained models (Qwen2-VL, Whisper, etc.). Training is demonstrated on 8 A100 GPUs.
  • Links: Demo, Project Page, Code
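
The illustrative snippet below (not part of Lyra's codebase) checks the prerequisites listed above before training or inference: the Python version, CUDA GPU availability, and the optional fairseq dependency used for text-speech generation.

    # Environment sanity check (illustrative only; assumes PyTorch is installed).
    import importlib.util
    import sys

    import torch

    if sys.version_info[:2] != (3, 10):
        print(f"Warning: Python 3.10 expected, found {sys.version.split()[0]}")

    if not torch.cuda.is_available():
        print("Warning: no CUDA GPU detected; training and inference will be very slow.")

    # fairseq is only required for text-speech generation.
    if importlib.util.find_spec("fairseq") is None:
        print("Note: fairseq is not installed; text-speech generation will be unavailable.")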

Highlighted Details

  • Achieves SOTA results on speech-centric tasks and supports image, video, speech understanding, and speech generation.
  • Offers three model sizes: Lyra_Mini_3B, Lyra_Base_9B, and Lyra_Pro_74B.
  • Supports long speech input (up to 2-3 hours) and streaming text-speech generation.
  • Provides a Gradio Web UI for user-friendly multi-modal interaction.

Maintenance & Community

The project is maintained by dvlab-research. The README does not provide further details on community channels or a roadmap.

Licensing & Compatibility

The data and checkpoints are licensed for research use only, and usage is further restricted by the licenses of the underlying models (LLaVA, Qwen, LLaMA, Whisper, GPT-4o). The dataset is released under CC BY-NC 4.0, which prohibits commercial use.

Limitations & Caveats

The online demo does not support long-speech functionality due to computational costs. The licensing explicitly restricts use to non-commercial, research purposes.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

ultravox by fixie-ai

Multimodal LLM for real-time voice interactions
0.2% · 4k stars · Created 1 year ago · Updated 2 weeks ago