Lyra by dvlab-research

Omni-cognition framework for speech, image, and video understanding/generation

Created 9 months ago
295 stars

Top 89.8% on SourcePulse

View on GitHub
Project Summary

Lyra is an open-source framework for speech-centric omni-cognition, designed to achieve state-of-the-art performance across a variety of speech and multi-modal tasks. It targets researchers and developers working with speech understanding, generation, and multi-modal AI, offering enhanced versatility and efficiency over existing models.

How It Works

Lyra employs a multi-modal architecture where data from different modalities (image, video, speech) are processed through encoders and projectors before entering a Large Language Model (LLM). Within the LLM, multi-modality LoRA and latent multi-modality extraction modules work together to enable simultaneous speech and text generation. This approach leverages a latent cross-modality regularizer to bridge speech and language tokens, facilitating efficient training and inference.
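
The sketch below illustrates that flow with toy PyTorch tensors: per-modality projectors map encoder features into the LLM embedding space, the projected tokens are concatenated with text embeddings, and a simple alignment loss stands in for the latent cross-modality regularizer. The module names, dimensions, and loss form are assumptions made for illustration, not Lyra's actual implementation.

    # Hypothetical sketch of the described pipeline (not Lyra's code):
    # modality encoders -> projectors -> shared LLM token space, with a
    # latent regularizer pulling speech tokens toward their text counterparts.
    import torch
    import torch.nn as nn

    class ModalityProjector(nn.Module):
        """Maps encoder features from one modality into the LLM embedding space."""
        def __init__(self, in_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
            )

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.proj(feats)

    def latent_cross_modality_loss(speech_tokens: torch.Tensor,
                                   text_tokens: torch.Tensor) -> torch.Tensor:
        """Illustrative regularizer: align pooled speech and text representations."""
        speech_mean = speech_tokens.mean(dim=1)
        text_mean = text_tokens.mean(dim=1)
        return 1.0 - nn.functional.cosine_similarity(speech_mean, text_mean, dim=-1).mean()

    # Toy usage with random features standing in for speech/vision encoder outputs.
    llm_dim = 1024
    speech_proj = ModalityProjector(in_dim=768, llm_dim=llm_dim)
    vision_proj = ModalityProjector(in_dim=1152, llm_dim=llm_dim)

    speech_feats = torch.randn(2, 50, 768)     # (batch, speech frames, encoder dim)
    vision_feats = torch.randn(2, 196, 1152)   # (batch, image patches, encoder dim)
    text_embeds = torch.randn(2, 32, llm_dim)  # embeddings from the LLM's embedding table

    speech_tokens = speech_proj(speech_feats)
    vision_tokens = vision_proj(vision_feats)

    # Concatenate modality tokens with text embeddings before the (LoRA-adapted) LLM.
    llm_input = torch.cat([vision_tokens, speech_tokens, text_embeds], dim=1)
    reg_loss = latent_cross_modality_loss(speech_tokens, text_embeds)
    print(llm_input.shape, reg_loss.item())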

Quick Start & Requirements

  • Install: Clone the repository and install via pip install -e . within a Python 3.10 Conda environment. Additional packages like fairseq are needed for text-speech generation.
  • Prerequisites: Python 3.10, Conda, and (optionally) fairseq for text-speech generation. A GPU is highly recommended for training and inference; a quick environment check is sketched after this list.
  • Resources: Requires downloading various datasets and pre-trained models (Qwen2-VL, Whisper, etc.). Training is demonstrated on 8 A100 GPUs.
  • Links: Demo, Project Page, Code
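
The illustrative snippet below (not part of Lyra's codebase) checks the prerequisites listed above before training or inference: the Python version, CUDA GPU availability, and the optional fairseq dependency used for text-speech generation.

    # Environment sanity check (illustrative only; assumes PyTorch is installed).
    import importlib.util
    import sys

    import torch

    if sys.version_info[:2] != (3, 10):
        print(f"Warning: Python 3.10 expected, found {sys.version.split()[0]}")

    if not torch.cuda.is_available():
        print("Warning: no CUDA GPU detected; training and inference will be very slow.")

    # fairseq is only required for text-speech generation.
    if importlib.util.find_spec("fairseq") is None:
        print("Note: fairseq is not installed; text-speech generation will be unavailable.")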

Highlighted Details

  • Achieves SOTA results on speech-centric tasks and supports image, video, speech understanding, and speech generation.
  • Offers three model sizes: Lyra_Mini_3B, Lyra_Base_9B, and Lyra_Pro_74B.
  • Supports long speech input (up to 2-3 hours) and streaming text-speech generation.
  • Provides a Gradio Web UI for user-friendly multi-modal interaction.

Maintenance & Community

The project is maintained by dvlab-research. The README does not provide further details on community channels or a roadmap.

Licensing & Compatibility

The data and checkpoints are licensed for research use only, and usage is further restricted by the licenses of the underlying models (LLaVA, Qwen, LLaMA, Whisper, GPT-4o). The dataset is released under CC BY-NC 4.0, which prohibits commercial use.

Limitations & Caveats

The online demo does not support long-speech functionality due to computational costs. The licensing explicitly restricts use to non-commercial, research purposes.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

ultravox by fixie-ai

Multimodal LLM for real-time voice interactions
0.2% · 4k stars · Created 1 year ago · Updated 2 weeks ago