Omni-cognition framework for speech, image, and video understanding/generation
Lyra is an open-source framework for speech-centric omni-cognition, designed to achieve state-of-the-art performance across a variety of speech and multi-modal tasks. It targets researchers and developers working on speech understanding, generation, and multi-modal AI, and offers greater versatility and efficiency than existing models.
How It Works
Lyra employs a multi-modal architecture where data from different modalities (image, video, speech) are processed through encoders and projectors before entering a Large Language Model (LLM). Within the LLM, multi-modality LoRA and latent multi-modality extraction modules work together to enable simultaneous speech and text generation. This approach leverages a latent cross-modality regularizer to bridge speech and language tokens, facilitating efficient training and inference.
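The data flow can be pictured with a minimal PyTorch sketch. The class names, dimensions, and LoRA wiring below are illustrative assumptions rather than Lyra's actual implementation; they only mirror the encoder, projector, and LoRA-adapted LLM path described above.

# Conceptual sketch only; module names, dimensions, and the LoRA wrapper are
# illustrative assumptions, not Lyra's real classes or configuration.
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps modality-specific encoder features into the LLM embedding space."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank trainable update (multi-modality LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Toy feature shapes standing in for real encoder outputs.
llm_dim = 1024
speech_feats = torch.randn(1, 50, 768)    # e.g. Whisper-style speech encoder output
image_feats  = torch.randn(1, 196, 1024)  # e.g. ViT patch features
text_embeds  = torch.randn(1, 32, llm_dim)

speech_proj = Projector(768, llm_dim)
image_proj  = Projector(1024, llm_dim)

# Projected modality tokens are concatenated with text embeddings and fed to the
# LoRA-adapted LLM; during training, a latent cross-modality regularizer would act
# here to align the speech and language token representations.
llm_inputs = torch.cat([image_proj(image_feats), speech_proj(speech_feats), text_embeds], dim=1)
lora_layer = LoRALinear(nn.Linear(llm_dim, llm_dim))
hidden = lora_layer(llm_inputs)
print(hidden.shape)  # torch.Size([1, 278, 1024])

In the real framework, separate output heads would decode text tokens and speech tokens from the shared hidden states to enable simultaneous generation.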
Quick Start & Requirements
Install Lyra in editable mode inside a Python 3.10 Conda environment:

pip install -e .

Additional packages such as fairseq are required for the optional text-speech generation path. A GPU is highly recommended for both training and inference.
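Before training or inference, it can help to confirm the environment matches these requirements. The short check below is an illustrative helper, not part of Lyra, and assumes PyTorch is the expected backend:

# Optional environment sanity check; illustrative only, not shipped with Lyra.
import importlib.util
import sys

print("Python version:", sys.version_info[:2])  # the quick start assumes Python 3.10

try:
    import torch
    print("CUDA GPU available:", torch.cuda.is_available())  # GPU strongly recommended
except ImportError:
    print("PyTorch is not installed")

# fairseq is only needed for the optional text-speech generation path.
print("fairseq installed:", importlib.util.find_spec("fairseq") is not None)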
Highlighted Details
Maintenance & Community
The project is actively maintained by dvlab-research. Further details on community channels or roadmaps are not explicitly provided in the README.
Licensing & Compatibility
The data and checkpoints are licensed for research use only, and usage is further restricted by the licenses of the underlying models (LLaVA, Qwen, LLaMA, Whisper, GPT-4o). The dataset is released under CC BY-NC 4.0, which prohibits commercial use.
Limitations & Caveats
The online demo does not support long-speech functionality due to computational costs. The licensing explicitly restricts use to non-commercial, research purposes.