Ola by Ola-Omni

Omni-modal language model research paper

Created 7 months ago
366 stars

Top 76.9% on SourcePulse

Project Summary

Ola is an omni-modal language model designed for comprehensive understanding across text, image, video, and audio modalities. It targets researchers and developers seeking to build advanced multi-modal AI systems, offering competitive performance against specialized models through its novel progressive modality alignment strategy and unified architecture.

How It Works

Ola employs an omni-modal architecture capable of processing diverse inputs simultaneously. Its core innovation lies in a progressive alignment training strategy, where speech acts as a bridge between language and audio, and video connects visual and audio information. This approach, coupled with custom cross-modality video-audio data, aims to enhance the model's ability to capture inter-modal relationships effectively.
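
As a rough mental model of this staged recipe, here is a minimal conceptual sketch in Python. The stage names, dataset labels, and run_training helper are hypothetical illustrations, not the repository's actual training API; only the stage ordering follows the description above.

```python
# Conceptual sketch of progressive modality alignment (hypothetical names,
# not Ola's real training code). Each stage extends the previous checkpoint.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    modalities: tuple   # modalities aligned in this stage
    data: str           # kind of training data used

STAGES = [
    Stage("image_text", ("text", "image"), "image-text pairs"),
    Stage("video", ("text", "image", "video"), "video-text data"),
    Stage("audio", ("text", "image", "video", "audio"),
          "speech plus cross-modality video-audio data"),
]

def run_training(model, stages=STAGES):
    """Hypothetical driver: align modalities stage by stage."""
    for stage in stages:
        print(f"[{stage.name}] aligning {stage.modalities} on {stage.data}")
        # a real pipeline would fine-tune `model` here
    return model
```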

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n ola python=3.10), activate it (conda activate ola), and install the package in editable mode with pip install -e . For training, install the extras with pip install -e ".[train]" and then install FlashAttention with pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10 and PyTorch. For audio processing, download the specific audio encoder weights (large-v3.pt and BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt) from Hugging Face.
  • Resources: Requires downloading model checkpoints (e.g., Ola-7b); a hedged download sketch follows this list.
  • Links: Project Page: https://ola-omni.github.io, Huggingface Models: https://huggingface.co/collections/THUdyh/ola-67b8220eb93406ec87aeec37, Demo: https://huggingface.co/spaces/THUdyh/Ola.
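
To fetch the checkpoint programmatically, the sketch below uses huggingface_hub. The repo id THUdyh/Ola-7b is an assumption based on the collection linked above; the exact locations of large-v3.pt and the BEATs weights should be taken from the project README.

```python
# Sketch: downloading the Ola-7b checkpoint with huggingface_hub.
# The repo id is assumed from the THUdyh collection; verify it before use.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="THUdyh/Ola-7b")
print("Ola-7b checkpoint downloaded to:", model_dir)

# large-v3.pt and BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt must be
# downloaded separately and placed where the Ola codebase expects them.
```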

Highlighted Details

  • Achieves Rank #1 on OpenCompass Multi-modal Leaderboard (under 15B parameters) with an average score of 72.6.
  • Supports real-time streaming decoding for text and speech (see the inference sketch after this list).
  • Provides intermediate models (Ola-Image, Ola-Video) for custom model building.
  • The codebase is built on LLaVA.
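
The repository ships its own LLaVA-based inference scripts, so the snippet below only illustrates the generic Hugging Face streaming-generation pattern; loading Ola-7b through the transformers Auto classes with trust_remote_code is an assumption and may not match the project's actual entry points.

```python
# Generic streamed text generation with transformers (assumed compatibility;
# Ola's own inference scripts may be required instead).
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "THUdyh/Ola-7b"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
inputs = tok("Describe the attached clip.", return_tensors="pt")

# generate() runs in a background thread while decoded text streams out
Thread(target=model.generate,
       kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128)).start()
for piece in streamer:
    print(piece, end="", flush=True)
```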

Maintenance & Community

  • Active development with recent releases in February 2025.
  • Contact via GitHub issues or email (liuzuyan19@gmail.com).
  • Acknowledgements include LLaVA and VLMEvalKit teams.

Licensing & Compatibility

  • The repository does not explicitly state a license. The codebase is based on LLaVA, which is Apache 2.0 licensed. Further clarification on Ola's specific licensing is recommended for commercial use.

Limitations & Caveats

  • Evaluation code for omni-modal benchmarks is listed as "Coming Soon."
  • Requires manual download and placement of specific audio encoder weights for audio processing.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 3
  • Star History: 9 stars in the last 30 days
