Ola by Ola-Omni

Omni-modal language model research paper

Created 7 months ago
366 stars

Top 76.9% on SourcePulse

Project Summary

Ola is an omni-modal language model designed for comprehensive understanding across text, image, video, and audio modalities. It targets researchers and developers seeking to build advanced multi-modal AI systems, offering competitive performance against specialized models through its novel progressive modality alignment strategy and unified architecture.

How It Works

Ola employs an omni-modal architecture capable of processing diverse inputs simultaneously. Its core innovation lies in a progressive alignment training strategy, where speech acts as a bridge between language and audio, and video connects visual and audio information. This approach, coupled with custom cross-modality video-audio data, aims to enhance the model's ability to capture inter-modal relationships effectively.
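
As a rough mental model of this staged recipe, here is a minimal conceptual sketch in Python. The stage names, dataset labels, and run_training helper are hypothetical illustrations, not the repository's actual training API; only the stage ordering follows the description above.

```python
# Conceptual sketch of progressive modality alignment (hypothetical names,
# not Ola's real training code). Each stage extends the previous checkpoint.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    modalities: tuple   # modalities aligned in this stage
    data: str           # kind of training data used

STAGES = [
    Stage("image_text", ("text", "image"), "image-text pairs"),
    Stage("video", ("text", "image", "video"), "video-text data"),
    Stage("audio", ("text", "image", "video", "audio"),
          "speech plus cross-modality video-audio data"),
]

def run_training(model, stages=STAGES):
    """Hypothetical driver: align modalities stage by stage."""
    for stage in stages:
        print(f"[{stage.name}] aligning {stage.modalities} on {stage.data}")
        # a real pipeline would fine-tune `model` here
    return model
```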

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n ola python=3.10), activate it (conda activate ola), and install the package in editable mode with pip install -e . For training, install the extras with pip install -e ".[train]" and then install FlashAttention with pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10 and PyTorch. For audio processing, download the specific audio encoder weights (large-v3.pt and BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt) from Hugging Face.
  • Resources: Requires downloading model checkpoints (e.g., Ola-7b); a hedged download sketch follows this list.
  • Links: Project Page: https://ola-omni.github.io, Huggingface Models: https://huggingface.co/collections/THUdyh/ola-67b8220eb93406ec87aeec37, Demo: https://huggingface.co/spaces/THUdyh/Ola.
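
To fetch the checkpoint programmatically, the sketch below uses huggingface_hub. The repo id THUdyh/Ola-7b is an assumption based on the collection linked above; the exact locations of large-v3.pt and the BEATs weights should be taken from the project README.

```python
# Sketch: downloading the Ola-7b checkpoint with huggingface_hub.
# The repo id is assumed from the THUdyh collection; verify it before use.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="THUdyh/Ola-7b")
print("Ola-7b checkpoint downloaded to:", model_dir)

# large-v3.pt and BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt must be
# downloaded separately and placed where the Ola codebase expects them.
```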

Highlighted Details

  • Achieves Rank #1 on OpenCompass Multi-modal Leaderboard (under 15B parameters) with an average score of 72.6.
  • Supports real-time streaming decoding for text and speech (see the inference sketch after this list).
  • Provides intermediate models (Ola-Image, Ola-Video) for custom model building.
  • The codebase is built on LLaVA.
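
The repository ships its own LLaVA-based inference scripts, so the snippet below only illustrates the generic Hugging Face streaming-generation pattern; loading Ola-7b through the transformers Auto classes with trust_remote_code is an assumption and may not match the project's actual entry points.

```python
# Generic streamed text generation with transformers (assumed compatibility;
# Ola's own inference scripts may be required instead).
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "THUdyh/Ola-7b"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
inputs = tok("Describe the attached clip.", return_tensors="pt")

# generate() runs in a background thread while decoded text streams out
Thread(target=model.generate,
       kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128)).start()
for piece in streamer:
    print(piece, end="", flush=True)
```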

Maintenance & Community

  • Active development with recent releases in February 2025.
  • Contact via GitHub issues or email (liuzuyan19@gmail.com).
  • Acknowledgements include LLaVA and VLMEvalKit teams.

Licensing & Compatibility

  • The repository does not explicitly state a license. The codebase is based on LLaVA, which is Apache 2.0 licensed. Further clarification on Ola's specific licensing is recommended for commercial use.

Limitations & Caveats

  • Evaluation code for omni-modal benchmarks is listed as "Coming Soon."
  • Requires manual download and placement of specific audio encoder weights for audio processing.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 3
  • Star History: 9 stars in the last 30 days
