Ovis by AIDC-AI

MLLM architecture aligning visual/textual embeddings

Created 1 year ago
1,348 stars

Top 29.8% on SourcePulse

View on GitHub
Project Summary

Ovis is a novel Multimodal Large Language Model (MLLM) architecture designed for structurally aligning visual and textual embeddings. It offers a range of model sizes and configurations, targeting researchers and developers working on advanced vision-language tasks. The project provides a flexible framework for integrating various vision encoders with different LLMs, enabling state-of-the-art performance across multiple benchmarks.

How It Works

Ovis employs a "structural embedding alignment" approach: instead of feeding the LLM free-form continuous visual features, it gives visual embeddings the same structural form as textual ones, aiming for a more coherent and semantically rich joint representation. This design is intended to strengthen the model's cross-modal understanding and reasoning, leading to improved performance on complex multimodal tasks.
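
To make this concrete, here is a minimal PyTorch sketch of one plausible reading of structural alignment: continuous vision-encoder features are converted into probability distributions over a learnable visual embedding table, so visual tokens arise from the same kind of table lookup that textual tokens use. All names, dimensions, and the vocabulary size are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualEmbeddingTable(nn.Module):
    """Illustrative sketch (not Ovis's actual code): map continuous patch
    features to soft indices over a learnable visual vocabulary, mirroring
    the discrete embedding lookup used for textual tokens."""

    def __init__(self, vocab_size=8192, feature_dim=1024, embed_dim=2048):
        super().__init__()
        self.to_logits = nn.Linear(feature_dim, vocab_size)  # patch feature -> vocab logits
        self.table = nn.Embedding(vocab_size, embed_dim)     # learnable visual vocabulary

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, feature_dim) from a vision encoder
        probs = F.softmax(self.to_logits(patch_features), dim=-1)
        # Each visual token is a probability-weighted mixture of table rows,
        # structurally analogous to a (soft) textual embedding lookup.
        return probs @ self.table.weight  # (batch, num_patches, embed_dim)

# Usage: produce visual tokens shaped like textual embeddings.
vet = VisualEmbeddingTable()
patches = torch.randn(2, 256, 1024)  # dummy vision-encoder output
visual_tokens = vet(patches)         # (2, 256, 2048), ready to interleave with text
```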

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.10, install dependencies via pip install -r requirements.txt, and then install Ovis itself with pip install -e .
  • Prerequisites: Python 3.10, PyTorch 2.4.0, Transformers 4.46.2, DeepSpeed 0.15.4.
  • Resources: Model weights are available on Hugging Face. Inference can be run via a Python script (see the sketch after this list) or a Gradio web UI.
  • Links: Models, Demo
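
As a sketch of the script-based inference path, the snippet below loads an Ovis checkpoint through Hugging Face transformers. The model ID is an assumption (pick a current one from the Models link above), and the checkpoint's own remote code defines the Ovis-specific tokenizers and preprocessing, so treat this as a starting point rather than the full recipe.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: this model ID is illustrative; browse the AIDC-AI collection
# on Hugging Face (Models link above) for the current checkpoints.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Gemma2-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Ovis ships custom modeling/preprocessing code
).cuda()

# The remote code exposes Ovis-specific helpers (tokenizers, chat templates);
# their exact names are checkpoint-defined, so follow the model card for the
# complete inference loop rather than relying on this sketch.
```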

Highlighted Details

  • Offers a variety of model sizes from 1B to 34B parameters, including quantized versions (GPTQ-Int4, Int8).
  • Achieves competitive performance across benchmarks like MMBench, MMStar, MMMU, and MathVista.
  • Supports advanced features like video and multi-image processing, multilingual OCR, and high-resolution image handling.
  • Integrates with popular LLMs such as Qwen2.5 and Gemma2.

Maintenance & Community

  • Developed by the Alibaba Ovis team.
  • The project has seen frequent updates and releases, indicating active development.

Licensing & Compatibility

  • Licensed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project's disclaimer notes that although compliance-checking algorithms were used during training, the model cannot be guaranteed to be completely free of copyright issues or improper content, given the complexity of the data and the diversity of usage scenarios.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 17

Star History

203 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

  • Top 0.1% on SourcePulse, 5k stars
  • MoE vision-language model for multimodal understanding
  • Created 9 months ago, updated 6 months ago