Ovis by AIDC-AI

MLLM architecture aligning visual/textual embeddings

created 1 year ago
998 stars

Top 37.9% on sourcepulse

Project Summary

Ovis is a Multimodal Large Language Model (MLLM) architecture that structurally aligns visual and textual embeddings. It is released in a range of model sizes and configurations and targets researchers and developers working on vision-language tasks. The framework pairs different vision encoders with different LLMs and reports competitive results on benchmarks such as MMBench, MMStar, MMMU, and MathVista.

How It Works

Ovis uses a "structural embedding alignment" approach: visual embeddings are constructed with the same structure as textual embeddings, so that both modalities are represented in a coherent way. The aim is to improve the model's cross-modal understanding and reasoning, and with it performance on complex multimodal tasks.
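
As a rough illustration of the idea, the sketch below contrasts a text embedding (one row looked up in a table) with a visual embedding formed as a probability-weighted mix of rows from a visual embedding table, which is how the Ovis paper describes its visual side. It is a toy in plain Python, not the project's code: the table sizes, function names, and random values are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def text_embedding(token_id, table):
    """Text side: a discrete token id selects exactly one table row."""
    return list(table[token_id])


def visual_embedding(patch_logits, table):
    """Visual side (illustrative): a patch is turned into a probability
    distribution over a visual vocabulary, and its embedding is the
    probability-weighted sum of the table rows."""
    probs = softmax(patch_logits)
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


# Toy sizes and values, chosen only for illustration (hypothetical).
VOCAB, DIM = 4, 3
rng = random.Random(0)
visual_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]

# A patch whose logits strongly favour visual word 2 ...
sharp = visual_embedding([0.0, 0.0, 50.0, 0.0], visual_table)
# ... lands almost exactly on row 2, mirroring a plain table lookup.
lookup = text_embedding(2, visual_table)

# A patch with no preference gets the average of all rows.
blended = visual_embedding([0.0] * VOCAB, visual_table)
```

The point of the sketch is that both modalities end up as combinations of rows from an embedding table, which is the structural parallel the approach relies on.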

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.10, install dependencies with pip install -r requirements.txt, then install Ovis itself in editable mode with pip install -e . from the repository root.
  • Prerequisites: Python 3.10, Torch 2.4.0, Transformers 4.46.2, DeepSpeed 0.15.4.
  • Resources: Model weights are available on Hugging Face. Inference can be run via a Python script or a Gradio web UI.
  • Links: Models, Demo
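
Because the prerequisites are pinned to exact versions, it can help to confirm what is actually installed before running anything. The following is a minimal sketch using only the standard library; the helper names (installed_versions, find_mismatches) are hypothetical and not part of Ovis, and treating a local build suffix such as +cu121 as matching the pin is an assumption of this sketch.

```python
import sys
from importlib import metadata

# Pins copied from the prerequisites listed above.
PINNED = {"torch": "2.4.0", "transformers": "4.46.2", "deepspeed": "0.15.4"}
PYTHON_SERIES = (3, 10)


def installed_versions(packages):
    """Return {package: version or None} for the current environment."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(found, pinned=PINNED):
    """List human-readable problems; an empty list means every pin matches."""
    problems = []
    for name, wanted in pinned.items():
        have = found.get(name)
        if have is None:
            problems.append(f"{name}: not installed (want {wanted})")
        elif have.split("+")[0] != wanted:
            # Ignore a local build suffix such as "+cu121" when comparing.
            problems.append(f"{name}: found {have} (want {wanted})")
    return problems


if __name__ == "__main__":
    if sys.version_info[:2] != PYTHON_SERIES:
        print(f"python: found {sys.version.split()[0]} (want 3.10.x)")
    for line in find_mismatches(installed_versions(PINNED)):
        print(line)
```

Run inside the conda environment, the script prints one line per mismatch and nothing when every pin is satisfied.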

Highlighted Details

  • Offers a variety of model sizes from 1B to 34B parameters, including quantized versions (GPTQ-Int4, Int8).
  • Achieves competitive performance across benchmarks like MMBench, MMStar, MMMU, and MathVista.
  • Supports advanced features like video and multi-image processing, multilingual OCR, and high-resolution image handling.
  • Integrates with popular LLMs such as Qwen2.5 and Gemma2.
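
To get a feel for what the 1B to 34B range and the quantized variants mean for hardware, a back-of-the-envelope estimate of weight size is parameters times bytes per parameter. The sketch below covers weights only: it ignores activations, the KV cache, and the scale and zero-point metadata that GPTQ checkpoints carry. The bf16 entry is an assumption, since the page does not state the dtype of the unquantized checkpoints, and the parameter counts are the nominal sizes rather than exact figures.

```python
# Bytes per parameter for each weight format. "bf16" is an assumed dtype for
# the unquantized checkpoints; int8 and int4 correspond to the quantized ones.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(params_billion, fmt):
    """Approximate size of the weights alone, in GiB (2**30 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 2**30


# Smallest and largest nominal sizes mentioned above.
for size in (1, 34):
    row = ", ".join(
        f"{fmt}: {weight_gib(size, fmt):.1f} GiB" for fmt in BYTES_PER_PARAM
    )
    print(f"{size}B -> {row}")
```

Under these assumptions a 4-bit checkpoint needs roughly a quarter of the weight memory of its 16-bit counterpart, which is the main practical reason to reach for the quantized versions.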

Maintenance & Community

  • Developed by the Alibaba Ovis team.
  • The project has seen frequent updates and releases, indicating active development.

Licensing & Compatibility

  • Licensed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The disclaimer notes that while compliance-checking algorithms were used, the model cannot be guaranteed to be completely free of copyright issues or improper content due to data complexity and diverse usage scenarios.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 110 stars in the last 90 days
