Ovis by AIDC-AI

MLLM architecture aligning visual/textual embeddings

Created 1 year ago
1,348 stars

Top 29.8% on SourcePulse

View on GitHub
Project Summary

Ovis is a novel Multimodal Large Language Model (MLLM) architecture designed for structurally aligning visual and textual embeddings. It offers a range of model sizes and configurations, targeting researchers and developers working on advanced vision-language tasks. The project provides a flexible framework for integrating various vision encoders with different LLMs, enabling state-of-the-art performance across multiple benchmarks.

How It Works

Ovis employs a "structural embedding alignment" approach: instead of feeding the LLM free-form continuous visual features, it gives visual embeddings the same structural form as textual ones, aiming for a more coherent and semantically rich joint representation. This design is intended to strengthen the model's cross-modal understanding and reasoning, leading to improved performance on complex multimodal tasks.
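
To make this concrete, here is a minimal PyTorch sketch of one plausible reading of structural alignment: continuous vision-encoder features are converted into probability distributions over a learnable visual embedding table, so visual tokens arise from the same kind of table lookup that textual tokens use. All names, dimensions, and the vocabulary size are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualEmbeddingTable(nn.Module):
    """Illustrative sketch (not Ovis's actual code): map continuous patch
    features to soft indices over a learnable visual vocabulary, mirroring
    the discrete embedding lookup used for textual tokens."""

    def __init__(self, vocab_size=8192, feature_dim=1024, embed_dim=2048):
        super().__init__()
        self.to_logits = nn.Linear(feature_dim, vocab_size)  # patch feature -> vocab logits
        self.table = nn.Embedding(vocab_size, embed_dim)     # learnable visual vocabulary

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, feature_dim) from a vision encoder
        probs = F.softmax(self.to_logits(patch_features), dim=-1)
        # Each visual token is a probability-weighted mixture of table rows,
        # structurally analogous to a (soft) textual embedding lookup.
        return probs @ self.table.weight  # (batch, num_patches, embed_dim)

# Usage: produce visual tokens shaped like textual embeddings.
vet = VisualEmbeddingTable()
patches = torch.randn(2, 256, 1024)  # dummy vision-encoder output
visual_tokens = vet(patches)         # (2, 256, 2048), ready to interleave with text
```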

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.10, install dependencies via pip install -r requirements.txt, and then install Ovis itself with pip install -e .
  • Prerequisites: Python 3.10, PyTorch 2.4.0, Transformers 4.46.2, DeepSpeed 0.15.4.
  • Resources: Model weights are available on Hugging Face. Inference can be run via a Python script (see the sketch after this list) or a Gradio web UI.
  • Links: Models, Demo
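
As a sketch of the script-based inference path, the snippet below loads an Ovis checkpoint through Hugging Face transformers. The model ID is an assumption (pick a current one from the Models link above), and the checkpoint's own remote code defines the Ovis-specific tokenizers and preprocessing, so treat this as a starting point rather than the full recipe.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: this model ID is illustrative; browse the AIDC-AI collection
# on Hugging Face (Models link above) for the current checkpoints.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Gemma2-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Ovis ships custom modeling/preprocessing code
).cuda()

# The remote code exposes Ovis-specific helpers (tokenizers, chat templates);
# their exact names are checkpoint-defined, so follow the model card for the
# complete inference loop rather than relying on this sketch.
```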

Highlighted Details

  • Offers a variety of model sizes from 1B to 34B parameters, including quantized versions (GPTQ-Int4, Int8).
  • Achieves competitive performance across benchmarks like MMBench, MMStar, MMMU, and MathVista.
  • Supports advanced features like video and multi-image processing, multilingual OCR, and high-resolution image handling.
  • Integrates with popular LLMs such as Qwen2.5 and Gemma2.

Maintenance & Community

  • Developed by the Alibaba Ovis team.
  • The project has seen frequent updates and releases, indicating active development.

Licensing & Compatibility

  • Licensed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project's disclaimer notes that although compliance-checking algorithms were used during training, the model cannot be guaranteed to be completely free of copyright issues or improper content, given the complexity of the data and the diversity of usage scenarios.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 17

Star History

203 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

  • Top 0.1% on SourcePulse, 5k stars
  • MoE vision-language model for multimodal understanding
  • Created 9 months ago, updated 6 months ago