cobra by h-zhao1997

Multimodal LLM research paper extending Mamba for efficient inference

created 1 year ago
285 stars

Top 92.8% on sourcepulse

View on GitHub
Project Summary

Cobra extends Mamba-based Large Language Models (LLMs) to the multi-modal domain, enabling efficient inference for vision-language tasks. It targets researchers and developers working with multi-modal AI, offering a Mamba-native architecture for potentially faster and more memory-efficient processing compared to Transformer-based models.

How It Works

Cobra integrates a vision encoder with a Mamba-based LLM, leveraging Mamba's selective state space model for efficient sequence processing. This approach avoids the quadratic cost of Transformer attention, which is particularly beneficial for long sequences and multi-modal inputs. The architecture is designed for efficient inference, and the work has been accepted to AAAI-25.
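The block below is a minimal conceptual sketch of this kind of vision-plus-Mamba pipeline, assuming the common LLaVA-style pattern of projecting image patch features into the LLM's embedding space and prepending them to the text tokens; the class and module names are illustrative, not the actual classes in the cobra package.

```python
# Conceptual sketch of a Cobra/LLaVA-style multi-modal forward pass.
# The names (ToyMambaVLM, vision_encoder, projector, llm) are illustrative only.
import torch
import torch.nn as nn


class ToyMambaVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a DINOv2 + SigLIP ViT backbone
        self.projector = projector            # maps patch features to the LLM embedding dim
        self.llm = llm                        # Mamba-based language model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_feats = self.vision_encoder(pixel_values)          # (B, num_patches, d_vision)
        visual_tokens = self.projector(patch_feats)              # (B, num_patches, d_llm)
        # Prepend projected visual tokens to the text embeddings; the Mamba LLM then
        # processes the fused sequence in time roughly linear in its length, rather
        # than the quadratic cost of full attention.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)   # (B, num_patches + T, d_llm)
        return self.llm(fused)
```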

Quick Start & Requirements

  • Installation: pip install -e . after cloning the repository. Requires packaging, ninja, and mamba-ssm<2.0.0.
  • Prerequisites: Python >= 3.8, PyTorch >= 2.1, Torchvision >= 0.16.0. CUDA support is recommended for GPU acceleration.
  • Usage: Load pretrained models via cobra.load() and generate text from images and prompts (see the sketch after this list).
  • Resources: Pretrained models are available on the Hugging Face Hub. Training requires significant computational resources (e.g., 8 GPUs).
  • Links: Demo, Evaluation Code, Training Scripts
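
The snippet below is a minimal usage sketch: cobra.load() and the Hugging Face Hub checkpoints are mentioned above, while the model ID ("cobra+3b"), the prompt-builder helper, and the generate() arguments are assumptions modeled on similar LLaVA-style codebases and should be verified against the repository's README.

```python
# Minimal usage sketch; cobra.load() comes from the summary above, everything else
# (model ID, prompt builder, generate() arguments) is assumed and should be checked
# against the repository's README.
import requests
import torch
from PIL import Image

from cobra import load

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pretrained Cobra VLM from the Hugging Face Hub ("cobra+3b" is a hypothetical ID).
vlm = load("cobra+3b")
vlm.to(device, dtype=torch.bfloat16)

# Fetch an image and write a prompt.
image_url = "https://example.com/cat.jpg"  # placeholder URL
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
user_prompt = "What is going on in this image?"

# Build the chat-style prompt expected by the model (assumed prompt-builder API).
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message=user_prompt)
prompt_text = prompt_builder.get_prompt()

# Generate a response conditioned on the image and prompt.
generated_text = vlm.generate(
    image,
    prompt_text,
    do_sample=True,
    temperature=0.4,
    max_new_tokens=256,
)
print(generated_text)
```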

Highlighted Details

  • Accepted to AAAI-25.
  • Demonstrates generation speed advantages over LLaVA v1.5.
  • Supports Mamba-2.8B and vision backbones like dinosiglip-vit-so-384px.
  • Provides scripts for dataset preprocessing and model training.

Maintenance & Community

The project was released in March 2024 and has since received updates such as prompt format fixes. It builds upon established projects like LLaVA and Hugging Face Transformers.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is relatively new, with the core code and pretrained models released in March 2024. While the efficiency gains are promising, more extensive real-world benchmarks and comparisons across a wider range of tasks and hardware configurations would be beneficial.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

12 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

created 2 years ago
updated 11 months ago
23k stars

Top 0.2% on sourcepulse