cobra by h-zhao1997

Multimodal LLM research paper extending Mamba for efficient inference

Created 1 year ago
288 stars

Top 91.2% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Cobra extends Mamba-based Large Language Models (LLMs) to the multimodal domain, enabling efficient inference for vision-language tasks. It targets researchers and developers working with multimodal AI, offering a Mamba-native architecture for potentially faster and more memory-efficient processing compared to Transformer-based models.

How It Works

Cobra integrates a vision encoder with a Mamba-based LLM, leveraging Mamba's selective state space model (SSM) for efficient sequence processing. Because the SSM processes tokens with a linear-time recurrence rather than pairwise attention, it sidesteps the quadratic complexity of Transformer attention mechanisms, which is particularly beneficial for long sequences and multimodal inputs. The architecture is designed for efficient inference, and the accompanying paper was accepted at AAAI-25.
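
To make the complexity argument concrete, the toy PyTorch recurrence below makes a single pass over the sequence, so cost grows linearly with length L instead of quadratically as with attention. It is an illustration of the selective-scan idea only, not Cobra's implementation (which uses the fused CUDA kernels from mamba-ssm); the function and parameter names here (toy_selective_ssm, w_delta) are hypothetical.

    import torch

    def toy_selective_ssm(x, w_delta, A, B, C):
        # Toy diagonal SSM with an input-dependent step size (the "selective" part).
        # One sequential pass -> O(L) time with a fixed-size state, versus the
        # O(L^2) pairwise scores of Transformer attention. Illustration only.
        batch, length, dim = x.shape
        h = x.new_zeros(batch, dim)                   # recurrent state, size independent of L
        outputs = []
        for t in range(length):
            delta = torch.sigmoid(x[:, t] @ w_delta)  # input-dependent gate (selectivity)
            h = torch.exp(-delta * A) * h + delta * B * x[:, t]  # discretized state update
            outputs.append(C * h)                     # per-step readout
        return torch.stack(outputs, dim=1)            # (batch, length, dim)

    # Doubling the sequence length roughly doubles the work (linear scaling).
    x = torch.randn(2, 64, 16)
    w_delta = 0.1 * torch.randn(16, 16)
    A, B, C = torch.rand(16) + 0.5, torch.randn(16), torch.randn(16)
    print(toy_selective_ssm(x, w_delta, A, B, C).shape)  # torch.Size([2, 64, 16])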

Quick Start & Requirements

  • Installation: Clone the repository, then run pip install -e . from its root. Requires packaging, ninja, and mamba-ssm<2.0.0.
  • Prerequisites: Python >= 3.8, PyTorch >= 2.1, Torchvision >= 0.16.0. CUDA support is recommended for GPU acceleration.
  • Usage: Load pretrained models via cobra.load() and generate text from images and prompts (see the sketch after this list).
  • Resources: Pretrained models are available on the Hugging Face Hub. Training requires significant computational resources (e.g., 8 GPUs).
  • Links: Demo, Evaluation Code, Training Scripts
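
As a minimal sketch of the usage bullet above: the snippet loads a pretrained checkpoint with cobra.load() and generates text for an image. The model ID ("cobra+3b"), the prompt-builder helpers, and the generate() arguments follow the Prismatic-style API that Cobra builds on and should be treated as assumptions; consult the repository README for the exact calls.

    import torch
    from PIL import Image
    from cobra import load  # cobra.load() is documented; the details below are assumed

    # Load a pretrained checkpoint from the Hugging Face Hub.
    # "cobra+3b" is an assumed model ID; see the README for released checkpoints.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    vlm = load("cobra+3b")
    vlm.to(device, dtype=torch.float16)

    # Build a prompt and generate text conditioned on an image.
    image = Image.open("example.jpg").convert("RGB")
    prompt_builder = vlm.get_prompt_builder()  # assumed helper, mirroring Prismatic's API
    prompt_builder.add_turn(role="human", message="What is happening in this image?")
    prompt_text = prompt_builder.get_prompt()

    generated_text = vlm.generate(
        image,
        prompt_text,
        do_sample=False,
        max_new_tokens=256,
    )
    print(generated_text)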

Highlighted Details

  • Accepted to AAAI-25.
  • Demonstrates generation speed advantages over LLaVA v1.5.
  • Supports Mamba-2.8B and vision backbones like dinosiglip-vit-so-384px.
  • Provides scripts for dataset preprocessing and model training.

Maintenance & Community

The project was released in March 2024 and has since received updates, including prompt-format fixes. It builds upon established projects such as LLaVA and Hugging Face Transformers.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is relatively new, with the core code and pretrained models released in March 2024. While it promises efficiency gains, more extensive real-world benchmarks and comparisons across a wider range of tasks and hardware configurations would be beneficial.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

Created 2 years ago
Updated 1 year ago
4k stars

Top 0.1% on SourcePulse