cobra by h-zhao1997

Multimodal LLM research paper extending Mamba for efficient inference

created 1 year ago
285 stars

Top 92.8% on sourcepulse

View on GitHub
Project Summary

Cobra extends Mamba-based Large Language Models (LLMs) to the multi-modal domain, enabling efficient inference for vision-language tasks. It targets researchers and developers working with multi-modal AI, offering a Mamba-native architecture for potentially faster and more memory-efficient processing compared to Transformer-based models.

How It Works

Cobra integrates a vision encoder with a Mamba-based LLM, leveraging Mamba's selective state space model for efficient sequence processing. This approach avoids the quadratic cost of Transformer attention, which is particularly beneficial for long sequences and multi-modal inputs. The architecture is designed for efficient inference, and the work has been accepted to AAAI-25.
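The block below is a minimal conceptual sketch of this kind of vision-plus-Mamba pipeline, assuming the common LLaVA-style pattern of projecting image patch features into the LLM's embedding space and prepending them to the text tokens; the class and module names are illustrative, not the actual classes in the cobra package.

```python
# Conceptual sketch of a Cobra/LLaVA-style multi-modal forward pass.
# The names (ToyMambaVLM, vision_encoder, projector, llm) are illustrative only.
import torch
import torch.nn as nn


class ToyMambaVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a DINOv2 + SigLIP ViT backbone
        self.projector = projector            # maps patch features to the LLM embedding dim
        self.llm = llm                        # Mamba-based language model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_feats = self.vision_encoder(pixel_values)          # (B, num_patches, d_vision)
        visual_tokens = self.projector(patch_feats)              # (B, num_patches, d_llm)
        # Prepend projected visual tokens to the text embeddings; the Mamba LLM then
        # processes the fused sequence in time roughly linear in its length, rather
        # than the quadratic cost of full attention.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)   # (B, num_patches + T, d_llm)
        return self.llm(fused)
```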

Quick Start & Requirements

  • Installation: pip install -e . after cloning the repository. Requires packaging, ninja, and mamba-ssm<2.0.0.
  • Prerequisites: Python >= 3.8, PyTorch >= 2.1, Torchvision >= 0.16.0. CUDA support is recommended for GPU acceleration.
  • Usage: Load pretrained models via cobra.load() and generate text from images and prompts (see the sketch after this list).
  • Resources: Pretrained models are available on the Hugging Face Hub. Training requires significant computational resources (e.g., 8 GPUs).
  • Links: Demo, Evaluation Code, Training Scripts
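
The snippet below is a minimal usage sketch: cobra.load() and the Hugging Face Hub checkpoints are mentioned above, while the model ID ("cobra+3b"), the prompt-builder helper, and the generate() arguments are assumptions modeled on similar LLaVA-style codebases and should be verified against the repository's README.

```python
# Minimal usage sketch; cobra.load() comes from the summary above, everything else
# (model ID, prompt builder, generate() arguments) is assumed and should be checked
# against the repository's README.
import requests
import torch
from PIL import Image

from cobra import load

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pretrained Cobra VLM from the Hugging Face Hub ("cobra+3b" is a hypothetical ID).
vlm = load("cobra+3b")
vlm.to(device, dtype=torch.bfloat16)

# Fetch an image and write a prompt.
image_url = "https://example.com/cat.jpg"  # placeholder URL
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
user_prompt = "What is going on in this image?"

# Build the chat-style prompt expected by the model (assumed prompt-builder API).
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message=user_prompt)
prompt_text = prompt_builder.get_prompt()

# Generate a response conditioned on the image and prompt.
generated_text = vlm.generate(
    image,
    prompt_text,
    do_sample=True,
    temperature=0.4,
    max_new_tokens=256,
)
print(generated_text)
```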

Highlighted Details

  • Accepted to AAAI-25.
  • Demonstrates generation speed advantages over LLaVA v1.5.
  • Supports Mamba-2.8B and vision backbones like dinosiglip-vit-so-384px.
  • Provides scripts for dataset preprocessing and model training.

Maintenance & Community

The project was released in March 2024 and has since received updates such as prompt format fixes. It builds upon established projects like LLaVA and Hugging Face Transformers.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is relatively new, with the core code and pretrained models released in March 2024. While the efficiency gains are promising, more extensive real-world benchmarks and comparisons across a wider range of tasks and hardware configurations would be beneficial.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

12 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

created 2 years ago
updated 11 months ago
23k stars

Top 0.2% on sourcepulse