shikra by shikras

MLLM for referential dialogue using spatial coordinates

created 2 years ago
786 stars

Top 45.5% on sourcepulse

View on GitHub
Project Summary

Shikra is a multimodal large language model (MLLM) designed for referential dialogue, enabling precise spatial coordinate input and output in natural language without requiring additional vocabularies, position encoders, or external models. It targets researchers and developers working on visually grounded conversational AI and offers a novel approach to integrating spatial understanding into LLMs.

How It Works

Shikra builds upon the LLaMA architecture, integrating visual information to facilitate referential dialogue. Its core innovation lies in its ability to handle spatial coordinates directly within natural language interactions, allowing users to refer to specific objects or locations within an image and receive precise coordinate-based responses. This approach avoids complex pre-processing or post-detection steps, simplifying the pipeline for spatial reasoning in multimodal contexts.
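
To make this concrete: boxes are written straight into the text as numeric tuples. The exchange below is an illustrative sketch, not output from the released checkpoint; the normalized [x1,y1,x2,y2] box convention follows the paper, but the coordinate values here are invented:

    User:   What is the person at [0.32,0.18,0.55,0.87] doing?
    Shikra: The person [0.32,0.18,0.55,0.87] is throwing a frisbee [0.49,0.05,0.62,0.19].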

Quick Start & Requirements

  • Install: Clone the repository and set up a Python 3.10 environment with Conda, then install dependencies via pip install -r requirements.txt (a consolidated setup sketch follows this list).
  • Prerequisites: Requires original LLaMA weights. A GPU with at least 16GB of VRAM is recommended for the Gradio demo (8-bit quantization is an option with performance trade-offs).
  • Setup: Shikra is distributed as delta weights; download the base LLaMA weights and apply the deltas to produce a usable checkpoint.
  • Demo: Gradio web demo: python mllm/demo/webdemo.py --model_path /path/to/shikra/ckpt
  • More Info: Paper, Hugging Face
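
Taken together, a minimal end-to-end setup might look like the sketch below. The clone/conda/pip lines and the demo command restate the steps above; the delta-application script path, its flags, the Hugging Face delta ID, and the --load_in_8bit option are assumptions and should be checked against the repository README:

    # create environment and install dependencies (from the steps above)
    git clone https://github.com/shikras/shikra.git
    cd shikra
    conda create -n shikra python=3.10 -y
    conda activate shikra
    pip install -r requirements.txt

    # apply delta weights to base LLaMA (script path, flags, and delta ID below
    # are assumptions; check the repository README for the exact command)
    python mllm/models/shikra/apply_delta.py \
        --base /path/to/llama-7b \
        --target /path/to/shikra/ckpt \
        --delta shikras/shikra-7b-delta-v1

    # launch the Gradio web demo (command from this page)
    python mllm/demo/webdemo.py --model_path /path/to/shikra/ckpt

    # optional: 8-bit quantization to fit smaller GPUs (flag is an assumption;
    # expect some quality/speed trade-off)
    # python mllm/demo/webdemo.py --model_path /path/to/shikra/ckpt --load_in_8bit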

Highlighted Details

  • Enables referential dialogue with spatial coordinate input/output.
  • Achieves this without extra vocabularies, position encoders, or pre/post-detection.
  • Released as delta weights to comply with LLaMA license.
  • Supports training and inference using accelerate for distributed setups (see the sketch after this list).
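
For multi-GPU runs, training is driven through Hugging Face accelerate. A launch might look like the sketch below; the entry-point script, config name, and flags are assumptions based on the repository's mllm/ layout, so consult the README for the exact invocation:

    # multi-GPU fine-tuning via accelerate (script path, config name, and flags
    # are assumptions; see the repository README)
    accelerate launch --num_processes 4 \
        mllm/pipeline/finetune.py \
        config/shikra_pretrain_final19_stage2.py \
        --cfg-options model_args.model_name_or_path=/path/to/init/checkpoint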

Maintenance & Community

The project released code, data, and checkpoints in July 2023. It acknowledges contributions from LLaVA, Vicuna, ChatGLM-Efficient-Tuning, and GLIGEN.

Licensing & Compatibility

Shikra checkpoints are released as delta weights, so users must obtain the original LLaMA weights themselves. Compatibility is therefore tied to the LLaMA license, which restricts commercial use.

Limitations & Caveats

Using the model requires obtaining the base LLaMA weights and applying the released delta weights, an extra setup step. Performance may degrade when using 8-bit quantization.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days

Explore Similar Projects

LLaMA-Adapter by OpenGVLab
Efficient fine-tuning for instruction-following LLaMA models. 6k stars; created 2 years ago; updated 1 year ago. Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 3 more.

alpaca-lora by tloen
LoRA fine-tuning for LLaMA. 19k stars; created 2 years ago; updated 1 year ago. Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 9 more.