MLLM for referential dialogue using spatial coordinates
Shikra is a multimodal large language model (MLLM) designed for referential dialogue, enabling precise spatial coordinate input and output in natural language without requiring additional vocabularies, position encoders, or external models. It targets researchers and developers working with visually-grounded conversational AI and offers a novel approach to integrating spatial understanding into LLMs.
How It Works
Shikra builds upon the LLaMA architecture, integrating visual information to facilitate referential dialogue. Its core innovation lies in its ability to handle spatial coordinates directly within natural language interactions, allowing users to refer to specific objects or locations within an image and receive precise coordinate-based responses. This approach avoids complex pre-processing or post-detection steps, simplifying the pipeline for spatial reasoning in multimodal contexts.
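For intuition, a referential exchange might look like the sketch below, where bounding boxes are written inline as normalized [x1, y1, x2, y2] coordinates. This is an illustration of the idea only; the exact prompt template and coordinate precision used by the released checkpoints may differ.

# Illustrative only: the point is that coordinates live inside ordinary text,
# so no extra vocabulary or position encoder is needed.
user_turn = "What is the person at [0.32, 0.18, 0.55, 0.71] holding?"
model_turn = "The person is holding a red umbrella at [0.41, 0.10, 0.62, 0.35]."
print(user_turn)
print(model_turn)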
Quick Start & Requirements
pip install -r requirements.txt
python mllm/demo/webdemo.py --model_path /path/to/shikra/ckpt
Highlighted Details
Uses the accelerate library for distributed setups.
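As a rough example of what a distributed launch with accelerate could look like (the training entry point and config path below are assumptions, not taken from the repo):

# Example only: the script path and config name are placeholders; check the repo docs.
accelerate launch --num_processes 4 mllm/pipeline/finetune.py config/shikra_example.py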
Maintenance & Community
The project released code, data, and checkpoints in July 2023. It acknowledges contributions from LLaVA, Vicuna, ChatGLM-Efficient-Tuning, and GLIGEN.
Licensing & Compatibility
Shikra weights are released as delta weights, requiring users to obtain original LLaMA weights. This implies compatibility is tied to the LLaMA license, which may have restrictions on commercial use.
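As a rough sketch of what delta-weight application usually involves (the repo ships its own conversion script; the code below is a generic, hypothetical illustration rather than Shikra's actual procedure, and AutoModelForCausalLM is only a stand-in loader):

import torch
from transformers import AutoModelForCausalLM

def apply_delta(base_path, delta_path, target_path):
    # Original LLaMA weights, obtained separately under the LLaMA license.
    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    # Publicly released delta weights.
    delta = AutoModelForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16)
    base_state = base.state_dict()
    # Recover the full checkpoint by adding the base parameters to the delta.
    for name, param in delta.state_dict().items():
        if name in base_state:
            param.data += base_state[name]
    delta.save_pretrained(target_path)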
Limitations & Caveats
The model requires obtaining and applying delta weights to the base LLaMA model, adding an extra step for users. Performance may be impacted when using 8-bit quantization.
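If memory forces 8-bit loading, a generic transformers/bitsandbytes call looks like the following sketch; whether the demo script itself exposes a quantization flag is not documented here, and both the model class and the path are placeholders:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/shikra/ckpt",  # placeholder: merged (base + delta) checkpoint
    load_in_8bit=True,       # requires bitsandbytes; saves memory at some accuracy cost
    device_map="auto",
)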