MLLM for referential dialogue using spatial coordinates
Shikra is a multimodal large language model (MLLM) designed for referential dialogue, enabling precise spatial coordinate input and output in natural language without requiring additional vocabularies, position encoders, or external models. It targets researchers and developers working with visually-grounded conversational AI and offers a novel approach to integrating spatial understanding into LLMs.
How It Works
Shikra builds upon the LLaMA architecture, integrating visual information to facilitate referential dialogue. Its core innovation lies in its ability to handle spatial coordinates directly within natural language interactions, allowing users to refer to specific objects or locations within an image and receive precise coordinate-based responses. This approach avoids complex pre-processing or post-detection steps, simplifying the pipeline for spatial reasoning in multimodal contexts.
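For intuition, a referential exchange might look like the sketch below, where bounding boxes are written inline as normalized [x1, y1, x2, y2] coordinates. This is an illustration of the idea only; the exact prompt template and coordinate precision used by the released checkpoints may differ.

# Illustrative only: the point is that coordinates live inside ordinary text,
# so no extra vocabulary or position encoder is needed.
user_turn = "What is the person at [0.32, 0.18, 0.55, 0.71] holding?"
model_turn = "The person is holding a red umbrella at [0.41, 0.10, 0.62, 0.35]."
print(user_turn)
print(model_turn)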
Quick Start & Requirements
pip install -r requirements.txt
python mllm/demo/webdemo.py --model_path /path/to/shikra/ckpt
Highlighted Details
Uses the accelerate library for distributed setups.
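As a rough example of what a distributed launch with accelerate could look like (the training entry point and config path below are assumptions, not taken from the repo):

# Example only: the script path and config name are placeholders; check the repo docs.
accelerate launch --num_processes 4 mllm/pipeline/finetune.py config/shikra_example.py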
Maintenance & Community
The project released code, data, and checkpoints in July 2023. It acknowledges contributions from LLaVA, Vicuna, ChatGLM-Efficient-Tuning, and GLIGEN.
Licensing & Compatibility
Shikra weights are released as delta weights, requiring users to obtain original LLaMA weights. This implies compatibility is tied to the LLaMA license, which may have restrictions on commercial use.
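As a rough sketch of what delta-weight application usually involves (the repo ships its own conversion script; the code below is a generic, hypothetical illustration rather than Shikra's actual procedure, and AutoModelForCausalLM is only a stand-in loader):

import torch
from transformers import AutoModelForCausalLM

def apply_delta(base_path, delta_path, target_path):
    # Original LLaMA weights, obtained separately under the LLaMA license.
    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    # Publicly released delta weights.
    delta = AutoModelForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16)
    base_state = base.state_dict()
    # Recover the full checkpoint by adding the base parameters to the delta.
    for name, param in delta.state_dict().items():
        if name in base_state:
            param.data += base_state[name]
    delta.save_pretrained(target_path)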
Limitations & Caveats
The model requires obtaining and applying delta weights to the base LLaMA model, adding an extra step for users. Performance may be impacted when using 8-bit quantization.
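If memory forces 8-bit loading, a generic transformers/bitsandbytes call looks like the following sketch; whether the demo script itself exposes a quantization flag is not documented here, and both the model class and the path are placeholders:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/shikra/ckpt",  # placeholder: merged (base + delta) checkpoint
    load_in_8bit=True,       # requires bitsandbytes; saves memory at some accuracy cost
    device_map="auto",
)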