Multimodal model for Chinese visual understanding and instruction following
Visual-Chinese-LLaMA-Alpaca (VisualCLA) is a multimodal large language model for Chinese understanding and generation that integrates visual information with text. It targets researchers and developers working on Chinese NLP and computer vision who need instruction following grounded in image content.
How It Works
VisualCLA builds upon the Chinese-LLaMA-Alpaca model by incorporating a Vision Encoder (CLIP-ViT-L/14) and a Resampler module. The Vision Encoder processes images into sequential representations, which are then downsampled by the Resampler using learnable queries. These visual features are projected to the LLM's dimension and concatenated with text inputs before being processed by the LLaMA backbone. This architecture allows the model to understand and respond to multimodal instructions, trained first on image captioning and then fine-tuned on a diverse set of multimodal instruction datasets.
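To make the fusion path concrete, the sketch below shows the visual branch in PyTorch: CLIP-style patch features are compressed by a resampler with learnable queries, projected to the LLM hidden size, and concatenated with the text embeddings. The module shapes, token counts, and the single cross-attention layer are illustrative assumptions, not the repository's exact implementation.

```python
# Illustrative sketch of a VisualCLA-style fusion path; dimensions and module
# design are assumptions for clarity, not the repository's code.
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Downsamples a variable-length visual sequence to a fixed number of query tokens."""
    def __init__(self, num_queries=64, vision_dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, vision_feats):                       # (B, N_patches, vision_dim)
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.attn(q, vision_feats, vision_feats)  # queries cross-attend to patch features
        return out                                         # (B, num_queries, vision_dim)

batch, n_patches, vision_dim, llm_dim = 1, 257, 1024, 4096

vision_feats = torch.randn(batch, n_patches, vision_dim)   # stand-in for CLIP-ViT-L/14 output
resampler = Resampler(num_queries=64, vision_dim=vision_dim)
project = nn.Linear(vision_dim, llm_dim)                   # map visual tokens to the LLM's hidden size

visual_tokens = project(resampler(vision_feats))           # (1, 64, 4096)
text_embeds = torch.randn(batch, 32, llm_dim)              # stand-in for embedded instruction tokens
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # fused sequence fed to the LLaMA backbone
print(llm_inputs.shape)                                      # torch.Size([1, 96, 4096])
```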
Quick Start & Requirements
After cloning the repository, install the package with pip install -e . The model can be loaded with load_in_8bit=True to reduce GPU memory usage. An example of the 8-bit loading pattern is shown below.
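The load_in_8bit option follows the Hugging Face transformers/bitsandbytes convention; the sketch below shows that general pattern on a placeholder LLaMA checkpoint. The model path and the use of the plain transformers loader are assumptions, not VisualCLA's own loading helpers.

```python
# Hypothetical 8-bit loading example via transformers + bitsandbytes; the model
# path is a placeholder and the exact VisualCLA loader may differ.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "path/to/merged-visualcla-llm"  # placeholder, not a real checkpoint name

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    load_in_8bit=True,   # quantize weights to 8-bit at load time to cut GPU memory roughly in half
    device_map="auto",   # let accelerate place layers across available devices
)
```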
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
VisualCLA-7B-v0.1 is a test release with known limitations: hallucinations, occasional errors in instruction following, lower accuracy on fine-grained text, formulas, and tables in images, and degraded output quality in extended multi-turn conversations. No interactive online demo is currently available.