Multimodal model for Chinese visual understanding and instruction following
Visual-Chinese-LLaMA-Alpaca (VisualCLA) is a multimodal large language model for Chinese understanding and generation that integrates visual information with text. It targets researchers and developers working on Chinese NLP and computer vision who need instruction following grounded in image content.
How It Works
VisualCLA builds upon the Chinese-LLaMA-Alpaca model by incorporating a Vision Encoder (CLIP-ViT-L/14) and a Resampler module. The Vision Encoder processes images into sequential representations, which are then downsampled by the Resampler using learnable queries. These visual features are projected to the LLM's dimension and concatenated with text inputs before being processed by the LLaMA backbone. This architecture allows the model to understand and respond to multimodal instructions, trained first on image captioning and then fine-tuned on a diverse set of multimodal instruction datasets.
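To make the fusion path concrete, the sketch below shows the visual branch in PyTorch: CLIP-style patch features are compressed by a resampler with learnable queries, projected to the LLM hidden size, and concatenated with the text embeddings. The module shapes, token counts, and the single cross-attention layer are illustrative assumptions, not the repository's exact implementation.

```python
# Illustrative sketch of a VisualCLA-style fusion path; dimensions and module
# design are assumptions for clarity, not the repository's code.
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Downsamples a variable-length visual sequence to a fixed number of query tokens."""
    def __init__(self, num_queries=64, vision_dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, vision_feats):                       # (B, N_patches, vision_dim)
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.attn(q, vision_feats, vision_feats)  # queries cross-attend to patch features
        return out                                         # (B, num_queries, vision_dim)

batch, n_patches, vision_dim, llm_dim = 1, 257, 1024, 4096

vision_feats = torch.randn(batch, n_patches, vision_dim)   # stand-in for CLIP-ViT-L/14 output
resampler = Resampler(num_queries=64, vision_dim=vision_dim)
project = nn.Linear(vision_dim, llm_dim)                   # map visual tokens to the LLM's hidden size

visual_tokens = project(resampler(vision_feats))           # (1, 64, 4096)
text_embeds = torch.randn(batch, 32, llm_dim)              # stand-in for embedded instruction tokens
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # fused sequence fed to the LLaMA backbone
print(llm_inputs.shape)                                      # torch.Size([1, 96, 4096])
```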
Quick Start & Requirements
After cloning the repository, install the package with pip install -e . The model can be loaded with load_in_8bit=True to reduce GPU memory usage. An example of the 8-bit loading pattern is shown below.
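The load_in_8bit option follows the Hugging Face transformers/bitsandbytes convention; the sketch below shows that general pattern on a placeholder LLaMA checkpoint. The model path and the use of the plain transformers loader are assumptions, not VisualCLA's own loading helpers.

```python
# Hypothetical 8-bit loading example via transformers + bitsandbytes; the model
# path is a placeholder and the exact VisualCLA loader may differ.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "path/to/merged-visualcla-llm"  # placeholder, not a real checkpoint name

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    load_in_8bit=True,   # quantize weights to 8-bit at load time to cut GPU memory roughly in half
    device_map="auto",   # let accelerate place layers across available devices
)
```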
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
VisualCLA-7B-v0.1 is a test release with known limitations: hallucinations, occasional errors in instruction following, lower accuracy on fine-grained text, formulas, and tables in images, and degraded output quality in extended multi-turn conversations. No interactive online demo is currently available.