Visual-Chinese-LLaMA-Alpaca by airaria

Multimodal model for Chinese visual understanding and instruction following

created 2 years ago
451 stars

Top 67.8% on sourcepulse

View on GitHub
Project Summary

Visual-Chinese-LLaMA-Alpaca (VisualCLA) is a multimodal large language model designed for Chinese language understanding and generation, integrating visual information with text. It targets researchers and developers working with Chinese NLP and computer vision, offering enhanced capabilities for tasks requiring visual context.

How It Works

VisualCLA builds upon the Chinese-LLaMA-Alpaca model by incorporating a Vision Encoder (CLIP-ViT-L/14) and a Resampler module. The Vision Encoder processes images into sequential representations, which the Resampler then downsamples using learnable queries. These visual features are projected to the LLM's hidden dimension and concatenated with the text inputs before being processed by the LLaMA backbone. The model is first trained on image captioning and then fine-tuned on a diverse set of multimodal instruction datasets, enabling it to understand and respond to multimodal instructions.
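The data flow can be summarized in a minimal PyTorch sketch. This is an illustrative approximation rather than the project's actual code: the query count, the single cross-attention layer, and the dimensions (1024 for CLIP-ViT-L/14 features, 4096 for LLaMA-7B) are assumptions made for the example.

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Downsample a variable-length sequence of visual features into a fixed
    number of tokens by cross-attending from learnable queries.
    (Sketch only; hyperparameters are assumptions, not VisualCLA's values.)"""
    def __init__(self, vision_dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, vision_feats):                    # (B, N_patches, vision_dim)
        b = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, vision_feats, vision_feats)
        return out                                      # (B, num_queries, vision_dim)

class VisualAdapter(nn.Module):
    """Project resampled visual tokens into the LLM embedding space and
    concatenate them with text token embeddings for the LLaMA backbone."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.resampler = Resampler(vision_dim)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats, text_embeds):       # text_embeds: (B, T, llm_dim)
        visual_tokens = self.proj(self.resampler(vision_feats))
        # Visual tokens are prepended to the text sequence before the LLM.
        return torch.cat([visual_tokens, text_embeds], dim=1)
```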

Quick Start & Requirements

  • Installation: pip install -e . after cloning the repository.
  • Dependencies: requires the base models Chinese-Alpaca-Plus 7B (HF format) and CLIP-ViT-L/14.
  • Setup: Model weights are provided as incremental LoRA weights, requiring merging with base models. A Colab notebook is available for easy setup and inference.
  • Resources: Merging models requires ~20GB of RAM. Inference can be run with load_in_8bit=True (see the sketch after this list).
  • Links: Colab Notebook, Model Hub
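As a concrete illustration of the 8-bit loading noted above, the snippet below shows the standard Hugging Face transformers pattern applied to a merged LLaMA-style checkpoint. The paths are placeholders and the loading entry point is an assumption; it omits VisualCLA's visual components, so the repository's own inference scripts (which handle the vision encoder, resampler, and LoRA weights) should be followed for real use.

```python
# Hedged sketch: load a merged LLaMA-style checkpoint in 8-bit.
# Paths are hypothetical; this does not load VisualCLA's visual modules.
from transformers import LlamaForCausalLM, LlamaTokenizer

MERGED_DIR = "path/to/merged-visualcla-7b"  # output of the LoRA merge step

tokenizer = LlamaTokenizer.from_pretrained(MERGED_DIR)
model = LlamaForCausalLM.from_pretrained(
    MERGED_DIR,
    load_in_8bit=True,   # requires bitsandbytes; roughly halves GPU memory vs fp16
    device_map="auto",
)
```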

Highlighted Details

  • Based on Chinese-LLaMA-Alpaca, enhanced with multimodal capabilities.
  • Provides inference code and deployment scripts for Gradio/Text-Generation-WebUI.
  • Includes translated Chinese versions of LLaVA and OwlEval test sets.
  • Offers incremental weights for LLaMA and CLIP, adhering to LLaMA's non-commercial license.

Maintenance & Community

  • The project is maintained by individuals and collaborators in their spare time.
  • No specific community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The project itself appears to be under a permissive license, but it explicitly notes that the LLaMA models may not be used commercially.
  • Users must adhere to the licenses of the base models (LLaMA, CLIP) and any third-party code used.

Limitations & Caveats

VisualCLA-7B-v0.1 is a test version with known limitations: hallucinations, occasional errors in instruction following, lower accuracy on fine-grained text, formulas, and tables in images, and degraded output quality in extended multi-turn conversations. There is no interactive online demo.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 90 days
