Visual-Chinese-LLaMA-Alpaca by airaria

Multimodal model for Chinese visual understanding and instruction following

Created 2 years ago
451 stars

Top 66.8% on SourcePulse

View on GitHub
Project Summary

Visual-Chinese-LLaMA-Alpaca (VisualCLA) is a multimodal large language model designed for Chinese language understanding and generation, integrating visual information with text. It targets researchers and developers working with Chinese NLP and computer vision, offering enhanced capabilities for tasks requiring visual context.

How It Works

VisualCLA builds upon the Chinese-LLaMA-Alpaca model by incorporating a Vision Encoder (CLIP-ViT-L/14) and a Resampler module. The Vision Encoder converts an image into a sequence of patch representations, which the Resampler downsamples using a fixed set of learnable queries. These visual features are projected to the LLM's hidden dimension and concatenated with the text embeddings before being processed by the LLaMA backbone, allowing the model to understand and respond to multimodal instructions. Training proceeds in two stages: pre-training on image captioning, followed by fine-tuning on a diverse set of multimodal instruction datasets.
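
As a rough illustration, the PyTorch-style sketch below wires these pieces together in the way described above. The module names, dimensions, number of learnable queries, and the cross-attention design of the resampler are illustrative assumptions rather than the project's actual implementation; the vision encoder and LLaMA backbone are assumed to be Hugging Face-style modules, and attention masks and the prompt template are omitted for brevity.

    import torch
    import torch.nn as nn

    class Resampler(nn.Module):
        # Downsamples the vision encoder's patch sequence with a fixed set of
        # learnable queries via cross-attention (assumed design; details may differ).
        def __init__(self, vision_dim=1024, num_queries=64, num_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
            self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

        def forward(self, image_feats):  # (batch, n_patches, vision_dim)
            q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
            out, _ = self.cross_attn(q, image_feats, image_feats)
            return out  # (batch, num_queries, vision_dim)

    class VisualCLASketch(nn.Module):
        # Projects resampled visual features to the LLM dimension and prepends
        # them to the text embeddings fed to the LLaMA backbone.
        def __init__(self, vision_encoder, llama, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.vision_encoder = vision_encoder  # e.g. a CLIP-ViT-L/14 feature extractor
            self.resampler = Resampler(vision_dim)
            self.proj = nn.Linear(vision_dim, llm_dim)
            self.llama = llama  # assumed Hugging Face causal LM (Chinese-Alpaca-Plus 7B)

        def forward(self, pixel_values, input_ids):
            image_feats = self.vision_encoder(pixel_values)          # (batch, n_patches, vision_dim)
            visual_tokens = self.proj(self.resampler(image_feats))   # (batch, num_queries, llm_dim)
            text_embeds = self.llama.get_input_embeddings()(input_ids)
            inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
            return self.llama(inputs_embeds=inputs_embeds)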

Quick Start & Requirements

  • Installation: pip install -e . after cloning the repository.
  • Dependencies: Requires the base models Chinese-Alpaca-Plus 7B (in Hugging Face format) and CLIP-ViT-L/14.
  • Setup: Model weights are provided as incremental LoRA weights, requiring merging with base models. A Colab notebook is available for easy setup and inference.
  • Resources: Merging the models requires ~20GB of RAM. Inference can be run with load_in_8bit=True to reduce GPU memory usage (see the sketch after this list).
  • Links: Colab Notebook, Model Hub
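
For the load_in_8bit path mentioned above, the sketch below loads a merged text backbone in 8-bit using the standard transformers API. The model path is a placeholder, the snippet covers only the text side (the project's own inference scripts handle the vision encoder and resampler), and it assumes bitsandbytes and accelerate are installed.

    from transformers import LlamaForCausalLM, LlamaTokenizer

    MERGED_MODEL_DIR = "path/to/merged-visualcla-7b"  # placeholder: output of the LoRA merge step

    tokenizer = LlamaTokenizer.from_pretrained(MERGED_MODEL_DIR)
    model = LlamaForCausalLM.from_pretrained(
        MERGED_MODEL_DIR,
        load_in_8bit=True,   # 8-bit weights via bitsandbytes to cut GPU memory use
        device_map="auto",   # requires accelerate
    )

    prompt = "这张图片里有什么？"  # "What is in this image?" (text-only placeholder prompt)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))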

Highlighted Details

  • Based on Chinese-LLaMA-Alpaca, enhanced with multimodal capabilities.
  • Provides inference code and deployment scripts for Gradio/Text-Generation-WebUI.
  • Includes translated Chinese versions of LLaVA and OwlEval test sets.
  • Offers incremental weights for LLaMA and CLIP, adhering to LLaMA's non-commercial license.

Maintenance & Community

  • The project is maintained by individuals and collaborators in their spare time.
  • No specific community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The project itself appears to be under a permissive license, but explicitly states that LLaMA models are prohibited from commercial use.
  • Users must adhere to the licenses of the base models (LLaMA, CLIP) and any third-party code used.

Limitations & Caveats

VisualCLA-7B-v0.1 is a test version with known limitations: hallucination, occasional errors in instruction following, lower accuracy on fine-grained text, formulas, and tables within images, and degraded output quality during extended multi-turn conversations. No interactive online demo is currently available.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

gill by kohjingyu

  • 463 stars
  • Multimodal LLM for generating/retrieving images and generating text
  • Created 2 years ago, updated 1 year ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind)

DeepSeek-VL2 by deepseek-ai

  • 5k stars
  • MoE vision-language model for multimodal understanding
  • Created 9 months ago, updated 6 months ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI)