NExT-Chat by NExT-ChatV

Multimodal LLM for integrated vision tasks

Created 2 years ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
Project Summary

NExT-Chat is a Large Multimodal Model (LMM) designed to integrate conversational AI with visual understanding, specifically object detection and segmentation. It targets researchers and developers seeking to build multimodal applications capable of not only chatting but also precisely locating and segmenting objects within images, offering enhanced visual grounding for AI interactions.

How It Works

NExT-Chat functions as an LMM by combining a language model with visual encoders. It leverages OpenAI's CLIP ViT for visual feature extraction and the Segment Anything Model (SAM) for segmentation tasks. The framework employs a multi-stage training approach, encompassing VL+Detection Pre-training, VL+Detection Instruction Following, and VL+Detection+Segmentation, allowing for progressive integration of capabilities.
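The progressive three-stage schedule described above can be sketched as a simple configuration in plain Python. The stage names come from the README; which modules each stage trains and what capability it adds are illustrative assumptions, not the project's actual hyperparameters:

```python
# Hypothetical sketch of NExT-Chat's three-stage training schedule.
# Stage names are from the README; "trains" entries are assumptions.
STAGES = [
    {
        "name": "VL+Detection Pre-training",
        "trains": ["projector", "box_decoder"],          # assumed
        "adds_capability": "object localization",
    },
    {
        "name": "VL+Detection Instruction Following",
        "trains": ["llm", "projector", "box_decoder"],   # assumed
        "adds_capability": "grounded chat",
    },
    {
        "name": "VL+Detection+Segmentation",
        "trains": ["sam_prompt_head"],  # assumed: SAM prompted by LLM outputs
        "adds_capability": "segmentation masks",
    },
]

def run_pipeline(stages):
    """Walk the stages in order, accumulating capabilities progressively."""
    capabilities = []
    for stage in stages:
        capabilities.append(stage["adds_capability"])
        print(f"{stage['name']}: now supports {', '.join(capabilities)}")
    return capabilities

run_pipeline(STAGES)
```

The point of the staged design is that each phase builds on the frozen or pre-trained result of the previous one, so capabilities accumulate rather than being learned jointly from scratch.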

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/NExT-ChatV/NExT-Chat.git), navigate into the directory, and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Requires downloading and configuring paths for OpenAI CLIP ViT models (e.g., openai-clip-vit-large-patch14-336) and the SAM model. GPU acceleration is essential.
  • Hardware: GPU memory requirements range from approximately 24GB for nextchat-7b-224 to 32GB for nextchat-7b-336 and 35GB for nextchat-13b-224.
  • Resources: A project page with a demo is mentioned, and model weights are available on Huggingface.
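Putting the installation bullets above together, a typical setup might look like this. The repository URL and requirements file are from the README; the weight paths are placeholders you would point at your downloaded CLIP and SAM checkpoints, and the environment-variable names are illustrative only (the project configures these paths its own way):

```shell
# Clone the repository and install dependencies (commands from the README)
git clone https://github.com/NExT-ChatV/NExT-Chat.git
cd NExT-Chat
pip install -r requirements.txt

# The README says CLIP ViT and SAM weights must be downloaded separately and
# their paths configured. These variables are illustrative placeholders, not
# the project's actual configuration mechanism.
export CLIP_PATH=/path/to/openai-clip-vit-large-patch14-336
export SAM_PATH=/path/to/sam_checkpoint.pth
```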

Highlighted Details

  • Model Zoo: Offers 7B and 13B parameter models with varying ViT resolutions (224x224, 336x336), recommending the nextchat-7b-336-v1 for superior performance.
  • Capabilities: Supports advanced chat functionalities including object localization (e.g., "Where is XXX in the image?"), grounded captioning, and VQA with localization.
  • Training: Features a three-stage training pipeline and supports DeepSpeed for distributed training.
  • Evaluation: Includes evaluation scripts for tasks like Referring Expression Segmentation (RES), Referring Expression Comprehension (REC), Pope (Image-level Hallucination), and RefCOCOg (Region Caption).
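As a rough aid for picking a variant, the memory figures listed under Hardware can be turned into a small lookup. The numbers are taken from the README's stated requirements; the helper function itself is not part of the project:

```python
# Approximate GPU memory needed per model variant (GB), per the README.
GPU_MEM_GB = {
    "nextchat-7b-224": 24,
    "nextchat-7b-336": 32,
    "nextchat-13b-224": 35,
}

def variants_that_fit(available_gb):
    """Return model variants whose stated memory need fits the given GPU,
    smallest requirement first."""
    return sorted(
        (name for name, need in GPU_MEM_GB.items() if need <= available_gb),
        key=GPU_MEM_GB.get,
    )

# A 24 GB card fits only the smallest variant:
print(variants_that_fit(24))   # ['nextchat-7b-224']
```

Note that the README recommends nextchat-7b-336-v1 for quality, so the 32 GB tier is the practical sweet spot if your hardware allows it.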

Maintenance & Community

The initial code was released in December 2023. No specific community channels (like Discord or Slack) or details on maintainers/sponsorships are provided in the README.

Licensing & Compatibility

The provided README does not specify a software license. Compatibility for commercial use or integration with closed-source projects is undetermined without a license.

Limitations & Caveats

The project notes that its current implementation struggles to outperform top-tier pixel2seq models on Referring Expression Comprehension (REC) tasks in the pre-training setting, with ongoing research into this area. Older v0 model versions are explicitly marked as "not recommended" compared to newer iterations. The setup requires careful configuration of external model paths (CLIP, SAM).

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

Explore Similar Projects

  • open_flamingo by mlfoundations: Open-source framework for training large multimodal models. 4k stars; created 3 years ago, updated 1 year ago.