Open-source research release of an omni-modal LLM
Baichuan-Omni is an open-source omni-modal Large Language Model (LLM) designed to process and understand text, image, audio, and video concurrently, offering an advanced multimodal interactive experience. It aims to provide a high-performing, accessible alternative to proprietary models like GPT-4o for researchers and developers in the multimodal AI space.
How It Works
The model employs a two-phase training scheme. Phase 1 is Multimodal Alignment Pretraining, which integrates Image-Language, Video-Language, and Audio-Language branches: a visual encoder handles image and video inputs, while Whisper-large-v3's audio encoder is paired with a novel convolutional-gated MLP projector for audio. Phase 2 is Multimodal Supervised Fine-Tuning on over 600K multimodal instruction-following pairs spanning text, image, video, and audio, strengthening complex task execution and cross-modal understanding.
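The report names a convolutional-gated MLP projector for mapping audio-encoder features into the LLM's embedding space, but the exact architecture is not spelled out in the README. Below is a minimal PyTorch sketch of what such a projector could look like; all layer names, dimensions, and the downsampling choice are illustrative assumptions, not the released design.

```python
import torch
import torch.nn as nn

class ConvGatedMLPProjector(nn.Module):
    """Hypothetical convolutional-gated MLP projector: maps audio-encoder
    features (e.g., Whisper-large-v3 outputs) to LLM embedding tokens.
    Sizes and structure are assumptions for illustration only."""

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096, hidden_dim: int = 4096):
        super().__init__()
        # Depthwise 1D conv over the time axis mixes local acoustic context
        # and halves the frame rate (stride=2).
        self.conv = nn.Conv1d(audio_dim, audio_dim, kernel_size=3,
                              stride=2, padding=1, groups=audio_dim)
        # Gated MLP: one branch produces values, the other a sigmoid gate.
        self.value_proj = nn.Linear(audio_dim, hidden_dim)
        self.gate_proj = nn.Linear(audio_dim, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) from the audio encoder
        x = self.conv(audio_feats.transpose(1, 2)).transpose(1, 2)
        x = self.value_proj(x) * torch.sigmoid(self.gate_proj(x))
        return self.out_proj(x)  # (batch, time', llm_dim) tokens for the LLM

# Example: project Whisper-style 1280-dim features into a 4096-dim LLM space.
feats = torch.randn(1, 100, 1280)
print(ConvGatedMLPProjector()(feats).shape)  # torch.Size([1, 50, 4096])
```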
Quick Start & Requirements
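The README points to Hugging Face for checkpoints but does not include loading code here. The snippet below is a hypothetical usage sketch based on the standard `transformers` loading pattern; the repository id, dtype, and generation settings are placeholders, so consult the actual model card before use.

```python
# Hypothetical quick-start sketch; the model id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan-Omni"  # assumption: check Hugging Face for the real id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Describe what you hear and see in this clip.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```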
Highlighted Details
Maintenance & Community
The project is associated with westlake-baichuan-mllm and Baichuan Inc. The README encourages stars and citations of the technical report, and links to Hugging Face for checkpoints and papers. No specific community channels (Discord, Slack) or roadmap are mentioned. According to the listing, the last update was about six months ago and the repository is marked inactive.
Licensing & Compatibility
The README does not explicitly state the license type or any restrictions for commercial use or closed-source linking.
Limitations & Caveats
Demo videos are marked as "coming soon," indicating that interactive demonstrations are not yet available. Detailed requirements for setup and inference, such as specific hardware or software dependencies, are not provided in the README.