LongCat-Flash-Omni by meituan-longcat

Omni-modal AI for real-time audio-visual interaction

Created 1 month ago
425 stars

Top 69.4% on SourcePulse

View on GitHub
Project Summary

Summary

LongCat-Flash-Omni is a 560B parameter (27B activated) open-source omni-modal model designed for state-of-the-art real-time audio-visual interaction. It integrates comprehensive multimodal understanding with low-latency audio processing, benefiting researchers and developers in multimodal AI.

How It Works

The model utilizes a Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, augmented by efficient multimodal perception and speech reconstruction modules. A curriculum-inspired progressive training strategy and an early-fusion paradigm ensure strong omni-modal capabilities without unimodal degradation. Modality-Decoupled Parallelism enhances training efficiency for large-scale multimodal tasks.
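
For intuition, below is a minimal, illustrative sketch of an MoE layer that mixes standard feed-forward experts with zero-computation (identity) experts under top-k routing. This is not the model's actual implementation: the layer sizes, expert counts, and routing details (d_model, d_hidden, n_ffn, n_zero, top_k) are placeholder assumptions, and the shortcut connection used to overlap communication with computation is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputeExpert(nn.Module):
    """Identity expert: a token routed here passes through with no FLOPs spent."""
    def forward(self, x):
        return x

class FFNExpert(nn.Module):
    """Standard feed-forward expert."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
    def forward(self, x):
        return self.net(x)

class MoEWithZeroExperts(nn.Module):
    """Top-k router over a mix of FFN experts and zero-computation experts
    (placeholder sizes; not the LongCat-Flash-Omni implementation)."""
    def __init__(self, d_model=1024, d_hidden=4096, n_ffn=8, n_zero=2, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [FFNExpert(d_model, d_hidden) for _ in range(n_ffn)]
            + [ZeroComputeExpert() for _ in range(n_zero)])
        self.router = nn.Linear(d_model, len(self.experts))
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)       # per-token expert scores
        weights, idx = gates.topk(self.top_k, dim=-1)   # choose top-k experts per token
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():                          # only activated experts run
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Tokens routed to an identity expert consume no expert compute, which is one way a sparse model can spend its activated-parameter budget only where it helps.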

Quick Start & Requirements

  • Installation: Clone and install the SGLang development branch (https://github.com/XiaoBin1992/sglang.git), then clone the demo repository (https://github.com/meituan-longcat/LongCat-Flash-Omni) and install its requirements.
  • Prerequisites: Python >= 3.10.0, PyTorch >= 2.8, CUDA >= 12.9.
  • Hardware: Minimum 8×H20-141G GPUs for FP8 weights, or 16×H800-80G GPUs for BF16 weights.
  • Resources: Model weights are available on the Hugging Face Hub (meituan-longcat/LongCat-Flash-Omni) and can be downloaded via the web page or huggingface-cli (see the sketch after this list). Demo available at https://github.com/meituan-longcat/LongCat-Flash-Omni. Web interaction at https://longcat.ai.
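
The weights can also be fetched programmatically with the huggingface_hub Python API. The snippet below is a minimal sketch: the repo id comes from the listing above, while the local directory name is a placeholder.

```python
from huggingface_hub import snapshot_download

# Download the published weights repo; local_dir is a hypothetical target path.
local_dir = snapshot_download(
    repo_id="meituan-longcat/LongCat-Flash-Omni",
    local_dir="./LongCat-Flash-Omni-weights",
)
print("Weights downloaded to:", local_dir)
```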

Highlighted Details

  • SOTA Omni-Modal Performance: Achieves state-of-the-art cross-modal comprehension, with strong benchmark scores (e.g., OmniBench 61.38, MMBench-EN 87.5) competitive with leading models.
  • Low-Latency Real-time Interaction: Features a 128K token context window and efficient audio-visual processing for high-quality streaming speech generation.
  • Advanced Audio Capabilities: Demonstrates strong Automatic Speech Recognition (ASR) performance (e.g., LibriSpeech CER/WER 1.57/4.01) and robust audio understanding.
  • Efficient Training Infrastructure: Employs Modality-Decoupled Parallelism for efficient large-scale multimodal training.

Maintenance & Community

  • Contact: longcat-team@meituan.com
  • Community: WeChat Group.
  • Sponsorship: Supported by Meituan.

Licensing & Compatibility

  • License: MIT License for model weights and contributions.
  • Restrictions: Does not grant rights to Meituan trademarks or patents.

Limitations & Caveats

The model is not exhaustively evaluated for all downstream applications. Developers must consider LLM limitations (accuracy, safety, fairness) and comply with relevant laws and regulations. The web version currently supports only audio interaction, and the iOS app is limited to the Chinese App Store.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 10
  • Star History: 421 stars in the last 30 days

