meituan-longcat/LongCat-Flash-Omni: Omni-modal AI for real-time audio-visual interaction
Top 69.4% on SourcePulse
Summary
LongCat-Flash-Omni is a 560B parameter (27B activated) open-source omni-modal model designed for state-of-the-art real-time audio-visual interaction. It integrates comprehensive multimodal understanding with low-latency audio processing, benefiting researchers and developers in multimodal AI.
How It Works
The model utilizes a Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, augmented by efficient multimodal perception and speech reconstruction modules. A curriculum-inspired progressive training strategy and an early-fusion paradigm ensure strong omni-modal capabilities without unimodal degradation. Modality-Decoupled Parallelism enhances training efficiency for large-scale multimodal tasks.
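As a rough illustration of the zero-computation-expert idea, the sketch below (plain PyTorch, not the LongCat implementation) routes each token to a single expert, where some experts are ordinary feed-forward networks and others simply pass tokens through unchanged, with a residual shortcut around the block. Expert counts, dimensions, and top-1 routing are illustrative assumptions; the actual Shortcut-connected MoE design is more involved.

```python
# Conceptual sketch only: MoE layer with "zero-computation" experts.
# Sizes, expert counts, and routing are assumptions, not LongCat's real config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    """A standard feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ZeroComputeExpert(nn.Module):
    """A zero-computation expert: returns tokens unchanged, so routing
    'easy' tokens here spends essentially no FLOPs."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_ffn_experts=6, n_zero_experts=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [FFNExpert(d_model, d_hidden) for _ in range(n_ffn_experts)]
            + [ZeroComputeExpert() for _ in range(n_zero_experts)]
        )
        self.router = nn.Linear(d_model, len(self.experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); send each token to its top-1 expert.
        scores = F.softmax(self.router(x), dim=-1)
        top_score, top_idx = scores.max(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_score[mask].unsqueeze(-1) * expert(x[mask])
        # Residual shortcut around the MoE block.
        return x + out

tokens = torch.randn(16, 512)
print(MoELayer()(tokens).shape)  # torch.Size([16, 512])
```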
Quick Start & Requirements
Setup involves cloning the SGLang fork at https://github.com/XiaoBin1992/sglang.git and installing it, followed by cloning the demo repository (https://github.com/meituan-longcat/LongCat-Flash-Omni) and installing its requirements. Model weights are hosted on Hugging Face (meituan-longcat/LongCat-Flash-Omni) and can be downloaded with huggingface-cli. A demo is available at https://github.com/meituan-longcat/LongCat-Flash-Omni, and web interaction at https://longcat.ai.
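As a sketch, the weights can also be fetched programmatically with the huggingface_hub Python API instead of huggingface-cli; the local target directory below is an assumption.

```python
# Minimal download sketch using the huggingface_hub Python API.
# The local_dir path is illustrative, not mandated by the project.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meituan-longcat/LongCat-Flash-Omni",
    local_dir="./LongCat-Flash-Omni-weights",
)
```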
Maintenance & Community
Contact: longcat-team@meituan.com
Limitations & Caveats
The model is not exhaustively evaluated for all downstream applications. Developers must consider LLM limitations (accuracy, safety, fairness) and comply with relevant laws and regulations. The web version currently supports only audio interaction, and the iOS app is limited to the Chinese App Store.