meituan-longcat: a native multimodal model processing text, vision, and audio
Top 74.6% on SourcePulse
LongCat-Next is an A3B-sized multimodal model that processes text, vision, and audio under a single autoregressive objective, aiming to overcome the barriers to native multimodality by treating vision and audio as extensions of language. It offers a unified solution for multimodal understanding and generation, targeting researchers and engineers who need industrial-strength performance in a discrete framework.
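The "single autoregressive objective" means text, visual, and audio tokens share one vocabulary and one next-token loss. A minimal NumPy sketch of that idea, using made-up token ids and a dummy uniform model (none of these values come from LongCat-Next):

```python
import numpy as np

def next_token_nll(logits, targets):
    """Average negative log-likelihood of target tokens under per-step logits.
    The loss is the same whether a target token is text, vision, or audio."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

vocab = 300  # illustrative shared discrete vocab: text, visual "words", audio tokens
sequence = np.array([5, 17, 101, 150, 203, 42])  # toy mixed-modality token ids
logits = np.zeros((len(sequence) - 1, vocab))    # dummy model: uniform prediction
loss = next_token_nll(logits, sequence[1:])      # equals log(vocab) for a uniform model
```

The point of the sketch is that nothing in the loss distinguishes modalities; once every modality is discretized into the shared vocabulary, ordinary language-model training applies unchanged.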
How It Works
LongCat-Next introduces the Discrete Native Autoregression Paradigm (DiNA), extending next-token prediction to diverse modalities within a shared discrete token space. It employs Semantic-and-Aligned Encoders (SAE) with Residual Vector Quantization (RVQ) for semantically complete discrete visual representations, preserving both abstraction and detail. The Discrete Native-Resolution Vision Transformer (dNaViT) acts as a flexible, unified discrete interface for vision, extracting "visual words" that integrate seamlessly with large language models. This approach simplifies multimodal modeling, leverages existing LLM training infrastructure, and unifies understanding and generation tasks without performance compromise.
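The RVQ step can be sketched as follows: each stage quantizes the residual left by the previous stage, so the code sequence preserves both coarse abstraction and fine detail. A minimal NumPy illustration with toy random codebooks (not the model's actual encoder):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: quantize x with successive codebooks,
    each stage encoding the residual left by the previous stage."""
    codes, residual = [], x.astype(float)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # nearest codeword
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def rvq_decode(codes, codebooks):
    # reconstruction = sum of the selected codewords across stages
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 stages, 8 codes each
x = rng.normal(size=4)
codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
# by construction, reconstruction plus the final residual recovers x exactly
```

Because each stage refines the previous one, deeper RVQ stacks trade longer discrete codes for lower reconstruction error, which is how a discrete token space can stay semantically complete.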
Quick Start & Requirements
Requires ffmpeg<7 and soundfile==0.13.1. To set up the environment:

conda env create -f environment.yml -v
pip install -r requirements.txt && pip install -r requirements-post.txt --no-build-isolation

Paper: https://arxiv.org/abs/2603.27538. Deployment support is available via meituan-longcat/LongCat-Next-inference.
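Before running setup, it can help to verify the pinned dependencies. The helper below is a hypothetical pre-flight check, not part of the repository:

```python
import shutil
from importlib.metadata import version, PackageNotFoundError

def check_pins(pins):
    """Return {package: status} for exact-version pins like
    {'soundfile': '0.13.1'}, without raising on missing packages."""
    report = {}
    for pkg, wanted in pins.items():
        try:
            got = version(pkg)
            report[pkg] = "ok" if got == wanted else f"mismatch ({got})"
        except PackageNotFoundError:
            report[pkg] = "missing"
    return report

print(check_pins({"soundfile": "0.13.1"}))
print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)  # repo requires ffmpeg<7
```

Note that the ffmpeg pin (a system binary) cannot be checked via importlib.metadata; the PATH check above only confirms presence, not version.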
Maintenance & Community
Contact is available via longcat-team@meituan.com or by opening an issue. A WeChat Group is also mentioned for community interaction.
Licensing & Compatibility
The model weights and source code are released under the MIT License. This license is permissive for commercial use and closed-source linking but does not grant rights to use Meituan trademarks or patents.
Limitations & Caveats
The model has not been exhaustively evaluated for all potential downstream applications. Users should be aware of general large language model limitations, including performance variations across languages, and must independently assess accuracy, safety, and fairness before deployment in sensitive contexts. Compliance with all applicable laws and regulations is the responsibility of the developer and downstream user.