LongCat-Next by meituan-longcat

Native multimodal model processing text, vision, and audio

Created 2 weeks ago

382 stars

Top 74.6% on SourcePulse

View on GitHub

Project Summary

LongCat-Next is an A3B-sized multimodal model that processes text, vision, and audio under a single autoregressive objective, aiming to overcome the barriers to native multimodality by treating vision and audio as extensions of language. It offers a unified solution for multimodal understanding and generation, targeting researchers and engineers who want industrial-strength performance in a discrete framework.

How It Works

LongCat-Next introduces the Discrete Native Autoregression Paradigm (DiNA), extending next-token prediction to diverse modalities within a shared discrete token space. It employs Semantic-and-Aligned Encoders (SAE) with Residual Vector Quantization (RVQ) for semantically complete discrete visual representations, preserving both abstraction and detail. The Discrete Native-Resolution Vision Transformer (dNaViT) acts as a flexible, unified discrete interface for vision, extracting "visual words" that integrate seamlessly with large language models. This approach simplifies multimodal modeling, leverages existing LLM training infrastructure, and unifies understanding and generation tasks without performance compromise.
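The RVQ step used by the SAE encoders can be illustrated with a toy sketch: each stage quantizes the residual left by the previous stage, so early codes capture coarse structure and later codes refine detail. This is a minimal, generic RVQ illustration, not LongCat-Next's actual encoder; the codebook shapes, the Euclidean distance metric, and the function names are assumptions for the example.

```python
from math import dist  # Euclidean distance (stdlib, Python 3.8+)

def rvq_encode(x, codebooks):
    """Encode vector x into one discrete code index per RVQ stage.

    Each stage picks the code nearest to the *residual* left by the
    previous stage, so early stages capture coarse structure and later
    stages add finer detail.
    """
    residual = list(x)
    codes = []
    for cb in codebooks:  # cb is a list of candidate code vectors
        idx = min(range(len(cb)), key=lambda i: dist(cb[i], residual))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected code vector from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[i])]
    return out
```

Because later stages only see what earlier stages failed to capture, a short sequence of small codebooks can approximate a vector far more finely than a single codebook of the same total size, which is what lets a discrete token space preserve "both abstraction and detail."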

Quick Start & Requirements

  • Hardware: Minimum 3 GPUs with 80GB VRAM each (e.g., NVIDIA H100/A100 80GB).
  • Software: Python >= 3.10, Torch >= 2.6, Transformers >= 4.57.6, Accelerate >= 1.10.0. Requires ffmpeg<7 and soundfile==0.13.1.
  • Installation:
    1. conda env create -f environment.yml -v
    2. pip install -r requirements.txt && pip install -r requirements-post.txt --no-build-isolation
  • Links: Technical Report: https://arxiv.org/abs/2603.27538. Deployment support available via meituan-longcat/LongCat-Next-inference.
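The version floors listed above can be checked programmatically before installing. The sketch below uses only the standard library; the helper names, the naive tuple-based version comparison (which ignores pre-release tags like `rc1`), and the selection of packages are illustrative assumptions, not part of the project's tooling.

```python
from importlib.metadata import version, PackageNotFoundError

# Minimum versions mirroring the requirements listed above.
FLOORS = {"torch": "2.6", "transformers": "4.57.6", "accelerate": "1.10.0"}

def parse(v: str) -> tuple:
    # Naive numeric parse; does not handle pre-release tags like "2.6.0rc1".
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def meets_floor(installed: str, minimum: str) -> bool:
    """True if the installed version string satisfies the minimum."""
    return parse(installed) >= parse(minimum)

def check_environment() -> dict:
    """Return {package: True/False}; None means the package is not installed."""
    results = {}
    for pkg, floor in FLOORS.items():
        try:
            results[pkg] = meets_floor(version(pkg), floor)
        except PackageNotFoundError:
            results[pkg] = None
    return results
```

Running `check_environment()` before the conda/pip steps gives a quick report of which floors are already satisfied.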

Highlighted Details

  • Achieves industrial-strength performance within discrete frameworks.
  • Surpasses previous discrete vision modeling performance ceilings on understanding tasks.
  • Excels across seeing, creating, and talking capabilities within a unified model.
  • Demonstrates strong generative quality, even with a 28x compression ratio for text rendering.
  • Offers competitive performance in advanced speech comprehension, low-latency voice conversation, and customizable voice cloning.

Maintenance & Community

Contact is available via longcat-team@meituan.com or by opening an issue. A WeChat group is also available for community discussion.

Licensing & Compatibility

The model weights and source code are released under the MIT License. This license is permissive for commercial use and closed-source linking but does not grant rights to use Meituan trademarks or patents.

Limitations & Caveats

The model has not been exhaustively evaluated for all potential downstream applications. Users should be aware of general large language model limitations, including performance variations across languages, and must independently assess accuracy, safety, and fairness before deployment in sensitive contexts. Compliance with all applicable laws and regulations is the responsibility of the developer and downstream user.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 9
  • Star History: 384 stars in the last 17 days
