LongCat-Next by meituan-longcat

Native multimodal model processing text, vision, and audio

Created 2 weeks ago

382 stars

Top 74.6% on SourcePulse

View on GitHub

Project Summary

LongCat-Next is an A3B-sized multimodal model that processes text, vision, and audio under a single autoregressive objective, aiming to overcome the barriers to native multimodality by treating vision and audio as extensions of language. It offers a unified solution for multimodal understanding and generation, targeting researchers and engineers who want industrial-strength performance in a discrete framework.

How It Works

LongCat-Next introduces the Discrete Native Autoregression Paradigm (DiNA), extending next-token prediction to diverse modalities within a shared discrete token space. It employs Semantic-and-Aligned Encoders (SAE) with Residual Vector Quantization (RVQ) for semantically complete discrete visual representations, preserving both abstraction and detail. The Discrete Native-Resolution Vision Transformer (dNaViT) acts as a flexible, unified discrete interface for vision, extracting "visual words" that integrate seamlessly with large language models. This approach simplifies multimodal modeling, leverages existing LLM training infrastructure, and unifies understanding and generation tasks without performance compromise.
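The RVQ step used by the SAE encoders can be illustrated with a toy sketch: each stage quantizes the residual left by the previous stage, so early codes capture coarse structure and later codes refine detail. This is a minimal, generic RVQ illustration, not LongCat-Next's actual encoder; the codebook shapes, the Euclidean distance metric, and the function names are assumptions for the example.

```python
from math import dist  # Euclidean distance (stdlib, Python 3.8+)

def rvq_encode(x, codebooks):
    """Encode vector x into one discrete code index per RVQ stage.

    Each stage picks the code nearest to the *residual* left by the
    previous stage, so early stages capture coarse structure and later
    stages add finer detail.
    """
    residual = list(x)
    codes = []
    for cb in codebooks:  # cb is a list of candidate code vectors
        idx = min(range(len(cb)), key=lambda i: dist(cb[i], residual))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected code vector from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[i])]
    return out
```

Because later stages only see what earlier stages failed to capture, a short sequence of small codebooks can approximate a vector far more finely than a single codebook of the same total size, which is what lets a discrete token space preserve "both abstraction and detail."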

Quick Start & Requirements

  • Hardware: Minimum 3 GPUs with 80GB VRAM each (e.g., NVIDIA H100/A100 80GB).
  • Software: Python >= 3.10, Torch >= 2.6, Transformers >= 4.57.6, Accelerate >= 1.10.0. Requires ffmpeg<7 and soundfile==0.13.1.
  • Installation:
    1. conda env create -f environment.yml -v
    2. pip install -r requirements.txt && pip install -r requirements-post.txt --no-build-isolation
  • Links: Technical Report: https://arxiv.org/abs/2603.27538. Deployment support available via meituan-longcat/LongCat-Next-inference.
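The version floors listed above can be checked programmatically before installing. The sketch below uses only the standard library; the helper names, the naive tuple-based version comparison (which ignores pre-release tags like `rc1`), and the selection of packages are illustrative assumptions, not part of the project's tooling.

```python
from importlib.metadata import version, PackageNotFoundError

# Minimum versions mirroring the requirements listed above.
FLOORS = {"torch": "2.6", "transformers": "4.57.6", "accelerate": "1.10.0"}

def parse(v: str) -> tuple:
    # Naive numeric parse; does not handle pre-release tags like "2.6.0rc1".
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def meets_floor(installed: str, minimum: str) -> bool:
    """True if the installed version string satisfies the minimum."""
    return parse(installed) >= parse(minimum)

def check_environment() -> dict:
    """Return {package: True/False}; None means the package is not installed."""
    results = {}
    for pkg, floor in FLOORS.items():
        try:
            results[pkg] = meets_floor(version(pkg), floor)
        except PackageNotFoundError:
            results[pkg] = None
    return results
```

Running `check_environment()` before the conda/pip steps gives a quick report of which floors are already satisfied.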

Highlighted Details

  • Achieves industrial-strength performance within discrete frameworks.
  • Surpasses previous discrete vision modeling performance ceilings on understanding tasks.
  • Excels across seeing, creating, and talking capabilities within a unified model.
  • Demonstrates strong generative quality, even with a 28x compression ratio for text rendering.
  • Offers competitive performance in advanced speech comprehension, low-latency voice conversation, and customizable voice cloning.

Maintenance & Community

Contact is available via longcat-team@meituan.com or by opening an issue. A WeChat group is also available for community discussion.

Licensing & Compatibility

The model weights and source code are released under the MIT License. This license is permissive for commercial use and closed-source linking but does not grant rights to use Meituan trademarks or patents.

Limitations & Caveats

The model has not been exhaustively evaluated for all potential downstream applications. Users should be aware of general large language model limitations, including performance variations across languages, and must independently assess accuracy, safety, and fairness before deployment in sensitive contexts. Compliance with all applicable laws and regulations is the responsibility of the developer and downstream user.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 9
  • Star History: 384 stars in the last 17 days
