tuna-2 by facebookresearch

Pixel embeddings power unified multimodal understanding and generation

Created 2 months ago

732 stars

Top 46.3% on SourcePulse

Project Summary

Tuna-2 introduces a simplified Unified Multimodal Model (UMM) architecture that embeds raw pixels directly instead of passing images through a traditional vision encoder. The goal is a single, leaner model that handles both multimodal understanding and generation.

How It Works

Tuna-2 evolves from Tuna by progressively stripping away visual encoding components. The first step removes the VAE, yielding Tuna-R, a pixel-space UMM. Tuna-2 then drops the representation encoder as well and feeds raw image inputs through patch embedding layers directly. The authors present this pixel-embedding-first design as novel and report better results on multimodal benchmarks than the encoder-based variants.
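To make the patch embedding idea concrete, here is a minimal sketch in plain Python of what "direct patch embedding" generally means: cut the image into non-overlapping patches, flatten each one, and map it to a token with a single linear layer. It is not taken from the Tuna-2 codebase. The function names, the patch size of 4, and the embedding width of 8 are all hypothetical values picked for illustration.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


def embed(patches, weight, bias):
    """Apply one linear layer to every patch: token = weight @ patch + bias."""
    return [
        [sum(w * x for w, x in zip(w_row, p)) + b for w_row, b in zip(weight, bias)]
        for p in patches
    ]


# Hypothetical sizes, chosen only for illustration (not Tuna-2's real settings).
PATCH, CHANNELS, DIM = 4, 3, 8
rng = random.Random(0)
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(8)] for _ in range(8)]
weight = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH * CHANNELS)] for _ in range(DIM)]
bias = [0.0] * DIM

tokens = embed(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))  # 4 tokens, each of width 8
```

An 8x8 RGB image with 4x4 patches gives four patches of 48 values each, and the linear layer turns each into an 8-dimensional token. No VAE or pretrained representation encoder sits between the pixels and the tokens, which is the simplification the section describes.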

Quick Start & Requirements

To install, clone the repository with git clone and run bash scripts/setup_uv.sh, which creates a .venv with the dependencies. Manual setup with the uv package manager is also supported. Requirements are Python, uv, and PyTorch built for CUDA 12.1 (cu121). Inference runs through a single script, scripts/launch/predict.sh, which supports multiple tasks, model variants (Tuna-2, Tuna-R, Tuna), sizes (7b, 2b), and resolutions.

Project page: https://tuna-ai.org/tuna-2
arXiv: https://arxiv.org/abs/2604.24763
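Before running the setup script, it can help to confirm that the listed requirements are visible from the current interpreter. The following is a small sketch using only the standard library plus an optional torch import. The check_environment helper is hypothetical and not part of the repository; it only reports what it finds and does not install anything.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report which of the listed requirements are visible from this interpreter."""
    report = {
        "python": sys.version.split()[0],
        "uv_on_path": shutil.which("uv") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_cuda": None,
    }
    if report["torch_installed"]:
        try:
            import torch

            # torch.version.cuda is a string such as "12.1", or None on CPU-only builds.
            report["torch_cuda"] = torch.version.cuda
        except Exception:
            # A broken install should not crash the check; leave torch_cuda as None.
            pass
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

If torch_cuda prints something other than 12.1, the installed PyTorch build does not match the cu121 requirement stated above.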

Highlighted Details

Tuna-2, which uses pixel embeddings, is reported to outperform both Tuna-R and Tuna on diverse multimodal benchmarks.
Supports flexible inference configurations including text-to-image (t2i) and image editing tasks.
Offers multiple model variants and sizes (e.g., 7b, 2b) to cater to different computational needs.
Accommodates a range of output resolutions, from 512x512 up to 1344x768.
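Because a pixel-embedding model turns every image patch into a token, the supported resolutions translate directly into sequence length. The sketch below works through that arithmetic for the two resolutions named above. The patch size of 16 is an assumption made for illustration; the source does not state the patch size Tuna-2 uses.

```python
def token_count(width, height, patch=16):
    """Number of non-overlapping patch tokens for a width x height image.

    The default patch size of 16 is a hypothetical value, not Tuna-2's documented setting.
    """
    if width % patch or height % patch:
        raise ValueError("resolution must be divisible by the patch size")
    return (width // patch) * (height // patch)


for width, height in [(512, 512), (1344, 768)]:
    print(f"{width}x{height}: {token_count(width, height)} tokens")
# 512x512: 1024 tokens
# 1344x768: 4032 tokens
```

Under this assumption the largest listed resolution produces roughly four times as many image tokens as 512x512, which is worth keeping in mind when choosing between the 7b and 2b variants for a given memory budget.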

Maintenance & Community

The project is backed by a large team of researchers from Meta and academic institutions. The README does not link to community channels such as Discord or Slack, and it does not describe a public roadmap.

Licensing & Compatibility

The project is licensed under the Apache License 2.0. This permissive license allows for broad compatibility, including commercial use and integration into closed-source projects.

Limitations & Caveats

Due to organizational policy, full production-trained model weights are not being released. A foundation checkpoint with some layers removed will be provided, necessitating fine-tuning to achieve full model quality. Video generation capabilities are also not released, although the codebase is available for users to train their own models.
