tuna-2  by facebookresearch

Pixel embeddings power unified multimodal understanding and generation

Created 1 month ago
687 stars

Top 49.0% on SourcePulse

Project Summary

Tuna-2 introduces a simplified Unified Multimodal Model (UMM) architecture that embeds image pixels directly instead of passing them through a traditional vision encoder. The approach aims to improve performance while streamlining multimodal understanding and generation, giving researchers and practitioners a leaner alternative to encoder-based designs.

How It Works

The Tuna line progressively strips away visual encoding components. Removing the VAE from Tuna yields Tuna-R, a pixel-space UMM. Tuna-2 simplifies further by bypassing the representation encoder entirely and passing raw image inputs through patch embedding layers. The project presents this pixel-embedding-first design as novel and reports better results on multimodal benchmarks.
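
As a rough illustration of what a direct patch embedding layer does, the sketch below cuts an image into non-overlapping patches, flattens each one, and applies a single linear projection to produce one token per patch. It is not taken from the Tuna-2 codebase: the function names (patchify, embed_patches), the patch size, the channel count, and the embedding width are all hypothetical, and pure Python is used only for readability where a real model would use a tensor library.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values of this pixel
            patches.append(flat)
    return patches


def embed_patches(patches, weight, bias):
    """Apply one linear layer to every patch: token = weight @ patch + bias."""
    return [
        [sum(w * p for w, p in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in patches
    ]


# Hypothetical sizes chosen for illustration only.
PATCH, CHANNELS, DIM = 4, 3, 8
rng = random.Random(0)

# An 8 x 8 RGB "image" of random pixel values.
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(8)] for _ in range(8)]

# Randomly initialised projection from patch pixels (4 * 4 * 3 = 48 values) to DIM.
weight = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH * CHANNELS)] for _ in range(DIM)]
bias = [0.0] * DIM

tokens = embed_patches(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))  # 4 8
```

The point of the sketch is that no pretrained encoder sits between the pixels and the token sequence: the only learned component at the input is the projection itself.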

Quick Start & Requirements

Clone the repository with git clone and run bash scripts/setup_uv.sh, which creates a .venv with the dependencies. Manual setup with the uv package manager is also supported.

  • Requirements: Python, uv, and PyTorch built for CUDA 12.1 (cu121).
  • Inference: a single script, scripts/launch/predict.sh, covers the supported tasks, model variants (Tuna-2, Tuna-R, Tuna), sizes (7b, 2b), and resolutions.
  • Project page: https://tuna-ai.org/tuna-2
  • arXiv: https://arxiv.org/abs/2604.24763

Highlighted Details

  • Tuna-2, which uses pixel embeddings, is reported to outperform Tuna-R and Tuna on diverse multimodal benchmarks.
  • Supports flexible inference configurations including text-to-image (t2i) and image editing tasks.
  • Offers multiple model variants and sizes (e.g., 7b, 2b) to cater to different computational needs.
  • Accommodates a range of output resolutions, from 512x512 up to 1344x768.
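
To give a feel for how the resolution range affects sequence length in a patch-based model, the sketch below counts the patches produced at the two resolutions named above. The patch size of 16 is a hypothetical value chosen for illustration, not a figure from the Tuna-2 README, and the helper patch_grid is likewise illustrative; the resolutions are read as width x height.

```python
def patch_grid(width, height, patch=16):
    """Return (rows, cols, total) for non-overlapping square patches.

    The default patch size of 16 is an assumption for illustration.
    """
    if width % patch or height % patch:
        raise ValueError("resolution must be a multiple of the patch size")
    cols, rows = width // patch, height // patch
    return rows, cols, rows * cols


for width, height in [(512, 512), (1344, 768)]:
    rows, cols, total = patch_grid(width, height)
    print(f"{width}x{height}: {rows} x {cols} grid, {total} patch tokens")
```

Under that assumption the largest resolution yields roughly four times as many image tokens as 512x512, which is one reason the smaller variants and resolutions matter for constrained hardware.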

Maintenance & Community

The project is backed by a large team of researchers from Meta and academic institutions, suggesting strong foundational support. However, the README does not provide direct links to community channels such as Discord or Slack, nor does it detail a public roadmap.

Licensing & Compatibility

The project is licensed under the Apache License 2.0. This permissive license allows for broad compatibility, including commercial use and integration into closed-source projects.

Limitations & Caveats

Due to organizational policy, full production-trained model weights are not being released. A foundation checkpoint with some layers removed will be provided, necessitating fine-tuning to achieve full model quality. Video generation capabilities are also not released, although the codebase is available for users to train their own models.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 5
  • Issues (30d): 10
  • Star history: 687 stars in the last 30 days