facebookresearch
Pixel embeddings power unified multimodal understanding and generation
Top 49.0% on SourcePulse
Tuna-2 is a simplified Unified Multimodal Model (UMM) architecture that feeds direct pixel embeddings to the model instead of passing images through a traditional vision encoder. The goal is a single, leaner model for both multimodal understanding and generation, aimed at researchers and practitioners who want fewer visual components to train and maintain.
How It Works
Tuna-2 evolves from Tuna by progressively stripping away visual encoding components. The first step removes the VAE, producing Tuna-R, a pixel-space UMM. Tuna-2 then bypasses the representation encoder entirely and applies patch embedding layers directly to raw image inputs. The authors present this pixel-embedding-first design as novel and report that it performs well on multimodal benchmarks.
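To make "direct patch embedding" concrete, here is a minimal sketch of a generic ViT-style patch embedding in plain Python: the image is cut into non-overlapping patches, each patch is flattened, and a linear projection maps it to a token vector. This is not Tuna-2's actual code; the function names (patchify, linear, patch_embed), the toy sizes, and the random weights are all assumptions chosen for illustration.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append this pixel's channel values
            patches.append(flat)
    return patches


def linear(vec, weight, bias):
    """Apply y = W v + b, with W stored as a list of output rows."""
    return [sum(v * w for v, w in zip(vec, row)) + b for row, b in zip(weight, bias)]


def patch_embed(image, patch, weight, bias):
    """Map raw pixels straight to one token vector per patch (no vision encoder)."""
    return [linear(p, weight, bias) for p in patchify(image, patch)]


# Toy example (hypothetical sizes): 4x4 RGB image, 2x2 patches, 8-dim tokens.
rng = random.Random(0)
H, W, C, P, D = 4, 4, 3, 2, 8
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

tokens = patch_embed(image, P, weight, bias)
print(len(tokens), len(tokens[0]))  # 4 8
```

In a real model the projection would be a learned layer (commonly a strided convolution or a linear layer over flattened patches), and the resulting tokens would be passed to the transformer alongside text tokens.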
Quick Start & Requirements
To install, clone the repository and run bash scripts/setup_uv.sh, which creates a .venv containing the dependencies. Manual setup with the uv package manager is also supported. Requirements are Python, uv, and a PyTorch build for CUDA 12.1 (cu121). Inference runs through a single script, scripts/launch/predict.sh, which covers the supported tasks, model variants (Tuna-2, Tuna-R, Tuna), sizes (7b, 2b), and resolutions. Project page: https://tuna-ai.org/tuna-2. arXiv: https://arxiv.org/abs/2604.24763.
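Before running the setup script, it can help to confirm that the listed requirements are present. The sketch below is not part of the repository; check_environment is a hypothetical helper that uses only the Python standard library, plus torch.version.cuda if PyTorch happens to be installed.

```python
import shutil
import sys
from importlib import util


def check_environment():
    """Report which of the listed requirements are visible from this interpreter."""
    report = {
        "python": sys.version.split()[0],
        "uv_on_path": shutil.which("uv") is not None,
        "torch_installed": util.find_spec("torch") is not None,
        "torch_cuda": None,
    }
    if report["torch_installed"]:
        import torch

        # The CUDA version the PyTorch build targets; None on CPU-only builds.
        report["torch_cuda"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A torch_cuda value of 12.1 would match the cu121 build named in the requirements; any other value suggests the environment differs from the one the README targets.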
Maintenance & Community
The project is authored by a large team of researchers from Meta and academic institutions. The README does not link to community channels such as Discord or Slack, and it does not describe a public roadmap.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, a permissive license that allows commercial use and integration into closed-source projects.
Limitations & Caveats
Due to organizational policy, the full production-trained model weights are not being released. Instead, a foundation checkpoint with some layers removed will be provided, and it must be fine-tuned to reach full model quality. Video generation weights are also not released, although the code is available for users who want to train their own models.