KangLiao929: Unified multimodal model for camera-centric spatial intelligence
Puffin is a unified multimodal model that integrates camera geometry to advance spatial intelligence in understanding and generation tasks. It targets researchers and engineers working with multimodal AI who require enhanced spatial reasoning and camera-aware capabilities. The framework offers benefits such as camera-controllable image generation, scene understanding from specific camera viewpoints, and spatial imagination.
How It Works
Puffin is trained in multiple stages: it first aligns vision encoders, LLMs, and diffusion models, then applies Supervised Fine-Tuning (SFT) and instruction tuning. Because spatial relationships and camera parameters are modeled explicitly, the same model supports both camera-controllable generation and camera-aware understanding.
Quick Start & Requirements
Installation requires PyTorch 2.7.0 and CUDA 12.6. Setup involves cloning the repo, creating a Python 3.10 conda environment, and installing dependencies via pip. Model checkpoints are available on Hugging Face. The project provides example scripts for camera-controllable image generation, scene understanding, world exploration, spatial imagination, and photographic guidance.
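Since the requirements pin specific versions, a quick check before running the example scripts can save a failed launch. The following is a minimal sketch, not part of the project: the helper names (`version_matches`, `check_environment`, `detect_versions`) are hypothetical, and the only values taken from the summary above are Python 3.10, PyTorch 2.7.0, and CUDA 12.6.

```python
import sys

# Versions stated in the requirements above.
REQUIRED = {"python": "3.10", "torch": "2.7.0", "cuda": "12.6"}


def version_matches(found, required):
    """True when `found` begins with the dotted components of `required`."""
    if found is None:
        return False
    # Drop local build suffixes such as "+cu126" before comparing.
    found_parts = found.split("+")[0].split(".")
    required_parts = required.split(".")
    return found_parts[:len(required_parts)] == required_parts


def check_environment(found):
    """Return human-readable mismatches; an empty list means all checks pass."""
    problems = []
    for name, required in REQUIRED.items():
        if not version_matches(found.get(name), required):
            problems.append(f"{name}: need {required}, found {found.get(name)}")
    return problems


def detect_versions():
    """Collect the versions of the running interpreter and, if present, PyTorch."""
    found = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    try:
        import torch
        found["torch"] = torch.__version__
        found["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        found["torch"] = None
        found["cuda"] = None
    return found


if __name__ == "__main__":
    for line in check_environment(detect_versions()) or ["environment matches"]:
        print(line)
```

The comparison is prefix-based, so a patch release such as Python 3.10.14 passes while 3.11 does not. Whether nearby versions also work with the project is not stated in the summary.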
Maintenance & Community
The project builds upon several foundational models and libraries, including OpenUni, MetaQuery, Qwen2.5, RADIOv3, SD3, and GeoCalib. The README does not mention community channels (e.g., Discord, Slack) or a detailed roadmap.
Licensing & Compatibility
The project is licensed under the NTU S-Lab License 1.0. Specific terms regarding commercial use or closed-source linking require consulting the full license text.
Limitations & Caveats
Camera maps for the Puffin-4M training dataset are omitted from the release because of their substantial size, but they can be generated. The README states that the dataset construction pipeline and camera captions for large-scale datasets will be released in the future.
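To illustrate why a per-pixel camera map is large to store yet cheap to regenerate, here is a minimal sketch that expands three camera parameters into a dense map. Everything in it is an assumption made for illustration: the pinhole model, the roll/pitch/vertical-FoV parameterization, the choice of a latitude map, and the function name `latitude_map`. The summary does not specify Puffin's actual camera-map format.

```python
import math


def latitude_map(width, height, roll_deg, pitch_deg, vfov_deg):
    """Per-pixel latitude (radians) of each viewing ray above the horizon.

    Hypothetical helper: assumes a pinhole camera with x right, y down,
    z forward. This is not Puffin's actual camera-map format.
    """
    roll, pitch = math.radians(roll_deg), math.radians(pitch_deg)
    # Focal length in pixels, derived from the vertical field of view.
    focal = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    # World "up" direction expressed in camera coordinates.
    up = (-math.sin(roll) * math.cos(pitch),
          -math.cos(roll) * math.cos(pitch),
          math.sin(pitch))
    rows = []
    for v in range(height):
        row = []
        for u in range(width):
            # Ray through the pixel center, normalized to unit length.
            x = (u + 0.5 - width / 2) / focal
            y = (v + 0.5 - height / 2) / focal
            norm = math.sqrt(x * x + y * y + 1.0)
            dot = (x * up[0] + y * up[1] + up[2]) / norm
            # Clamp to guard against rounding just outside [-1, 1].
            row.append(math.asin(max(-1.0, min(1.0, dot))))
        rows.append(row)
    return rows


# Three numbers in, width * height values out.
camera_map = latitude_map(5, 5, roll_deg=0, pitch_deg=10, vfov_deg=60)
```

Under these assumptions the map is a deterministic function of the camera parameters, so storing the parameters alone is enough to rebuild it on demand.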