Puffin by KangLiao929

Unified multimodal model for camera-centric spatial intelligence

Created 4 months ago

382 stars

Top 75.0% on SourcePulse

Project Summary

Puffin is a unified multimodal model that integrates camera geometry to advance spatial intelligence in both understanding and generation tasks. It targets researchers and engineers working on multimodal AI who need stronger spatial reasoning and camera-aware capabilities. Its capabilities include camera-controllable image generation, scene understanding from specific camera viewpoints, and spatial imagination.

How It Works

Puffin is trained with a multi-stage strategy that first aligns the vision encoder, LLM, and diffusion components, then applies Supervised Fine-Tuning (SFT) and instruction tuning. Throughout, spatial relationships and camera parameters are modeled explicitly, which is what enables camera-controllable generation and camera-aware understanding.

Quick Start & Requirements

Installation requires PyTorch 2.7.0 and CUDA 12.6. Setup involves cloning the repo, creating a Python 3.10 conda environment, and installing dependencies via pip. Model checkpoints are available on Hugging Face. The project provides example scripts for camera-controllable image generation, scene understanding, world exploration, spatial imagination, and photographic guidance.
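
The version requirements above can be checked before downloading any checkpoints. The following is a minimal sketch, not part of the Puffin repository: the function names `detect_versions` and `check_environment` are hypothetical, and the exact-match policy (rather than "at least") is an assumption based on the versions stated above. It relies only on `torch.__version__` and `torch.version.cuda`, and reports PyTorch as missing if it cannot be imported.

```python
import sys

# Versions stated in the summary above; exact matching is an assumption.
REQUIRED_PYTHON = (3, 10)
REQUIRED_TORCH = "2.7.0"
REQUIRED_CUDA = "12.6"


def detect_versions():
    """Return (python_version, torch_version, cuda_version) for this interpreter."""
    python_version = tuple(sys.version_info[:2])
    try:
        import torch
    except ImportError:
        return python_version, None, None
    # torch.version.cuda is None for CPU-only builds.
    return python_version, str(torch.__version__), torch.version.cuda


def check_environment(python_version, torch_version, cuda_version):
    """Return a list of mismatch messages; an empty list means all checks passed."""
    problems = []
    if tuple(python_version) != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in python_version)
        problems.append(f"Python {found} found, 3.10 expected")
    if torch_version is None:
        problems.append("PyTorch is not installed")
        return problems
    # Drop local build suffixes such as "+cu126" before comparing.
    base_version = torch_version.split("+")[0]
    if base_version != REQUIRED_TORCH:
        problems.append(f"PyTorch {base_version} found, {REQUIRED_TORCH} expected")
    if cuda_version != REQUIRED_CUDA:
        problems.append(f"CUDA {cuda_version} found, {REQUIRED_CUDA} expected")
    return problems


if __name__ == "__main__":
    issues = check_environment(*detect_versions())
    print("Environment OK" if not issues else "\n".join(issues))
```

Keeping detection separate from comparison lets the comparison be exercised with fixed inputs, independent of what is installed on the machine.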

Highlighted Details

Unified multimodal framework explicitly integrating camera geometry for spatial understanding and generation.
Offers three variants: Puffin-Base, Puffin-Thinking, and Puffin-Instruct.
Introduces Puffin-4M, a 449 GB dataset of 4 million vision-language-camera triplets.
Supports advanced tasks including camera-controllable generation, cross-view understanding, spatial imagination, and photographic guidance.
Utilizes a multi-stage training strategy with custom benchmarks for evaluation.

Maintenance & Community

The project builds upon several foundational models and libraries, including OpenUni, MetaQuery, Qwen2.5, RADIOv3, SD3, and GeoCalib. The README does not list community channels (e.g., Discord, Slack) or a detailed roadmap.

Licensing & Compatibility

The project is licensed under the NTU S-Lab License 1.0. Consult the full license text for the terms governing commercial use and closed-source linking.

Limitations & Caveats

Camera maps for the Puffin-4M training dataset are omitted because of their substantial size, but they can be regenerated. The README indicates that dataset construction pipelines and camera captions for large-scale datasets are planned for future release.
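
This summary does not describe the format of Puffin's camera maps, so the following is only a generic illustration of why a dense per-pixel camera representation can be recomputed from a few parameters instead of being stored. It is a sketch under stated assumptions: an ideal pinhole camera, a principal point at the image center, square pixels, and an x-right, y-down, z-forward convention. The function name `pinhole_ray_map` is hypothetical, and the construction is not taken from the Puffin codebase.

```python
import math


def pinhole_ray_map(width, height, vfov_deg):
    """Return a height x width grid of unit ray directions (x, y, z).

    Assumed convention (not Puffin's): x right, y down, z forward,
    principal point at the image center, square pixels.
    """
    # Focal length in pixels derived from the vertical field of view.
    focal = (height / 2.0) / math.tan(math.radians(vfov_deg) / 2.0)
    rows = []
    for v in range(height):
        row = []
        for u in range(width):
            # Offset of the pixel center from the principal point,
            # expressed on the z = 1 image plane.
            x = (u + 0.5 - width / 2.0) / focal
            y = (v + 0.5 - height / 2.0) / focal
            norm = math.sqrt(x * x + y * y + 1.0)
            row.append((x / norm, y / norm, 1.0 / norm))
        rows.append(row)
    return rows


if __name__ == "__main__":
    rays = pinhole_ray_map(width=3, height=3, vfov_deg=90.0)
    print(rays[1][1])  # ray through the image center
```

The whole grid is determined by the image size and a single angle, which is the general reason maps of this kind are cheap to regenerate and expensive to store.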

