Puffin  by KangLiao929

Unified multimodal model for camera-centric spatial intelligence

Created 4 weeks ago


269 stars

Top 95.3% on SourcePulse

Project Summary

Puffin is a unified multimodal model that integrates camera geometry into both understanding and generation tasks. It targets researchers and engineers working on multimodal AI who need stronger spatial reasoning and camera-aware capabilities. Its capabilities include camera-controllable image generation, scene understanding from specific camera viewpoints, and spatial imagination.

How It Works

Puffin is trained in multiple stages: vision encoders, LLMs, and diffusion models are first aligned, followed by Supervised Fine-Tuning (SFT) and instruction tuning. By modeling spatial relationships and camera parameters explicitly, the model supports camera-controllable generation and camera-aware understanding within a single framework.

Quick Start & Requirements

Installation requires PyTorch 2.7.0 and CUDA 12.6. Setup involves cloning the repo, creating a Python 3.10 conda environment, and installing dependencies via pip. Model checkpoints are available on Hugging Face. The project provides example scripts for camera-controllable image generation, scene understanding, world exploration, spatial imagination, and photographic guidance.
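As a quick way to confirm that an environment matches the versions listed above, a minimal sketch of a version check might look like the following. The helper names (`matches`, `check_environment`, `detect_versions`) are hypothetical and not part of the Puffin repository; only the required version numbers come from this summary.

```python
import platform

# Versions stated in the summary above.
REQUIRED = {"python": "3.10", "torch": "2.7.0", "cuda": "12.6"}


def matches(found, required):
    """True if `found` starts with the dotted components of `required`."""
    if found is None:
        return False
    # Drop local build tags such as "+cu126" before comparing.
    found_parts = found.split("+")[0].split(".")
    required_parts = required.split(".")
    return found_parts[: len(required_parts)] == required_parts


def check_environment(found):
    """Return a list of human-readable mismatches (empty means all good)."""
    problems = []
    for name, required in REQUIRED.items():
        if not matches(found.get(name), required):
            problems.append(f"{name}: need {required}, found {found.get(name)}")
    return problems


def detect_versions():
    """Collect installed versions; torch entries stay None if torch is absent."""
    found = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        import torch

        found["torch"] = torch.__version__
        found["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return found


if __name__ == "__main__":
    for line in check_environment(detect_versions()) or ["environment OK"]:
        print(line)
```

Running the script inside the conda environment prints one line per mismatch, or a single confirmation line if everything matches. The comparison is a prefix match on dotted components, so a patch release such as Python 3.10.14 still satisfies the 3.10 requirement.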

Highlighted Details

  • Unified multimodal framework explicitly integrating camera geometry for spatial understanding and generation.
  • Offers three variants: Puffin-Base, Puffin-Thinking, and Puffin-Instruct.
  • Introduces Puffin-4M, a 449GB dataset of 4 million vision-language-camera triplets.
  • Supports advanced tasks including camera-controllable generation, cross-view understanding, spatial imagination, and photographic guidance.
  • Utilizes a multi-stage training strategy with custom benchmarks for evaluation.
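
To make the "vision-language-camera triplet" idea concrete, here is an illustrative sketch of how one such sample could be represented, together with the average storage per triplet implied by the figures above. The class names, field names, the choice of roll, pitch, and vertical field of view as camera parameters, and the sample values are all assumptions for illustration; the actual Puffin-4M schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraParams:
    """Hypothetical camera description; angles in degrees."""
    roll_deg: float
    pitch_deg: float
    vfov_deg: float


@dataclass(frozen=True)
class Triplet:
    """One vision-language-camera sample (field names are illustrative)."""
    image_path: str
    caption: str
    camera: CameraParams


def average_bytes_per_sample(total_gb, num_samples):
    """Average storage per sample, treating 1 GB as 10**9 bytes."""
    return total_gb * 10**9 / num_samples


# Made-up example values, not taken from the dataset.
sample = Triplet("images/000001.jpg", "A harbor at dusk.", CameraParams(2.0, -10.0, 60.0))

avg = average_bytes_per_sample(449, 4_000_000)
print(f"{avg / 1000:.2f} kB per triplet on average")  # prints 112.25 kB per triplet on average
```

The per-triplet figure assumes decimal gigabytes; if the 449GB figure is measured in binary units, the average is slightly higher.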

Maintenance & Community

The project builds upon several foundational models and libraries, including OpenUni, MetaQuery, Qwen2.5, RADIOv3, SD3, and GeoCalib. The README does not mention community channels (e.g., Discord, Slack) or a detailed roadmap.

Licensing & Compatibility

The project is licensed under the NTU S-Lab License 1.0. Specific terms regarding commercial use or closed-source linking require consulting the full license text.

Limitations & Caveats

Camera maps for the Puffin-4M training dataset are omitted because of their size, but they can be generated. The README indicates that dataset construction pipelines and camera captions for large-scale datasets will be released in the future.
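
This summary does not say what a camera map contains. One plausible reading, assumed here, is a per-pixel map derived from a few camera parameters, for example the latitude of each pixel's viewing ray above the horizon. The sketch below computes such a map for a centered pinhole camera from roll, pitch, and vertical field of view. The function name, the parameterization, and the sign convention for roll are assumptions and may not match Puffin's actual format.

```python
import math


def latitude_map(height, width, roll, pitch, vfov):
    """Per-pixel latitude (radians) of each viewing ray above the horizon.

    Assumes a centered pinhole camera with square pixels; camera axes are
    x right, y down, z forward. All angles are in radians.
    """
    focal = (height / 2) / math.tan(vfov / 2)  # focal length in pixels
    cx, cy = (width - 1) / 2, (height - 1) / 2
    # World "up" expressed in camera coordinates (sign of roll is a convention).
    up = (
        -math.sin(roll) * math.cos(pitch),
        -math.cos(roll) * math.cos(pitch),
        math.sin(pitch),
    )
    rows = []
    for v in range(height):
        row = []
        for u in range(width):
            ray = ((u - cx) / focal, (v - cy) / focal, 1.0)
            norm = math.sqrt(sum(c * c for c in ray))
            dot = sum(r * g for r, g in zip(ray, up)) / norm
            # Clamp against rounding error before taking the arcsine.
            row.append(math.asin(max(-1.0, min(1.0, dot))))
        rows.append(row)
    return rows
```

Under these assumptions the center pixel's latitude equals the pitch, and a level camera sees positive latitudes in the top half of the image and negative ones in the bottom half. Because the map is fully determined by three numbers, it can be recomputed on demand rather than stored, which is consistent with the note above that the maps are omitted from the release.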

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History
273 stars in the last 28 days
