PRISM-VL by kepengxu

Vision-language models reasoning from camera sensor measurements

Created 2 weeks ago


424 stars

Top 69.0% on SourcePulse

Project Summary

PRISM-VL addresses the loss of critical sensor evidence in standard RGB images rendered by image signal processing (ISP) pipelines. By using RAW-derived Meas.-XYZ inputs together with camera metadata, it enables vision-language models (VLMs) to reason more robustly. This is particularly useful for researchers and engineers aiming to improve VLM performance in challenging visual scenarios.

How It Works

This project shifts VLM visual input from post-ISP RGB to RAW-derived Meas.-XYZ channels augmented with camera metadata (e.g., ISO, exposure time, aperture). This preserves sensor evidence that is often discarded during RGB rendering, and the authors report improved performance on measurement-sensitive tasks such as low-illumination recovery and scene text recognition.
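To make the idea of a measurement-space input concrete, here is a minimal Python sketch under stated assumptions. It is not PRISM-VL's actual preprocessing. It assumes an already demosaiced linear RAW pixel, subtracts the black level, normalizes by the white level, and maps camera RGB to CIE XYZ with a 3x3 matrix. As a stand-in it uses the standard linear-sRGB-to-XYZ (D65) matrix; a real camera needs its own calibration matrix (for example, one derived from DNG color-matrix tags). The helper names (raw_to_xyz, CaptureMetadata, to_prompt) and the text format used for the metadata are hypothetical.

```python
from dataclasses import dataclass

# Stand-in matrix: standard linear sRGB -> CIE XYZ (D65).
# ASSUMPTION: a real pipeline would use a camera-specific calibration matrix.
CAM_TO_XYZ = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
]


def raw_to_xyz(rgb, black_level, white_level, matrix=CAM_TO_XYZ):
    """Normalize one demosaiced linear RAW pixel to [0, 1] and map it to XYZ."""
    span = white_level - black_level
    # Subtract the sensor black level, scale by the usable range, clip to [0, 1].
    norm = [min(max((c - black_level) / span, 0.0), 1.0) for c in rgb]
    # 3x3 matrix-vector product: each output channel is a weighted sum of R, G, B.
    return [sum(m * c for m, c in zip(row, norm)) for row in matrix]


@dataclass
class CaptureMetadata:
    """Hypothetical container for the capture settings passed alongside pixels."""
    iso: int
    exposure_time_s: float
    f_number: float

    def to_prompt(self) -> str:
        # ASSUMPTION: metadata is serialized as plain text for the language model.
        return (f"ISO {self.iso}, exposure {self.exposure_time_s:g} s, "
                f"aperture f/{self.f_number:g}")


if __name__ == "__main__":
    # A saturated pixel maps to the D65 white point (row sums of the matrix).
    white = raw_to_xyz([16383, 16383, 16383], black_level=512, white_level=16383)
    print([round(v, 4) for v in white])
    print(CaptureMetadata(iso=3200, exposure_time_s=1 / 30, f_number=1.8).to_prompt())
```

Keeping the pixel data linear (no tone curve or gamma) is what lets measurements such as relative radiance survive, which is the property the project attributes to Meas.-XYZ inputs.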

Quick Start & Requirements

Install: Clone the repository and execute bash install_editable.sh.
Prerequisites: Requires CUDA-enabled hardware for model inference. Users must download the MeasL-Bench-V1 and MeasL-150K-V1 datasets from Hugging Face.
Resources: Setup involves downloading multi-gigabyte datasets and potentially large model weights.
Links: Project Page, arXiv, Demo Inference Guide, Evaluation Guide.

Highlighted Details

Reported performance gains: PRISM-VL-8B improves over the RGB-input Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 LLM-Judge points on MeasL-Bench.
Introduces MeasL-Bench-V1 (2,183 examples) and MeasL-150K-V1 (152,517 instruction-tuning examples) for measurement-grounded VLM evaluation and training.
Provides LoRA adapters for Qwen3-VL models (2B, 4B, 8B), with reported improvements across capabilities such as HDR Evidence Recovery, Low-Illumination Evidence Recovery, and Scene Text Recognition.
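For readers unfamiliar with LoRA, the sketch below shows the standard low-rank adapter update y = Wx + (alpha/r)·B(Ax). The pretrained weight W stays frozen, and only the small matrices A and B are trained. This is a generic pure-Python illustration of the technique, not code from the PRISM-VL repository. The function names, matrix sizes, and values are hypothetical.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """LoRA-adapted linear layer: y = W x + (alpha / r) * B (A x).

    W: frozen d_out x d_in weight; A: r x d_in; B: d_out x r (r is the rank).
    """
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r                # LoRA scaling factor
    return [b + scale * d for b, d in zip(base, delta)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 frozen weight
    A = [[1.0, 1.0]]              # rank r = 1
    B = [[0.5], [0.0]]
    print(lora_forward(W, A, B, [1.0, 2.0], alpha=2.0, r=1))  # [4.0, 2.0]
```

Because B is typically initialized to zero, the adapter starts as a no-op and the model initially behaves exactly like the frozen base model. This is why a single base Qwen3-VL checkpoint can be paired with separately distributed adapters.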

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or detailed maintenance information beyond author contributions are provided in the README.

Licensing & Compatibility

The MeasL-Bench-V1 and MeasL-150K-V1 datasets are licensed under CC BY-NC 4.0, permitting non-commercial research and education use only, with mandatory citation. The code license is not explicitly stated.

Limitations & Caveats

This is a research release focused on specific measurement-grounded VLM capabilities. The CC BY-NC 4.0 dataset license restricts commercial applications.

Health Check

Last Commit

2 weeks ago

Responsiveness

Inactive


Star History

425 stars in the last 17 days