Thinking-with-Video by tongjingqi

Video generation as a multimodal reasoning paradigm

Created 3 months ago
260 stars

Top 97.6% on SourcePulse

Summary

This project introduces "Thinking with Video," a novel paradigm that leverages video generation for multimodal reasoning. It provides the VideoThinkBench, a benchmark designed to evaluate video generation models on both vision-centric and text-centric tasks. The research demonstrates that models like Sora-2 can achieve competitive performance, surpassing existing vision-language models on certain tasks and showcasing potential for unified multimodal understanding.

How It Works

The core innovation is the "Thinking with Video" paradigm, which utilizes video generation models to visualize dynamic processes, represent temporal evolution, and embed textual information within video frames. This approach aims to overcome the static limitations inherent in image-based reasoning and the modality separation found in traditional methods, enabling more human-like dynamic reasoning through generated video content.

Quick Start & Requirements

  • Installation: Clone the repository (git clone --recursive https://github.com/tongjingqi/Thinking-with-Video.git), navigate into the directory, create and activate a Python 3.12 Conda environment (conda create -y -n thinking_with_video python==3.12, conda activate thinking_with_video), and install dependencies (pip install -r requirements.txt).
  • Dataset Download: Benchmark datasets are available on Hugging Face (hf download --repo-type dataset OpenMOSS-Team/VideoThinkBench --local-dir VideoThinkBench). Datasets require unzipping using the provided unzip_dir.sh script. A "minitest" version is available for reduced evaluation costs.
  • Prerequisites: Python 3.12, Conda.
  • Links: Paper, VideoThinkBench Dataset.
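Assuming the commands quoted above are current, the full setup can be run end to end as follows. The exact invocation of `unzip_dir.sh` (and whether it takes the dataset directory as an argument) is not specified in this summary, so check the repository README before running it:

```shell
# Clone with submodules and enter the repository
git clone --recursive https://github.com/tongjingqi/Thinking-with-Video.git
cd Thinking-with-Video

# Create and activate an isolated Python 3.12 Conda environment
conda create -y -n thinking_with_video python==3.12
conda activate thinking_with_video
pip install -r requirements.txt

# Fetch the benchmark datasets from Hugging Face
# (the `hf` CLI ships with recent versions of huggingface_hub)
hf download --repo-type dataset OpenMOSS-Team/VideoThinkBench --local-dir VideoThinkBench

# Unpack the zipped datasets with the provided script
# (invocation assumed; see the repository README)
bash unzip_dir.sh
```

For a cheaper first pass, point the evaluation at the "minitest" split mentioned above rather than the full benchmark.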

Highlighted Details

  • VideoThinkBench: A comprehensive benchmark specifically designed for evaluating video generation models' reasoning capabilities. It includes vision-centric tasks (e.g., eyeballing puzzles, ARC-AGI-2, mazes) and text-centric tasks adapted from established benchmarks (e.g., MATH, MMLU).
  • Sora-2 Performance: Demonstrates strong performance, generally surpassing SOTA VLMs on eyeballing puzzles. On the VideoThinkBench (full test), Sora-2 achieves a 35.0% average on vision-centric tasks and 68.6% on text-centric tasks. It achieves 79.1% on MMMU. On the minitest, Sora-2 achieves a 31.6% average.
  • Unified Multimodal Reasoning: The paradigm shows potential for unified multimodal reasoning, with Sora-2 exhibiting strong performance on text-centric benchmarks by embedding text within video frames.
  • Few-Shot Learning: Sora-2 exhibits few-shot learning capabilities, particularly on tasks requiring pattern recognition from input-output pairs like ARC-AGI-2.

Maintenance & Community

The README does not provide specific details on community channels (e.g., Discord, Slack), active contributors, or a public roadmap.

Licensing & Compatibility

The project is released under the MIT License, which generally permits broad use, including commercial applications, with attribution.

Limitations & Caveats

While Sora-2 can reach correct final answers on reasoning tasks, it struggles to generate coherent visual reasoning processes within the videos themselves. The project's performance claims also depend on specific answer-extraction protocols ("Major Frame" for video generation models, "Audio" for text-centric tasks), so results may not transfer to other evaluation setups.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 28 stars in the last 30 days
