FSDrive by MIV-XJTU

Visual reasoning for autonomous driving trajectory planning

Created 5 months ago
429 stars

Top 69.0% on SourcePulse

View on GitHub
Project Summary

FutureSightDrive (FSDrive) addresses the challenge of visual reasoning in autonomous driving trajectory planning. It introduces a spatio-temporal Chain of Thought (CoT) approach for end-to-end Vision-Language-Action (VLA) models, enabling them to "think visually" and unify generation with understanding using minimal data. This project targets researchers and engineers in the autonomous driving domain seeking to advance the field towards more sophisticated visual reasoning capabilities.

How It Works

FSDrive integrates a spatio-temporal CoT mechanism into an end-to-end VLA framework, letting the model reason over visual information across time before committing to a trajectory. Its core advantage is unifying visual generation and understanding in a single model, with reduced data requirements, so that visual reasoning becomes a first-class capability for autonomous driving systems.
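To make the two-stage idea concrete, here is a toy Python sketch of one plausible reading of this pipeline: the model first emits visual tokens for an imagined future frame (the spatio-temporal CoT), then decodes waypoints conditioned on that intermediate result. Every class and method below is a hypothetical stand-in for illustration, not FSDrive's actual API:

    # Conceptual toy, not the authors' code: "think visually" by generating a
    # future frame's visual tokens first, then decode a trajectory from them.
    from dataclasses import dataclass


    @dataclass
    class Waypoint:
        x: float  # meters ahead, ego frame
        y: float  # meters left (+) / right (-), ego frame


    class ToyVLA:
        """Hypothetical stand-in for a VLA model with a visual-CoT stage."""

        def generate_future_frame_tokens(self, frames: list[list[int]]) -> list[int]:
            # A real model would autoregressively emit VQ tokens for a
            # predicted future frame; the toy just echoes the latest frame.
            return frames[-1]

        def decode_trajectory(self, frames, future_tokens) -> list[Waypoint]:
            # A real model conditions on observations plus the imagined
            # future; the toy returns a straight constant-velocity rollout.
            return [Waypoint(x=2.0 * t, y=0.0) for t in range(1, 7)]


    def plan(model: ToyVLA, frames: list[list[int]]) -> list[Waypoint]:
        cot_tokens = model.generate_future_frame_tokens(frames)  # think visually
        return model.decode_trajectory(frames, cot_tokens)       # then act


    if __name__ == "__main__":
        frames = [[1, 2, 3], [4, 5, 6]]  # stand-in visual-token sequences
        print(plan(ToyVLA(), frames))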

Quick Start & Requirements

Setup proceeds in four stages:

  • Environment: clone the repository, then create and activate a Conda environment with Python 3.10. Key dependencies are CUDA 12.4 and pinned PyTorch builds (torch 2.5.1, torchvision 0.20.1, torchaudio 2.5.1 for cu124).
  • Dependencies: install the packages from the LLaMA-Factory subdirectory, then those in the root requirements.txt.
  • Data: download the nuScenes dataset, extract visual tokens using the provided MoVQGAN scripts, and construct the pre-training and fine-tuning datasets in the LLaMA-Factory format (a hedged example record follows below).
  • Training: launch pre-training and subsequent supervised fine-tuning (SFT) via llamafactory-cli train with the provided YAML configurations. Inference, evaluation, and visualization scripts are also available.
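As a concrete illustration of the dataset-construction step, below is a minimal sketch of one SFT record in a LLaMA-Factory-style multimodal "sharegpt" layout (a messages list plus an images list). The prompt wording, waypoint encoding, and file paths are assumptions for illustration, not FSDrive's actual format:

    # Hedged sketch: one LLaMA-Factory-style multimodal training record.
    # Field names follow LLaMA-Factory's "sharegpt" + images convention;
    # prompt text, waypoint encoding, and paths are illustrative guesses.
    import json

    record = {
        "messages": [
            {
                "role": "user",
                "content": "<image>Given the front-camera view, plan the ego "
                           "trajectory for the next 3 seconds.",
            },
            {
                "role": "assistant",
                "content": "[(1.2, 0.0), (2.5, 0.1), (3.9, 0.1)]",  # waypoints
            },
        ],
        "images": ["nuscenes/samples/CAM_FRONT/example.jpg"],  # placeholder path
    }

    # Write a one-record dataset; register it in LLaMA-Factory's
    # dataset_info.json before pointing a training YAML at it.
    with open("fsdrive_sft_demo.json", "w") as f:
        json.dump([record], f, indent=2)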

Highlighted Details

  • Featured as a NeurIPS 2025 spotlight paper.
  • Pioneers "visual reasoning" for autonomous driving trajectory planning.
  • Unifies visual generation and understanding with minimal data.
  • Built upon foundational codebases including LLaMA-Factory, MoVQGAN, GPT-Driver, and Agent-Driver.

Maintenance & Community

The project is associated with the authors of the NeurIPS 2025 spotlight paper. No specific community channels (e.g., Discord, Slack), roadmap, or ongoing maintenance signals are detailed in the README.

Licensing & Compatibility

The provided README does not specify a software license. This absence creates significant ambiguity around reuse, particularly for commercial applications or integration into closed-source projects.

Limitations & Caveats

The project requires a specific and relatively recent CUDA version (12.4), potentially limiting adoption on older hardware. The extensive data preparation steps and reliance on multiple complex codebases (LLaMA-Factory, MoVQGAN) may pose an integration challenge. The absence of a clear license is a critical adoption blocker.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 12
  • Star History: 114 stars in the last 30 days
