MIV-XJTU: Visual reasoning for autonomous driving trajectory planning
FutureSightDrive (FSDrive) addresses the challenge of visual reasoning in autonomous driving trajectory planning. It introduces a spatio-temporal Chain of Thought (CoT) approach for end-to-end Vision-Language-Action (VLA) models, enabling them to "think visually" and unify generation with understanding using minimal data. This project targets researchers and engineers in the autonomous driving domain seeking to advance the field towards more sophisticated visual reasoning capabilities.
How It Works
FSDrive integrates a spatio-temporal CoT mechanism into an end-to-end VLA framework, so the model reasons about visual information over time when planning a trajectory. Its main advantage is that it unifies visual generation and visual understanding in a single model with reduced data requirements, making visual reasoning a primary capability for autonomous driving systems.
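One common way to let a single autoregressive model both read and generate images is to append discrete visual tokens to the text vocabulary. The sketch below illustrates that general idea only; it is not taken from the FSDrive code. The vocabulary size, codebook size, and all function names are hypothetical.

```python
from typing import List, Optional

# Illustrative assumptions only; these are not FSDrive's actual values.
TEXT_VOCAB_SIZE = 32_000   # hypothetical text vocabulary size
CODEBOOK_SIZE = 16_384     # hypothetical VQ codebook size


def visual_to_token_id(code_index: int) -> int:
    """Map a VQ codebook index to an id in the extended vocabulary."""
    if not 0 <= code_index < CODEBOOK_SIZE:
        raise ValueError(f"code index out of range: {code_index}")
    return TEXT_VOCAB_SIZE + code_index


def token_id_to_visual(token_id: int) -> Optional[int]:
    """Return the codebook index for a visual token id, or None for a text id."""
    if TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + CODEBOOK_SIZE:
        return token_id - TEXT_VOCAB_SIZE
    return None


def build_cot_sequence(
    prompt_ids: List[int],
    future_frame_codes: List[int],
    trajectory_ids: List[int],
) -> List[int]:
    """Concatenate prompt, visual 'thought' tokens, and trajectory tokens."""
    visual_ids = [visual_to_token_id(c) for c in future_frame_codes]
    return list(prompt_ids) + visual_ids + list(trajectory_ids)
```

In this layout the visual tokens for a predicted future frame sit between the prompt and the trajectory, which is one plausible reading of "thinking visually" before acting.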
Quick Start & Requirements
Installation: clone the repository, create a Conda environment with Python 3.10, and activate it. The setup targets CUDA 12.4 with pinned PyTorch builds for cu124: torch 2.5.1, torchvision 0.20.1, and torchaudio 2.5.1. Then install the packages from the LLaMA-Factory subdirectory and from the root requirements.txt.

Data preparation: download the nuScenes dataset, extract visual tokens with the provided MoVQGAN scripts, and build the pre-training and fine-tuning datasets in the LLaMA-Factory format.

Training: run llamafactory-cli train with the provided YAML configurations, first for pre-training and then for supervised fine-tuning (SFT). Scripts for inference, evaluation, and visualization are also provided.
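Because the setup depends on exact package versions, it can help to verify the environment before training. The following is a minimal sketch, assuming the pins listed above; the helper names are hypothetical and not part of the repository.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins taken from the setup description above.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def base_version(version: str) -> str:
    """Strip a local build tag such as '+cu124' from a version string."""
    return version.split("+", 1)[0]


def find_mismatches(
    installed: Dict[str, Optional[str]],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (expected, found)} for every pin that is not met."""
    mismatches = {}
    for name, expected in EXPECTED.items():
        found = installed.get(name)
        if found is None or base_version(found) != expected:
            mismatches[name] = (expected, found)
    return mismatches


def installed_versions() -> Dict[str, Optional[str]]:
    """Look up installed versions; packages that are missing map to None."""
    versions: Dict[str, Optional[str]] = {}
    for name in EXPECTED:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    problems = find_mismatches(installed_versions())
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found}")
    if not problems:
        print("All pinned packages match.")
```

The check compares only the base version, so a build reported as 2.5.1+cu124 counts as matching 2.5.1. It does not confirm that the CUDA runtime itself is 12.4.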
Maintenance & Community
The project comes from the authors of the associated NeurIPS 2025 spotlight paper. The README does not mention community channels (e.g., Discord, Slack), a roadmap, or ongoing maintenance plans.
Licensing & Compatibility
The provided README does not specify a software license. Without an explicit license, the terms of use are ambiguous, particularly for commercial applications or integration into closed-source projects.
Limitations & Caveats
The project requires a specific and relatively recent CUDA version (12.4), potentially limiting adoption on older hardware. The extensive data preparation steps and reliance on multiple complex codebases (LLaMA-Factory, MoVQGAN) may pose an integration challenge. The absence of a clear license is a critical adoption blocker.