LLM-based vision-and-language navigation agent
Top 95.7% on SourcePulse
NavGPT is an open-source implementation of a vision-and-language navigation agent that leverages Large Language Models (LLMs) for explicit reasoning. It targets researchers and developers working on embodied AI and aims to demonstrate the planning and commonsense reasoning capabilities of LLMs in complex navigation tasks without task-specific training.
How It Works
NavGPT utilizes a purely LLM-based approach for zero-shot sequential action prediction in vision-and-language navigation (VLN). At each step, it processes textual descriptions of visual observations, navigation history, and available directions to reason about the agent's state and predict the next action. This method allows for high-level planning, including sub-goal decomposition, commonsense knowledge integration, landmark identification, progress tracking, and adaptive plan adjustments.
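The step loop can be pictured as a thin prompting wrapper around a chat model. The sketch below is illustrative only: the function name, prompt wording, and parsing are hypothetical, and NavGPT's actual prompt templates and LLM wrappers live under `nav_src/`.

```python
# Illustrative sketch of LLM-based zero-shot action prediction
# (not the repo's actual code; names and prompt text are hypothetical).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_next_action(instruction: str, history: list[str],
                        observation: str, candidates: list[str]) -> str:
    """One VLN step: describe the agent's state in text, ask for an action."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Navigation history: {'; '.join(history) if history else 'none'}\n"
        f"Current observation: {observation}\n"
        f"Navigable directions: {', '.join(candidates)}\n"
        "Reason step by step about your progress, then answer with exactly "
        "one of the directions above, or STOP if the goal is reached."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic choices keep trajectories reproducible
    )
    return response.choices[0].message.content.strip()
```

Looping this call until the model answers STOP (or a step budget runs out) yields the full zero-shot trajectory; no task-specific training is involved.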
Quick Start & Requirements
- Create and activate a conda environment (`conda create --name NavGPT python=3.9`, then `conda activate NavGPT`) and install dependencies (`pip install -r requirements.txt`).
- Download the data into the `datasets` directory. Preprocessing scripts are available in `nav_src/scripts`.
- Set your OpenAI API key as an environment variable (`export OPENAI_API_KEY={Your_Private_Openai_Key}`) or directly in your code (see the sketch after this list).
- Run the agent: `python NavGPT.py --llm_model_name gpt-4 --output_dir ../datasets/R2R/exprs/gpt-4-val-unseen --val_env_name R2R_val_unseen_instr`. GPT-3.5 can be used for a more economical test. The LLM interfaces are implemented in `nav_src/LLMs/`.
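As an alternative to the `export` line above, the key can be set in code. A minimal sketch, assuming only that `OPENAI_API_KEY` is the standard variable the `openai` client reads; it must be set before any client is constructed:

```python
import os

# Equivalent to the export line above: makes the key visible to any
# OpenAI client created later in the same process.
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your private key
```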
Highlighted Details
Maintenance & Community
The project is associated with AAAI 2024 and lists authors from the Australian Institute for Machine Learning and The Australian National University. Further community or maintenance details are not explicitly provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.
Limitations & Caveats
Zero-shot performance on R2R is noted to be below that of trained models. The authors point to adapting multi-modal inputs for LLMs, and to applying LLMs' explicit reasoning to benefit learning-based models, as directions for future development.
Last updated: 1 year ago · Inactive