NavGPT by GengzeZhou

LLM-based vision-and-language navigation agent

Created 1 year ago
268 stars

Top 95.7% on SourcePulse

Project Summary

NavGPT is an open-source implementation of a vision-and-language navigation agent that leverages Large Language Models (LLMs) for explicit reasoning. It targets researchers and developers working on embodied AI and aims to demonstrate the planning and commonsense reasoning capabilities of LLMs in complex navigation tasks without task-specific training.

How It Works

NavGPT utilizes a purely LLM-based approach for zero-shot sequential action prediction in vision-and-language navigation (VLN). At each step, it processes textual descriptions of visual observations, navigation history, and available directions to reason about the agent's state and predict the next action. This method allows for high-level planning, including sub-goal decomposition, commonsense knowledge integration, landmark identification, progress tracking, and adaptive plan adjustments.
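
To make the per-step loop concrete, here is a minimal Python sketch of zero-shot action prediction in this style: the instruction, a textual description of the current observation, the navigation history, and the candidate directions are packed into a prompt and an OpenAI chat model is asked for the next viewpoint. The function name, prompt wording, and observation format are illustrative assumptions, not the repository's actual code; it assumes the openai>=1.0 SDK and an OPENAI_API_KEY in the environment.

```python
from openai import OpenAI  # assumes the openai>=1.0 SDK; reads OPENAI_API_KEY from the environment

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a navigation agent. Given the instruction, the textual description of the "
    "current observation, the navigation history, and the candidate directions, reason "
    "step by step and then output the ID of the next viewpoint, or 'STOP' if finished."
)

def predict_next_action(instruction, observation, history, candidates, model="gpt-3.5-turbo"):
    """One zero-shot reasoning step: everything the agent knows is passed to the LLM as text."""
    user_prompt = (
        f"Instruction: {instruction}\n"
        f"Current observation: {observation}\n"
        f"History: {'; '.join(history) or 'none'}\n"
        f"Candidate viewpoints: {', '.join(candidates)}\n"
        "Thought:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.0,  # deterministic choices make trajectories easier to reproduce
    )
    return response.choices[0].message.content  # contains the reasoning plus the chosen viewpoint

# Illustrative single step; in NavGPT the observation text comes from the R2R environment.
print(predict_next_action(
    instruction="Walk past the sofa and stop at the kitchen door.",
    observation="heading 90: a living room with a grey sofa; heading 180: a hallway",
    history=[],
    candidates=["viewpoint_12", "viewpoint_07", "STOP"],
))
```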

Quick Start & Requirements

  • Installation: Create a conda environment (conda create --name NavGPT python=3.9, conda activate NavGPT) and install dependencies (pip install -r requirements.txt).
  • Data: Download R2R data and place it in the datasets directory. Preprocessing scripts are available in nav_src/scripts.
  • OpenAI API Key: Set your OpenAI API key as an environment variable (export OPENAI_API_KEY={Your_Private_Openai_Key}) or directly in your code, as shown in the sketch after this list.
  • Reproducing Results: Use GPT-4 with python NavGPT.py --llm_model_name gpt-4 --output_dir ../datasets/R2R/exprs/gpt-4-val-unseen --val_env_name R2R_val_unseen_instr. GPT-3.5 can be substituted for a cheaper test run.
  • Custom LLMs: Add custom LLM repositories as submodules under nav_src/LLMs/.
  • Documentation: the README describes the repository as the official implementation of NavGPT (AAAI 2024).
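
If you prefer setting the key in code rather than via export, a minimal sketch using only the standard library is below; replace the placeholder with your own key and keep it out of version control.

```python
import os

# Equivalent to `export OPENAI_API_KEY=...`; set this before any OpenAI client is created.
os.environ["OPENAI_API_KEY"] = "{Your_Private_Openai_Key}"  # placeholder, not a real key
```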

Highlighted Details

  • Demonstrates explicit high-level planning capabilities of LLMs in VLN.
  • Capable of decomposing instructions, integrating commonsense, identifying landmarks, and adapting plans.
  • Can generate navigational instructions and top-down trajectories from observations and actions.
  • Performs the R2R task zero-shot, though results fall short of trained models.

Maintenance & Community

The project is associated with AAAI 2024 and lists authors from the Australian Institute for Machine Learning and The Australian National University. Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

Zero-shot performance on R2R is below that of trained models. The authors point to adapting multi-modal inputs for LLMs and applying LLM reasoning to benefit learning-based models as directions for future work.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
