LLM-based vision-and-language navigation agent
Top 95.7% on SourcePulse
NavGPT is an open-source implementation of a vision-and-language navigation agent that leverages Large Language Models (LLMs) for explicit reasoning. It targets researchers and developers working on embodied AI and aims to demonstrate the planning and commonsense reasoning capabilities of LLMs in complex navigation tasks without task-specific training.
How It Works
NavGPT utilizes a purely LLM-based approach for zero-shot sequential action prediction in vision-and-language navigation (VLN). At each step, it processes textual descriptions of visual observations, navigation history, and available directions to reason about the agent's state and predict the next action. This method allows for high-level planning, including sub-goal decomposition, commonsense knowledge integration, landmark identification, progress tracking, and adaptive plan adjustments.
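The step loop can be pictured as a thin prompting wrapper around a chat model. The sketch below is illustrative only: the function name, prompt wording, and parsing are hypothetical, and NavGPT's actual prompt templates and LLM wrappers live under `nav_src/`.

```python
# Illustrative sketch of LLM-based zero-shot action prediction
# (not the repo's actual code; names and prompt text are hypothetical).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_next_action(instruction: str, history: list[str],
                        observation: str, candidates: list[str]) -> str:
    """One VLN step: describe the agent's state in text, ask for an action."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Navigation history: {'; '.join(history) if history else 'none'}\n"
        f"Current observation: {observation}\n"
        f"Navigable directions: {', '.join(candidates)}\n"
        "Reason step by step about your progress, then answer with exactly "
        "one of the directions above, or STOP if the goal is reached."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic choices keep trajectories reproducible
    )
    return response.choices[0].message.content.strip()
```

Looping this call until the model answers STOP (or a step budget runs out) yields the full zero-shot trajectory; no task-specific training is involved.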
Quick Start & Requirements
- Create and activate a conda environment (`conda create --name NavGPT python=3.9`, then `conda activate NavGPT`) and install dependencies (`pip install -r requirements.txt`).
- Download the data into the `datasets` directory. Preprocessing scripts are available in `nav_src/scripts`.
- Set your OpenAI API key as an environment variable (`export OPENAI_API_KEY={Your_Private_Openai_Key}`) or directly in your code (see the sketch after this list).
- Run the agent: `python NavGPT.py --llm_model_name gpt-4 --output_dir ../datasets/R2R/exprs/gpt-4-val-unseen --val_env_name R2R_val_unseen_instr`. GPT-3.5 can be used for a more economical test. The LLM interfaces are implemented in `nav_src/LLMs/`.
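As an alternative to the `export` line above, the key can be set in code. A minimal sketch, assuming only that `OPENAI_API_KEY` is the standard variable the `openai` client reads; it must be set before any client is constructed:

```python
import os

# Equivalent to the export line above: makes the key visible to any
# OpenAI client created later in the same process.
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your private key
```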
Highlighted Details
Maintenance & Community
The project is associated with AAAI 2024 and lists authors from the Australian Institute for Machine Learning and The Australian National University. Further community or maintenance details are not explicitly provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.
Limitations & Caveats
Zero-shot performance on R2R is noted to be below that of trained models. The authors point to adapting multi-modal inputs for LLMs, and to applying LLMs' explicit reasoning to benefit learning-based models, as directions for future development.
Last updated: 1 year ago · Inactive